Linux-SCSI Archive on lore.kernel.org
* [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
@ 2020-06-10 17:29 John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 01/12] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
                   ` (12 more replies)
  0 siblings, 13 replies; 123+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

Hi all,

Here is v7 of the patchset.

In this version of the series, we keep the shared sbitmap for driver tags,
and introduce changes to fix up the tag budgeting across request queues
(and associated debugfs changes).

Some performance figures:

Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
but it is not always an appropriate scheduler to use.

Tag depth 		4000 (default)			260**

Baseline:
none sched:		2290K IOPS			894K
mq-deadline sched:	2341K IOPS			2313K

Final, host_tagset=0 in LLDD*
none sched:		2289K IOPS			703K
mq-deadline sched:	2337K IOPS			2291K

Final:
none sched:		2281K IOPS			1101K
mq-deadline sched:	2322K IOPS			1278K

* this is relevant as it shows the performance when the feature is
  supported but not enabled
** depth=260 is relevant as a point where we are regularly waiting for
   tags to be available. Figures were a bit unstable here during testing.

A copy of the patches can be found here:
https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7

And to progress this series, we need the following to go in first, when ready:
https://lore.kernel.org/linux-scsi/20200430131904.5847-1-hare@suse.de/

Comments welcome, thanks!

Differences to v6:
- tag budgeting across request queues and associated changes
- add any reviewed tags
- rebase
- I did not include any change related to updating shared sbitmap per-hctx
  wait pointer, based on lack of evidence of performance improvement. This
  was discussed here originally:
  https://lore.kernel.org/linux-scsi/ecaeccf029c6fe377ebd4f30f04df9f1@mail.gmail.com/
  I may revisit.

Differences to v5:
- For now, drop the shared scheduler tags
- Fix megaraid SAS queue selection and rebase
- Omit minor unused arguments patch, which has now been merged
- Add separate patch to introduce sbitmap pointer
- Fixed hctx_tags_bitmap_show() for shared sbitmap
- Various tidying

Hannes Reinecke (5):
  blk-mq: rename blk_mq_update_tag_set_depth()
  scsi: Add template flag 'host_tagset'
  megaraid_sas: switch fusion adapters to MQ
  smartpqi: enable host tagset
  hpsa: enable host_tagset and switch to MQ

John Garry (6):
  blk-mq: Use pointers for blk_mq_tags bitmap tags
  blk-mq: Facilitate a shared sbitmap per tagset
  blk-mq: Record nr_active_requests per queue for when using shared
    sbitmap
  blk-mq: Record active_queues_shared_sbitmap per tag_set for when using
    shared sbitmap
  blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  scsi: hisi_sas: Switch v3 hw to MQ

Ming Lei (1):
  blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED

 block/bfq-iosched.c                         |   4 +-
 block/blk-core.c                            |   2 +
 block/blk-mq-debugfs.c                      | 120 ++++++++++++++-
 block/blk-mq-tag.c                          | 157 ++++++++++++++------
 block/blk-mq-tag.h                          |  21 ++-
 block/blk-mq.c                              |  64 +++++---
 block/blk-mq.h                              |  33 +++-
 block/kyber-iosched.c                       |   4 +-
 drivers/scsi/hisi_sas/hisi_sas.h            |   3 +-
 drivers/scsi/hisi_sas/hisi_sas_main.c       |  36 ++---
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c      |  87 +++++------
 drivers/scsi/hpsa.c                         |  44 +-----
 drivers/scsi/hpsa.h                         |   1 -
 drivers/scsi/megaraid/megaraid_sas.h        |   1 -
 drivers/scsi/megaraid/megaraid_sas_base.c   |  59 +++-----
 drivers/scsi/megaraid/megaraid_sas_fusion.c |  24 +--
 drivers/scsi/scsi_lib.c                     |   2 +
 drivers/scsi/smartpqi/smartpqi_init.c       |  38 +++--
 include/linux/blk-mq.h                      |   9 +-
 include/linux/blkdev.h                      |   3 +
 include/scsi/scsi_host.h                    |   6 +-
 21 files changed, 463 insertions(+), 255 deletions(-)

-- 
2.26.2



* [PATCH RFC v7 01/12] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth() John Garry
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 123+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

From: Ming Lei <ming.lei@redhat.com>

BLK_MQ_F_TAG_SHARED actually means that tags are shared among request
queues, all of which should belong to LUNs attached to the same HBA.

So rename it to make the point explicit.
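
As a rough illustration of what "queue shared" means here, the following
is a minimal user-space sketch, not kernel code. The toy_* names and the
queue counter are hypothetical stand-ins; the sketch only mirrors the
transitions visible in blk_mq_add_queue_tag_set() and
blk_mq_del_queue_tag_set(), where the flag is set when a second request
queue joins the tag set and cleared when only one remains.

```c
#include <stdbool.h>

/* Hypothetical flag value and types, for illustration only. */
#define TOY_F_TAG_QUEUE_SHARED (1u << 1)

struct toy_tag_set {
	unsigned int flags;
	unsigned int nr_queues;	/* request queues currently using the set */
};

/* Adding a queue: the flag is set on the transition from 1 to 2 queues. */
static inline void toy_add_queue(struct toy_tag_set *set)
{
	if (set->nr_queues >= 1)
		set->flags |= TOY_F_TAG_QUEUE_SHARED;
	set->nr_queues++;
}

/* Removing a queue: the flag is cleared once a single queue remains. */
static inline void toy_del_queue(struct toy_tag_set *set)
{
	if (set->nr_queues == 0)
		return;
	set->nr_queues--;
	if (set->nr_queues <= 1)
		set->flags &= ~TOY_F_TAG_QUEUE_SHARED;
}

/* True while tags are shared among more than one request queue. */
static inline bool toy_queue_shared(const struct toy_tag_set *set)
{
	return (set->flags & TOY_F_TAG_QUEUE_SHARED) != 0;
}
```

The point of the sketch is that the flag describes sharing among request
queues, which is independent of any sharing among hardware contexts.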

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-debugfs.c |  2 +-
 block/blk-mq-tag.c     |  2 +-
 block/blk-mq-tag.h     |  4 ++--
 block/blk-mq.c         | 20 ++++++++++----------
 include/linux/blk-mq.h |  2 +-
 5 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 15df3a36e9fa..52d11f8422a7 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -237,7 +237,7 @@ static const char *const alloc_policy_name[] = {
 #define HCTX_FLAG_NAME(name) [ilog2(BLK_MQ_F_##name)] = #name
 static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(SHOULD_MERGE),
-	HCTX_FLAG_NAME(TAG_SHARED),
+	HCTX_FLAG_NAME(TAG_QUEUE_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
 	HCTX_FLAG_NAME(STACKING),
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 96a39d0724a2..85aa1690cbcf 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -65,7 +65,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 {
 	unsigned int depth, users;
 
-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
+	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return true;
 	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
 		return true;
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index d38e48f2a0a4..c810a346db8e 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -56,7 +56,7 @@ extern void __blk_mq_tag_idle(struct blk_mq_hw_ctx *);
 
 static inline bool blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return false;
 
 	return __blk_mq_tag_busy(hctx);
@@ -64,7 +64,7 @@ static inline bool blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 
 static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return;
 
 	__blk_mq_tag_idle(hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9a36ac1c1fa1..d255c485ca5f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -281,7 +281,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		rq->tag = BLK_MQ_NO_TAG;
 		rq->internal_tag = tag;
 	} else {
-		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
+		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
 			rq_flags = RQF_MQ_INFLIGHT;
 			atomic_inc(&data->hctx->nr_active);
 		}
@@ -1116,7 +1116,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 	wait_queue_entry_t *wait;
 	bool ret;
 
-	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED)) {
+	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
 
 		/*
@@ -1282,7 +1282,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 				 * For non-shared tags, the RESTART check
 				 * will suffice.
 				 */
-				if (hctx->flags & BLK_MQ_F_TAG_SHARED)
+				if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
 					no_tag = true;
 				break;
 			}
@@ -2579,7 +2579,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	spin_lock_init(&hctx->lock);
 	INIT_LIST_HEAD(&hctx->dispatch);
 	hctx->queue = q;
-	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
+	hctx->flags = set->flags & ~BLK_MQ_F_TAG_QUEUE_SHARED;
 
 	INIT_LIST_HEAD(&hctx->hctx_list);
 
@@ -2796,9 +2796,9 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (shared)
-			hctx->flags |= BLK_MQ_F_TAG_SHARED;
+			hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		else
-			hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
+			hctx->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 	}
 }
 
@@ -2824,7 +2824,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 	list_del_rcu(&q->tag_set_list);
 	if (list_is_singular(&set->tag_list)) {
 		/* just transitioned to unshared */
-		set->flags &= ~BLK_MQ_F_TAG_SHARED;
+		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
 		blk_mq_update_tag_set_depth(set, false);
 	}
@@ -2841,12 +2841,12 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
 	 */
 	if (!list_empty(&set->tag_list) &&
-	    !(set->flags & BLK_MQ_F_TAG_SHARED)) {
-		set->flags |= BLK_MQ_F_TAG_SHARED;
+	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
+		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
 		blk_mq_update_tag_set_depth(set, true);
 	}
-	if (set->flags & BLK_MQ_F_TAG_SHARED)
+	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
 		queue_set_hctx_shared(q, true);
 	list_add_tail_rcu(&q->tag_set_list, &set->tag_list);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d6fcae17da5a..233209e8030d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -392,7 +392,7 @@ struct blk_mq_ops {
 
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
-	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_TAG_QUEUE_SHARED = 1 << 1,
 	/*
 	 * Set when this device requires underlying blk-mq device for
 	 * completing IO:
-- 
2.26.2



* [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 01/12] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-11  2:57   ` Ming Lei
  2020-06-10 17:29 ` [PATCH RFC v7 03/12] blk-mq: Use pointers for blk_mq_tags bitmap tags John Garry
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, Hannes Reinecke, John Garry

From: Hannes Reinecke <hare@suse.de>

The function does not set the depth, but rather transitions from
shared to non-shared queues and vice versa.
So rename it to blk_mq_update_tag_set_shared() to better reflect
its purpose.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-tag.c | 18 ++++++++++--------
 block/blk-mq.c     |  8 ++++----
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 85aa1690cbcf..bedddf168253 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -454,24 +454,22 @@ static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
 				       node);
 }
 
-static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
-						   int node, int alloc_policy)
+static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
+				   int node, int alloc_policy)
 {
 	unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
 	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
 
 	if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node))
-		goto free_tags;
+		return -ENOMEM;
 	if (bt_alloc(&tags->breserved_tags, tags->nr_reserved_tags, round_robin,
 		     node))
 		goto free_bitmap_tags;
 
-	return tags;
+	return 0;
 free_bitmap_tags:
 	sbitmap_queue_free(&tags->bitmap_tags);
-free_tags:
-	kfree(tags);
-	return NULL;
+	return -ENOMEM;
 }
 
 struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
@@ -492,7 +490,11 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
 
-	return blk_mq_init_bitmap_tags(tags, node, alloc_policy);
+	if (blk_mq_init_bitmap_tags(tags, node, alloc_policy) < 0) {
+		kfree(tags);
+		tags = NULL;
+	}
+	return tags;
 }
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d255c485ca5f..c20d75c851f2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2802,8 +2802,8 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 	}
 }
 
-static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set,
-					bool shared)
+static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
+					 bool shared)
 {
 	struct request_queue *q;
 
@@ -2826,7 +2826,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 		/* just transitioned to unshared */
 		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
-		blk_mq_update_tag_set_depth(set, false);
+		blk_mq_update_tag_set_shared(set, false);
 	}
 	mutex_unlock(&set->tag_list_lock);
 	INIT_LIST_HEAD(&q->tag_set_list);
@@ -2844,7 +2844,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
 		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
-		blk_mq_update_tag_set_depth(set, true);
+		blk_mq_update_tag_set_shared(set, true);
 	}
 	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
 		queue_set_hctx_shared(q, true);
-- 
2.26.2



* [PATCH RFC v7 03/12] blk-mq: Use pointers for blk_mq_tags bitmap tags
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 01/12] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth() John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset John Garry
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 123+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
with the goal of later being able to use a common shared tag bitmap across
all HW contexts in a set.
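
The indirection can be pictured with the following user-space sketch. It
is only an illustration under assumptions: struct toy_bitmap stands in
for struct sbitmap_queue, and the member name embedded_bitmap is a
hypothetical replacement for the __bitmap_tags member added by this
patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct sbitmap_queue; not the kernel type. */
struct toy_bitmap {
	unsigned int depth;
};

/* Callers dereference the pointer; the storage sits behind it. */
struct toy_tags {
	struct toy_bitmap *bitmap_tags;		/* what all users dereference */
	struct toy_bitmap embedded_bitmap;	/* private per-hctx storage */
};

/* Default setup: the pointer refers to the embedded bitmap. */
static inline void toy_tags_init(struct toy_tags *tags, unsigned int depth)
{
	tags->embedded_bitmap.depth = depth;
	tags->bitmap_tags = &tags->embedded_bitmap;
}

/* Later redirect: point at a bitmap owned elsewhere, e.g. by the set. */
static inline void toy_tags_use_shared(struct toy_tags *tags,
				       struct toy_bitmap *shared)
{
	tags->bitmap_tags = shared;
}

/* True while the tags still use their own embedded bitmap. */
static inline bool toy_tags_is_private(const struct toy_tags *tags)
{
	return tags->bitmap_tags == &tags->embedded_bitmap;
}
```

Because every user goes through the pointer, switching a set of tags to
a common bitmap needs no change at the call sites.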

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/bfq-iosched.c    |  4 ++--
 block/blk-mq-debugfs.c |  8 ++++----
 block/blk-mq-tag.c     | 41 ++++++++++++++++++++++-------------------
 block/blk-mq-tag.h     |  7 +++++--
 block/blk-mq.c         |  4 ++--
 block/kyber-iosched.c  |  4 ++--
 6 files changed, 37 insertions(+), 31 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 50c8f034c01c..a1123d4d586d 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6372,8 +6372,8 @@ static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
 	struct blk_mq_tags *tags = hctx->sched_tags;
 	unsigned int min_shallow;
 
-	min_shallow = bfq_update_depths(bfqd, &tags->bitmap_tags);
-	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, min_shallow);
+	min_shallow = bfq_update_depths(bfqd, tags->bitmap_tags);
+	sbitmap_queue_min_shallow_depth(tags->bitmap_tags, min_shallow);
 }
 
 static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 52d11f8422a7..a400b6698dff 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -450,11 +450,11 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
 		   atomic_read(&tags->active_queues));
 
 	seq_puts(m, "\nbitmap_tags:\n");
-	sbitmap_queue_show(&tags->bitmap_tags, m);
+	sbitmap_queue_show(tags->bitmap_tags, m);
 
 	if (tags->nr_reserved_tags) {
 		seq_puts(m, "\nbreserved_tags:\n");
-		sbitmap_queue_show(&tags->breserved_tags, m);
+		sbitmap_queue_show(tags->breserved_tags, m);
 	}
 }
 
@@ -485,7 +485,7 @@ static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
 	if (res)
 		goto out;
 	if (hctx->tags)
-		sbitmap_bitmap_show(&hctx->tags->bitmap_tags.sb, m);
+		sbitmap_bitmap_show(&hctx->tags->bitmap_tags->sb, m);
 	mutex_unlock(&q->sysfs_lock);
 
 out:
@@ -519,7 +519,7 @@ static int hctx_sched_tags_bitmap_show(void *data, struct seq_file *m)
 	if (res)
 		goto out;
 	if (hctx->sched_tags)
-		sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags.sb, m);
+		sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags->sb, m);
 	mutex_unlock(&q->sysfs_lock);
 
 out:
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index bedddf168253..be39db3c88d7 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -35,9 +35,9 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
  */
 void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
 {
-	sbitmap_queue_wake_all(&tags->bitmap_tags);
+	sbitmap_queue_wake_all(tags->bitmap_tags);
 	if (include_reserve)
-		sbitmap_queue_wake_all(&tags->breserved_tags);
+		sbitmap_queue_wake_all(tags->breserved_tags);
 }
 
 /*
@@ -113,10 +113,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 			WARN_ON_ONCE(1);
 			return BLK_MQ_NO_TAG;
 		}
-		bt = &tags->breserved_tags;
+		bt = tags->breserved_tags;
 		tag_offset = 0;
 	} else {
-		bt = &tags->bitmap_tags;
+		bt = tags->bitmap_tags;
 		tag_offset = tags->nr_reserved_tags;
 	}
 
@@ -162,9 +162,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 						data->ctx);
 		tags = blk_mq_tags_from_data(data);
 		if (data->flags & BLK_MQ_REQ_RESERVED)
-			bt = &tags->breserved_tags;
+			bt = tags->breserved_tags;
 		else
-			bt = &tags->bitmap_tags;
+			bt = tags->bitmap_tags;
 
 		/*
 		 * If destination hw queue is changed, fake wake up on
@@ -198,10 +198,10 @@ void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
 		const int real_tag = tag - tags->nr_reserved_tags;
 
 		BUG_ON(real_tag >= tags->nr_tags);
-		sbitmap_queue_clear(&tags->bitmap_tags, real_tag, ctx->cpu);
+		sbitmap_queue_clear(tags->bitmap_tags, real_tag, ctx->cpu);
 	} else {
 		BUG_ON(tag >= tags->nr_reserved_tags);
-		sbitmap_queue_clear(&tags->breserved_tags, tag, ctx->cpu);
+		sbitmap_queue_clear(tags->breserved_tags, tag, ctx->cpu);
 	}
 }
 
@@ -325,9 +325,9 @@ static void __blk_mq_all_tag_iter(struct blk_mq_tags *tags,
 	WARN_ON_ONCE(flags & BT_TAG_ITER_RESERVED);
 
 	if (tags->nr_reserved_tags)
-		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv,
+		bt_tags_for_each(tags, tags->breserved_tags, fn, priv,
 				 flags | BT_TAG_ITER_RESERVED);
-	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, flags);
+	bt_tags_for_each(tags, tags->bitmap_tags, fn, priv, flags);
 }
 
 /**
@@ -441,8 +441,8 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 			continue;
 
 		if (tags->nr_reserved_tags)
-			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
-		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
+			bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
+		bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
 	}
 	blk_queue_exit(q);
 }
@@ -460,15 +460,18 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
 	unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
 	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
 
-	if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node))
+	if (bt_alloc(&tags->__bitmap_tags, depth, round_robin, node))
 		return -ENOMEM;
-	if (bt_alloc(&tags->breserved_tags, tags->nr_reserved_tags, round_robin,
-		     node))
+	if (bt_alloc(&tags->__breserved_tags, tags->nr_reserved_tags,
+		     round_robin, node))
 		goto free_bitmap_tags;
 
+	tags->bitmap_tags = &tags->__bitmap_tags;
+	tags->breserved_tags = &tags->__breserved_tags;
+
 	return 0;
 free_bitmap_tags:
-	sbitmap_queue_free(&tags->bitmap_tags);
+	sbitmap_queue_free(&tags->__bitmap_tags);
 	return -ENOMEM;
 }
 
@@ -499,8 +502,8 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
 {
-	sbitmap_queue_free(&tags->bitmap_tags);
-	sbitmap_queue_free(&tags->breserved_tags);
+	sbitmap_queue_free(&tags->__bitmap_tags);
+	sbitmap_queue_free(&tags->__breserved_tags);
 	kfree(tags);
 }
 
@@ -550,7 +553,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 		 * Don't need (or can't) update reserved tags here, they
 		 * remain static and should never need resizing.
 		 */
-		sbitmap_queue_resize(&tags->bitmap_tags,
+		sbitmap_queue_resize(tags->bitmap_tags,
 				tdepth - tags->nr_reserved_tags);
 	}
 
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index c810a346db8e..cebf7a4b280a 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -13,8 +13,11 @@ struct blk_mq_tags {
 
 	atomic_t active_queues;
 
-	struct sbitmap_queue bitmap_tags;
-	struct sbitmap_queue breserved_tags;
+	struct sbitmap_queue *bitmap_tags;
+	struct sbitmap_queue *breserved_tags;
+
+	struct sbitmap_queue __bitmap_tags;
+	struct sbitmap_queue __breserved_tags;
 
 	struct request **rqs;
 	struct request **static_rqs;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c20d75c851f2..90b645c3092c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1093,7 +1093,7 @@ static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
 		struct sbitmap_queue *sbq;
 
 		list_del_init(&wait->entry);
-		sbq = &hctx->tags->bitmap_tags;
+		sbq = hctx->tags->bitmap_tags;
 		atomic_dec(&sbq->ws_active);
 	}
 	spin_unlock(&hctx->dispatch_wait_lock);
@@ -1111,7 +1111,7 @@ static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
 static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 				 struct request *rq)
 {
-	struct sbitmap_queue *sbq = &hctx->tags->bitmap_tags;
+	struct sbitmap_queue *sbq = hctx->tags->bitmap_tags;
 	struct wait_queue_head *wq;
 	wait_queue_entry_t *wait;
 	bool ret;
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index a38c5ab103d1..075e99c207ef 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -359,7 +359,7 @@ static unsigned int kyber_sched_tags_shift(struct request_queue *q)
 	 * All of the hardware queues have the same depth, so we can just grab
 	 * the shift of the first one.
 	 */
-	return q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift;
+	return q->queue_hw_ctx[0]->sched_tags->bitmap_tags->sb.shift;
 }
 
 static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
@@ -502,7 +502,7 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
 	khd->batching = 0;
 
 	hctx->sched_data = khd;
-	sbitmap_queue_min_shallow_depth(&hctx->sched_tags->bitmap_tags,
+	sbitmap_queue_min_shallow_depth(hctx->sched_tags->bitmap_tags,
 					kqd->async_depth);
 
 	return 0;
-- 
2.26.2



* [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (2 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 03/12] blk-mq: Use pointers for blk_mq_tags bitmap tags John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-11  3:37   ` Ming Lei
  2020-06-10 17:29 ` [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap John Garry
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3, ...) support
multiple reply queues with single hostwide tags.

In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shut down. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").

However, to take advantage of that blk-mq feature, the HBA HW queues are
required to be mapped to the blk-mq hctxs; to do that, the HBA HW queues
need to be exposed to the upper layer.

In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.

However, another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queues) commands. In commit 6eb045e092ef
("scsi: core: avoid host-wide host_busy counter for scsi_mq"), the Scsi
host busy counter was removed, which would stop the LLDD being sent more
than .can_queue commands; however, it should still be ensured that the
block layer does not issue more than .can_queue commands to the Scsi host.
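
The difference in the bound can be written out as a small sketch. The
helper name and its parameters are hypothetical, chosen only to make the
arithmetic concrete.

```c
/*
 * Hypothetical helper: upper bound on the commands the block layer may
 * have outstanding against one host.
 */
static inline unsigned int toy_max_outstanding(unsigned int can_queue,
					       unsigned int nr_hw_queues,
					       int hostwide_tags)
{
	/* Per-hctx tag spaces: each hw queue can hand out can_queue tags. */
	if (!hostwide_tags)
		return can_queue * nr_hw_queues;
	/* One tag space for the whole host: bounded by can_queue. */
	return can_queue;
}
```

For example, with an assumed can_queue of 4000 and 16 hw queues, the
per-hctx bound is 64000 commands, while a hostwide tag space keeps it at
4000.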

To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.

New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.

Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
are still allocated per hctx; the reason for this is that if tags and
requests were only allocated for a single hctx - like hctx0 - it may break
block drivers which expect a request to be associated with a specific
hctx, i.e. not always hctx0. This will introduce extra memory usage.
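
To make the effect concrete, here is a minimal user-space sketch, not
the kernel implementation. All toy_* names, the single-word bitmap, and
the constants are assumptions for illustration. Each hctx keeps its own
tags structure, but when sharing is requested every bitmap_tags pointer
is redirected to one bitmap owned by the set, so allocations from any
hctx draw on the same pool.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_HCTX 4
#define TOY_DEPTH   8	/* stands in for Scsi_Host.can_queue */

/* Hypothetical single-word bitmap; the kernel uses struct sbitmap_queue. */
struct toy_sbitmap {
	uint32_t word;
	unsigned int depth;
};

struct toy_tags {
	struct toy_sbitmap *bitmap_tags;	/* pointer used by allocators */
	struct toy_sbitmap own;			/* per-hctx storage */
};

struct toy_tag_set {
	struct toy_sbitmap shared;		/* per-set storage */
	struct toy_tags tags[TOY_NR_HCTX];
};

/* Set up every hctx; redirect to the per-set bitmap when sharing. */
static inline void toy_set_init(struct toy_tag_set *set, bool hctx_shared)
{
	set->shared.word = 0;
	set->shared.depth = TOY_DEPTH;
	for (unsigned int i = 0; i < TOY_NR_HCTX; i++) {
		struct toy_tags *t = &set->tags[i];

		t->own.word = 0;
		t->own.depth = TOY_DEPTH;
		t->bitmap_tags = hctx_shared ? &set->shared : &t->own;
	}
}

/* Returns the first free tag, or -1 when the bitmap is exhausted. */
static inline int toy_get_tag(struct toy_tags *tags)
{
	struct toy_sbitmap *sb = tags->bitmap_tags;

	for (unsigned int bit = 0; bit < sb->depth; bit++) {
		if (!(sb->word & (UINT32_C(1) << bit))) {
			sb->word |= UINT32_C(1) << bit;
			return (int)bit;
		}
	}
	return -1;
}

/* Releases a tag previously returned by toy_get_tag(). */
static inline void toy_put_tag(struct toy_tags *tags, int tag)
{
	tags->bitmap_tags->word &= ~(UINT32_C(1) << tag);
}
```

In this sketch, with sharing enabled the hctxs together can hold at most
TOY_DEPTH tags, whereas without it each hctx can hold TOY_DEPTH on its
own.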

This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].

[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be

Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-tag.c     | 39 +++++++++++++++++++++++++++++++++++++--
 block/blk-mq-tag.h     | 10 +++++++++-
 block/blk-mq.c         | 24 +++++++++++++++++++++++-
 block/blk-mq.h         |  5 +++++
 include/linux/blk-mq.h |  6 ++++++
 5 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index be39db3c88d7..92843e3e1a2a 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -228,7 +228,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
-	if (rq && rq->q == hctx->queue)
+	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
 		return iter_data->fn(hctx, rq, iter_data->data, reserved);
 	return true;
 }
@@ -466,6 +466,7 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
 		     round_robin, node))
 		goto free_bitmap_tags;
 
+	/* We later overwrite these in case of per-set shared sbitmap */
 	tags->bitmap_tags = &tags->__bitmap_tags;
 	tags->breserved_tags = &tags->__breserved_tags;
 
@@ -475,7 +476,32 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
-struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
+bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set)
+{
+	unsigned int depth = tag_set->queue_depth - tag_set->reserved_tags;
+	int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(tag_set->flags);
+	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
+	int node = tag_set->numa_node;
+
+	if (bt_alloc(&tag_set->__bitmap_tags, depth, round_robin, node))
+		return false;
+	if (bt_alloc(&tag_set->__breserved_tags, tag_set->reserved_tags,
+		     round_robin, node))
+		goto free_bitmap_tags;
+	return true;
+free_bitmap_tags:
+	sbitmap_queue_free(&tag_set->__bitmap_tags);
+	return false;
+}
+
+void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *tag_set)
+{
+	sbitmap_queue_free(&tag_set->__bitmap_tags);
+	sbitmap_queue_free(&tag_set->__breserved_tags);
+}
+
+struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
+				     unsigned int total_tags,
 				     unsigned int reserved_tags,
 				     int node, int alloc_policy)
 {
@@ -502,6 +528,10 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
 {
+	/*
+	 * Do not free tags->{bitmap, breserved}_tags, as this may point to
+	 * shared sbitmap
+	 */
 	sbitmap_queue_free(&tags->__bitmap_tags);
 	sbitmap_queue_free(&tags->__breserved_tags);
 	kfree(tags);
@@ -560,6 +590,11 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 	return 0;
 }
 
+void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set, unsigned int size)
+{
+	sbitmap_queue_resize(&set->__bitmap_tags, size - set->reserved_tags);
+}
+
 /**
  * blk_mq_unique_tag() - return a tag that is unique queue-wide
  * @rq: request for which to compute a unique tag
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index cebf7a4b280a..cf39dd13a24d 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -25,7 +25,12 @@ struct blk_mq_tags {
 };
 
 
-extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
+extern bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set);
+extern void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *tag_set);
+extern struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *tag_set,
+					    unsigned int nr_tags,
+					    unsigned int reserved_tags,
+					    int node, int alloc_policy);
 extern void blk_mq_free_tags(struct blk_mq_tags *tags);
 
 extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
@@ -34,6 +39,9 @@ extern void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
 extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 					struct blk_mq_tags **tags,
 					unsigned int depth, bool can_grow);
+extern void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set,
+					     unsigned int size);
+
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 90b645c3092c..77120dd4e4d5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2229,7 +2229,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 
-	tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
+	tags = blk_mq_init_tags(set, nr_tags, reserved_tags, node,
 				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
 	if (!tags)
 		return NULL;
@@ -3349,11 +3349,28 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (ret)
 		goto out_free_mq_map;
 
+	if (blk_mq_is_sbitmap_shared(set)) {
+		if (!blk_mq_init_shared_sbitmap(set)) {
+			ret = -ENOMEM;
+			goto out_free_mq_rq_maps;
+		}
+
+		for (i = 0; i < set->nr_hw_queues; i++) {
+			struct blk_mq_tags *tags = set->tags[i];
+
+			tags->bitmap_tags = &set->__bitmap_tags;
+			tags->breserved_tags = &set->__breserved_tags;
+		}
+	}
+
 	mutex_init(&set->tag_list_lock);
 	INIT_LIST_HEAD(&set->tag_list);
 
 	return 0;
 
+out_free_mq_rq_maps:
+	for (i = 0; i < set->nr_hw_queues; i++)
+		blk_mq_free_rq_map(set->tags[i]);
 out_free_mq_map:
 	for (i = 0; i < set->nr_maps; i++) {
 		kfree(set->map[i].mq_map);
@@ -3372,6 +3389,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 	for (i = 0; i < set->nr_hw_queues; i++)
 		blk_mq_free_map_and_requests(set, i);
 
+	if (blk_mq_is_sbitmap_shared(set))
+		blk_mq_exit_shared_sbitmap(set);
+
 	for (j = 0; j < set->nr_maps; j++) {
 		kfree(set->map[j].mq_map);
 		set->map[j].mq_map = NULL;
@@ -3408,6 +3428,8 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 		if (!hctx->sched_tags) {
 			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
 							false);
+			if (!ret && blk_mq_is_sbitmap_shared(set))
+				blk_mq_tag_resize_shared_sbitmap(set, nr);
 		} else {
 			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
 							nr, true);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index a139b0631817..1a283c707215 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -158,6 +158,11 @@ struct blk_mq_alloc_data {
 	struct blk_mq_hw_ctx *hctx;
 };
 
+static inline bool blk_mq_is_sbitmap_shared(struct blk_mq_tag_set *tag_set)
+{
+	return tag_set->flags & BLK_MQ_F_TAG_HCTX_SHARED;
+}
+
 static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
 {
 	if (data->flags & BLK_MQ_REQ_INTERNAL)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 233209e8030d..7b31cdb92a71 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -231,6 +231,9 @@ enum hctx_type {
  * @flags:	   Zero or more BLK_MQ_F_* flags.
  * @driver_data:   Pointer to data owned by the block driver that created this
  *		   tag set.
+ * @__bitmap_tags: A shared tags sbitmap, used over all hctx's
+ * @__breserved_tags:
+ *		   A shared reserved tags sbitmap, used over all hctx's
  * @tags:	   Tag sets. One tag set per hardware queue. Has @nr_hw_queues
  *		   elements.
  * @tag_list_lock: Serializes tag_list accesses.
@@ -250,6 +253,8 @@ struct blk_mq_tag_set {
 	unsigned int		flags;
 	void			*driver_data;
 
+	struct sbitmap_queue	__bitmap_tags;
+	struct sbitmap_queue	__breserved_tags;
 	struct blk_mq_tags	**tags;
 
 	struct mutex		tag_list_lock;
@@ -398,6 +403,7 @@ enum {
 	 * completing IO:
 	 */
 	BLK_MQ_F_STACKING	= 1 << 2,
+	BLK_MQ_F_TAG_HCTX_SHARED = 1 << 3,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
-- 
2.26.2


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (3 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-11  4:04   ` Ming Lei
  2020-06-10 17:29 ` [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set " John Garry
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

The per-hctx nr_active value can no longer be used to fairly assign a share
of tag depth per request queue when using a shared sbitmap, as it does not
account for the tags being shared over all hctx's.

For this case, record the nr_active_requests per request_queue, and make
the judgment based on that value.

Also introduce a separate per-hctx blk_mq_debugfs_attr array for the shared
sbitmap case, omitting hctx_active_show() (as blk_mq_hw_ctx.nr_active is no
longer maintained when using a shared sbitmap) and other entries which can
be added back once they are revised for use with a shared sbitmap.

Co-developed-with: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
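For illustration only (not part of the patch): a minimal userspace sketch
of the fair-share check described above, assuming C11 atomics in place of
the kernel's atomic_t. The struct and function names (model_queue,
fair_share_depth, model_may_queue) are hypothetical; only the rounding
expression is taken from hctx_may_queue().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a request queue; not a kernel type. */
struct model_queue {
	/* In-flight requests for this queue, summed over all hctx's. */
	atomic_int nr_active_requests;
};

/* Same rounding as hctx_may_queue(): ceil(depth / users), at least 4. */
static unsigned int fair_share_depth(unsigned int depth, unsigned int users)
{
	unsigned int share = (depth + users - 1) / users;

	return share > 4U ? share : 4U;
}

/* May this queue take one more tag from the shared bitmap? */
static bool model_may_queue(struct model_queue *q, unsigned int depth,
			    unsigned int users)
{
	/* Nothing to divide, or nobody to share with. */
	if (depth == 1 || users == 0)
		return true;

	return (unsigned int)atomic_load(&q->nr_active_requests) <
	       fair_share_depth(depth, users);
}
```

With a depth of 260 shared by 12 active request queues (assumed figures),
each queue is limited to 22 in-flight requests regardless of how they are
spread over the hctx's.
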
 block/blk-core.c       |  2 ++
 block/blk-mq-debugfs.c | 23 ++++++++++++++++++++++-
 block/blk-mq-tag.c     | 10 ++++++----
 block/blk-mq.c         |  6 +++---
 block/blk-mq.h         | 28 +++++++++++++++++++++++++++-
 include/linux/blkdev.h |  2 ++
 6 files changed, 62 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 03252af8c82c..c622453c1363 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -529,6 +529,8 @@ struct request_queue *__blk_alloc_queue(int node_id)
 	q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
 	q->node = node_id;
 
+	atomic_set(&q->nr_active_requests_shared_sbitmap, 0);
+
 	timer_setup(&q->backing_dev_info->laptop_mode_wb_timer,
 		    laptop_mode_timer_fn, 0);
 	timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index a400b6698dff..0fa3af41ab65 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -796,6 +796,23 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
 	{},
 };
 
+static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs[] = {
+	{"state", 0400, hctx_state_show},
+	{"flags", 0400, hctx_flags_show},
+	{"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
+	{"busy", 0400, hctx_busy_show},
+	{"ctx_map", 0400, hctx_ctx_map_show},
+	{"sched_tags", 0400, hctx_sched_tags_show},
+	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
+	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
+	{"dispatched", 0600, hctx_dispatched_show, hctx_dispatched_write},
+	{"queued", 0600, hctx_queued_show, hctx_queued_write},
+	{"run", 0600, hctx_run_show, hctx_run_write},
+	{"active", 0400, hctx_active_show},
+	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
+	{}
+};
+
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
 	{"default_rq_list", 0400, .seq_ops = &ctx_default_rq_list_seq_ops},
 	{"read_rq_list", 0400, .seq_ops = &ctx_read_rq_list_seq_ops},
@@ -878,13 +895,17 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
 				  struct blk_mq_hw_ctx *hctx)
 {
 	struct blk_mq_ctx *ctx;
+	struct blk_mq_tag_set *set = q->tag_set;
 	char name[20];
 	int i;
 
 	snprintf(name, sizeof(name), "hctx%u", hctx->queue_num);
 	hctx->debugfs_dir = debugfs_create_dir(name, q->debugfs_dir);
 
-	debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs);
+	if (blk_mq_is_sbitmap_shared(set))
+		debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_shared_sbitmap_attrs);
+	else
+		debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs);
 
 	hctx_for_each_ctx(hctx, ctx, i)
 		blk_mq_debugfs_register_ctx(hctx, ctx);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 92843e3e1a2a..7db16e49f6f6 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -60,9 +60,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
  * For shared tag users, we track the number of currently active users
  * and attempt to provide a fair share of the tag depth for each of them.
  */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
+static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
 				  struct sbitmap_queue *bt)
 {
+	struct blk_mq_hw_ctx *hctx = data->hctx;
+	struct request_queue *q = data->q;
 	unsigned int depth, users;
 
 	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
@@ -84,15 +86,15 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 	 * Allow at least some tags
 	 */
 	depth = max((bt->sb.depth + users - 1) / users, 4U);
-	return atomic_read(&hctx->nr_active) < depth;
+	return __blk_mq_active_requests(hctx, q) < depth;
 }
 
 static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 			    struct sbitmap_queue *bt)
 {
 	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
-	    !hctx_may_queue(data->hctx, bt))
-		return BLK_MQ_NO_TAG;
+	    !hctx_may_queue(data, bt))
+		return -1;
 	if (data->shallow_depth)
 		return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
 	else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 77120dd4e4d5..0f7e062a1665 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -283,7 +283,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	} else {
 		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
 			rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
+			__blk_mq_inc_active_requests(data->hctx, data->q);
 		}
 		rq->tag = tag;
 		rq->internal_tag = BLK_MQ_NO_TAG;
@@ -527,7 +527,7 @@ void blk_mq_free_request(struct request *rq)
 
 	ctx->rq_completed[rq_is_sync(rq)]++;
 	if (rq->rq_flags & RQF_MQ_INFLIGHT)
-		atomic_dec(&hctx->nr_active);
+		__blk_mq_dec_active_requests(hctx, q);
 
 	if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
 		laptop_io_completion(q->backing_dev_info);
@@ -1073,7 +1073,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
 	if (rq->tag >= 0) {
 		if (shared) {
 			rq->rq_flags |= RQF_MQ_INFLIGHT;
-			atomic_inc(&data.hctx->nr_active);
+			__blk_mq_inc_active_requests(rq->mq_hctx, rq->q);
 		}
 		data.hctx->tags->rqs[rq->tag] = rq;
 	}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1a283c707215..9c1e612c2298 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -202,6 +202,32 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
+static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx,
+						struct request_queue *q)
+{
+	if (blk_mq_is_sbitmap_shared(q->tag_set))
+		atomic_inc(&q->nr_active_requests_shared_sbitmap);
+	else
+		atomic_inc(&hctx->nr_active);
+}
+
+static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx,
+						struct request_queue *q)
+{
+	if (blk_mq_is_sbitmap_shared(q->tag_set))
+		atomic_dec(&q->nr_active_requests_shared_sbitmap);
+	else
+		atomic_dec(&hctx->nr_active);
+}
+
+static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx,
+					   struct request_queue *q)
+{
+	if (blk_mq_is_sbitmap_shared(q->tag_set))
+		return atomic_read(&q->nr_active_requests_shared_sbitmap);
+	return atomic_read(&hctx->nr_active);
+}
+
 static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
 {
@@ -210,7 +236,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
 		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
-		atomic_dec(&hctx->nr_active);
+		__blk_mq_dec_active_requests(hctx, rq->q);
 	}
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8fd900998b4e..c536278bec9e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -488,6 +488,8 @@ struct request_queue {
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 
+	atomic_t		nr_active_requests_shared_sbitmap;
+
 	struct list_head	icq_list;
 #ifdef CONFIG_BLK_CGROUP
 	DECLARE_BITMAP		(blkcg_pols, BLKCG_MAX_POLS);
-- 
2.26.2


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (4 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-11 13:16   ` Hannes Reinecke
  2020-06-10 17:29 ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a " John Garry
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

When using a shared sbitmap, the number of active request queues per hctx
should no longer be relied on when judging how to share the tag bitmap.

Instead maintain the number of active request queues per tag_set, and make
the judgment based on that.

And since the blk_mq_tags.active_queues is no longer maintained, do not
show it in debugfs.

Originally-from: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
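For illustration only (not part of the patch): a minimal userspace sketch
of the per-tag_set accounting described above, assuming C11 atomics. The
type and function names are hypothetical; atomic_exchange() stands in for
test_and_set_bit()/test_and_clear_bit() on QUEUE_FLAG_HCTX_ACTIVE.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for blk_mq_tag_set. */
struct model_tag_set {
	atomic_int active_queues;	/* queues currently using the tags */
};

/* Hypothetical stand-in for request_queue. */
struct model_queue {
	struct model_tag_set *set;
	atomic_bool hctx_active;	/* role of QUEUE_FLAG_HCTX_ACTIVE */
};

/* Count the queue once, no matter how many of its hctx's become busy. */
static void model_tag_busy(struct model_queue *q)
{
	/* atomic_exchange() returns the previous value of the flag. */
	if (!atomic_exchange(&q->hctx_active, true))
		atomic_fetch_add(&q->set->active_queues, 1);
}

/* Drop the queue from the count only if it was marked active. */
static void model_tag_idle(struct model_queue *q)
{
	if (atomic_exchange(&q->hctx_active, false))
		atomic_fetch_sub(&q->set->active_queues, 1);
}
```

The point of keeping the flag per queue rather than per hctx is that a
queue submitting on several hctx's still counts as a single user of the
shared tags.
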
 block/blk-mq-debugfs.c | 25 ++++++++++++++++++++--
 block/blk-mq-tag.c     | 47 ++++++++++++++++++++++++++++++++----------
 block/blk-mq.c         |  2 ++
 include/linux/blk-mq.h |  1 +
 include/linux/blkdev.h |  1 +
 5 files changed, 63 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 0fa3af41ab65..05b4be0c03d9 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -458,17 +458,37 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
 	}
 }
 
+static void blk_mq_debugfs_tags_shared_sbitmap_show(struct seq_file *m,
+				     struct blk_mq_tags *tags)
+{
+	seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
+	seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
+
+	seq_puts(m, "\nbitmap_tags:\n");
+	sbitmap_queue_show(tags->bitmap_tags, m);
+
+	if (tags->nr_reserved_tags) {
+		seq_puts(m, "\nbreserved_tags:\n");
+		sbitmap_queue_show(tags->breserved_tags, m);
+	}
+}
+
 static int hctx_tags_show(void *data, struct seq_file *m)
 {
 	struct blk_mq_hw_ctx *hctx = data;
 	struct request_queue *q = hctx->queue;
+	struct blk_mq_tag_set *set = q->tag_set;
 	int res;
 
 	res = mutex_lock_interruptible(&q->sysfs_lock);
 	if (res)
 		goto out;
-	if (hctx->tags)
-		blk_mq_debugfs_tags_show(m, hctx->tags);
+	if (hctx->tags) {
+		if (blk_mq_is_sbitmap_shared(set))
+			blk_mq_debugfs_tags_shared_sbitmap_show(m, hctx->tags);
+		else
+			blk_mq_debugfs_tags_show(m, hctx->tags);
+	}
 	mutex_unlock(&q->sysfs_lock);
 
 out:
@@ -802,6 +822,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs
 	{"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
 	{"busy", 0400, hctx_busy_show},
 	{"ctx_map", 0400, hctx_ctx_map_show},
+	{"tags", 0400, hctx_tags_show},
 	{"sched_tags", 0400, hctx_sched_tags_show},
 	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
 	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 7db16e49f6f6..6ca06b1c3a99 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -23,9 +23,19 @@
  */
 bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
-	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
-	    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-		atomic_inc(&hctx->tags->active_queues);
+	struct request_queue *q = hctx->queue;
+	struct blk_mq_tag_set *set = q->tag_set;
+
+	if (blk_mq_is_sbitmap_shared(set)) {
+		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) &&
+		    !test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
+			atomic_inc(&set->active_queues_shared_sbitmap);
+
+	} else {
+		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
+		    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+			atomic_inc(&hctx->tags->active_queues);
+	}
 
 	return true;
 }
@@ -47,11 +57,19 @@ void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
 void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
 	struct blk_mq_tags *tags = hctx->tags;
-
-	if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-		return;
-
-	atomic_dec(&tags->active_queues);
+	struct request_queue *q = hctx->queue;
+	struct blk_mq_tag_set *set = q->tag_set;
+
+	if (blk_mq_is_sbitmap_shared(q->tag_set)) {
+		if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE,
+					&q->queue_flags))
+			return;
+		atomic_dec(&set->active_queues_shared_sbitmap);
+	} else {
+		if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+			return;
+		atomic_dec(&tags->active_queues);
+	}
 
 	blk_mq_tag_wakeup_all(tags, false);
 }
@@ -65,12 +83,11 @@ static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
 {
 	struct blk_mq_hw_ctx *hctx = data->hctx;
 	struct request_queue *q = data->q;
+	struct blk_mq_tag_set *set = q->tag_set;
 	unsigned int depth, users;
 
 	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return true;
-	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-		return true;
 
 	/*
 	 * Don't try dividing an ant
@@ -78,7 +95,15 @@ static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
 	if (bt->sb.depth == 1)
 		return true;
 
-	users = atomic_read(&hctx->tags->active_queues);
+	if (blk_mq_is_sbitmap_shared(q->tag_set)) {
+		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
+			return true;
+		users = atomic_read(&set->active_queues_shared_sbitmap);
+	} else {
+		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+			return true;
+		users = atomic_read(&hctx->tags->active_queues);
+	}
 	if (!users)
 		return true;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0f7e062a1665..f73a2f9c58bd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3350,6 +3350,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 		goto out_free_mq_map;
 
 	if (blk_mq_is_sbitmap_shared(set)) {
+		atomic_set(&set->active_queues_shared_sbitmap, 0);
+
 		if (!blk_mq_init_shared_sbitmap(set)) {
 			ret = -ENOMEM;
 			goto out_free_mq_rq_maps;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7b31cdb92a71..66711c7234db 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -252,6 +252,7 @@ struct blk_mq_tag_set {
 	unsigned int		timeout;
 	unsigned int		flags;
 	void			*driver_data;
+	atomic_t		active_queues_shared_sbitmap;
 
 	struct sbitmap_queue	__bitmap_tags;
 	struct sbitmap_queue	__breserved_tags;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c536278bec9e..1b0087e8d01a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -619,6 +619,7 @@ struct request_queue {
 #define QUEUE_FLAG_PCI_P2PDMA	25	/* device supports PCI p2p requests */
 #define QUEUE_FLAG_ZONE_RESETALL 26	/* supports Zone Reset All */
 #define QUEUE_FLAG_RQ_ALLOC_TIME 27	/* record rq->alloc_time_ns */
+#define QUEUE_FLAG_HCTX_ACTIVE 28	/* at least one blk-mq hctx is active */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP))
-- 
2.26.2


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (5 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set " John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-11 13:19   ` Hannes Reinecke
  2020-06-10 17:29 ` [PATCH RFC v7 08/12] scsi: Add template flag 'host_tagset' John Garry
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

Since a set-wide shared tag sbitmap may be used, it is no longer valid to
examine the per-hctx tagset to get the active requests for a hctx
(when a shared sbitmap is used).

As such, add support for the shared sbitmap by using an intermediate
sbitmap per hctx, iterating all active tags for the specific hctx in the
shared sbitmap.

Originally-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de> #earlier version
Signed-off-by: John Garry <john.garry@huawei.com>
---
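For illustration only (not part of the patch): a minimal userspace sketch
of the filtering idea described above. A single 64-bit word stands in for
the intermediate sbitmap, and model_request is a hypothetical record of
which hctx an in-flight request belongs to.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-flight request: owning hctx and the tag it holds. */
struct model_request {
	unsigned int hctx_idx;
	unsigned int tag;	/* must be < 64 in this sketch */
};

/*
 * Walk all in-flight requests of the shared tag space and set a bit in
 * the output bitmap only for those that belong to hctx_idx.
 */
static uint64_t filter_tags_for_hctx(const struct model_request *rqs,
				     size_t nr, unsigned int hctx_idx)
{
	uint64_t bitmap = 0;
	size_t i;

	for (i = 0; i < nr; i++)
		if (rqs[i].hctx_idx == hctx_idx)
			bitmap |= UINT64_C(1) << rqs[i].tag;

	return bitmap;
}
```

The output bitmap starts out empty on every call, which mirrors the
reason given in the patch for allocating a fresh sbitmap rather than
reusing the one already allocated for the hctx.
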
 block/blk-mq-debugfs.c | 62 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 05b4be0c03d9..4da7e54adf3b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -495,6 +495,67 @@ static int hctx_tags_show(void *data, struct seq_file *m)
 	return res;
 }
 
+struct hctx_sb_data {
+	struct sbitmap		*sb;	/* output bitmap */
+	struct blk_mq_hw_ctx	*hctx;	/* input hctx */
+};
+
+static bool hctx_filter_fn(struct blk_mq_hw_ctx *hctx, struct request *req,
+			   void *priv, bool reserved)
+{
+	struct hctx_sb_data *hctx_sb_data = priv;
+
+	if (hctx == hctx_sb_data->hctx)
+		sbitmap_set_bit(hctx_sb_data->sb, req->tag);
+
+	return true;
+}
+
+static void hctx_filter_sb(struct sbitmap *sb, struct blk_mq_hw_ctx *hctx)
+{
+	struct hctx_sb_data hctx_sb_data = { .sb = sb, .hctx = hctx };
+
+	blk_mq_queue_tag_busy_iter(hctx->queue, hctx_filter_fn, &hctx_sb_data);
+}
+
+static int hctx_tags_shared_sbitmap_bitmap_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+	struct request_queue *q = hctx->queue;
+	struct blk_mq_tag_set *set = q->tag_set;
+	struct sbitmap shared_sb, *sb;
+	int res;
+
+	if (!set)
+		return 0;
+
+	/*
+	 * We could use the allocated sbitmap for that hctx here, but
+	 * that would mean that we would need to clean it prior to use.
+	 */
+	res = sbitmap_init_node(&shared_sb,
+				set->__bitmap_tags.sb.depth,
+				set->__bitmap_tags.sb.shift,
+				GFP_KERNEL, NUMA_NO_NODE);
+	if (res)
+		return res;
+	sb = &shared_sb;
+
+	res = mutex_lock_interruptible(&q->sysfs_lock);
+	if (res)
+		goto out;
+	if (hctx->tags) {
+		hctx_filter_sb(sb, hctx);
+		sbitmap_bitmap_show(sb, m);
+	}
+
+	mutex_unlock(&q->sysfs_lock);
+
+out:
+	sbitmap_free(&shared_sb);
+	return res;
+}
+
 static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
 {
 	struct blk_mq_hw_ctx *hctx = data;
@@ -823,6 +884,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs
 	{"busy", 0400, hctx_busy_show},
 	{"ctx_map", 0400, hctx_ctx_map_show},
 	{"tags", 0400, hctx_tags_show},
+	{"tags_bitmap", 0400, hctx_tags_shared_sbitmap_bitmap_show},
 	{"sched_tags", 0400, hctx_sched_tags_show},
 	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
 	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
-- 
2.26.2


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [PATCH RFC v7 08/12] scsi: Add template flag 'host_tagset'
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (6 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a " John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 09/12] scsi: hisi_sas: Switch v3 hw to MQ John Garry
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 123+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

From: Hannes Reinecke <hare@suse.com>

Add a host template flag 'host_tagset' so that a hostwide tagset can be
shared across multiple reply queues after the SCSI device's reply queues
are converted to blk-mq hw queues.

Signed-off-by: Hannes Reinecke <hare@suse.com>
jpg: Update comment on can_queue
Signed-off-by: John Garry <john.garry@huawei.com>
---
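For illustration only (not part of the patch): a small sketch of how the
total host queue depth follows from can_queue, as stated in the updated
Scsi_Host comment. The helper name total_host_depth is hypothetical.

```c
#include <stdbool.h>

/*
 * Without host_tagset every hw queue has its own can_queue tags; with
 * host_tagset all hw queues draw from one pool of can_queue tags.
 */
static unsigned int total_host_depth(unsigned int can_queue,
				     unsigned int nr_hw_queues,
				     bool host_tagset)
{
	return host_tagset ? can_queue : nr_hw_queues * can_queue;
}
```

For example, with can_queue = 4000 and an assumed 16 hw queues, the host
is taken to accept 64000 commands without host_tagset but only 4000 with
it.
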
 drivers/scsi/scsi_lib.c  | 2 ++
 include/scsi/scsi_host.h | 6 +++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0ba7a65e7c8d..0652acdcec22 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1894,6 +1894,8 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 	tag_set->flags |=
 		BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
 	tag_set->driver_data = shost;
+	if (shost->hostt->host_tagset)
+		tag_set->flags |= BLK_MQ_F_TAG_HCTX_SHARED;
 
 	return blk_mq_alloc_tag_set(tag_set);
 }
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 46ef8cccc982..9b7e333a681d 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -436,6 +436,9 @@ struct scsi_host_template {
 	/* True if the controller does not support WRITE SAME */
 	unsigned no_write_same:1;
 
+	/* True if the host uses host-wide tagspace */
+	unsigned host_tagset:1;
+
 	/*
 	 * Countdown for host blocking with no commands outstanding.
 	 */
@@ -603,7 +606,8 @@ struct Scsi_Host {
 	 *
 	 * Note: it is assumed that each hardware queue has a queue depth of
 	 * can_queue. In other words, the total queue depth per host
-	 * is nr_hw_queues * can_queue.
+	 * is nr_hw_queues * can_queue. However, when host_tagset is set,
+	 * the total queue depth is can_queue.
 	 */
 	unsigned nr_hw_queues;
 	unsigned active_mode:2;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [PATCH RFC v7 09/12] scsi: hisi_sas: Switch v3 hw to MQ
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (7 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 08/12] scsi: Add template flag 'host_tagset' John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters " John Garry
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 123+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

Now that the block layer provides shared tags, we can switch the driver
to expose all HW queues.

Signed-off-by: John Garry <john.garry@huawei.com>
---
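For illustration only (not part of the patch): a userspace sketch of the
encoding behind blk_mq_unique_tag() and blk_mq_unique_tag_to_hwq(), which
the driver now uses to pick the delivery queue. The hw queue index sits
in the upper 16 bits and the per-queue tag in the lower 16 bits; the
helper names below are stand-ins, not the kernel functions themselves.

```c
#include <stdint.h>

#define UNIQUE_TAG_BITS	16
#define UNIQUE_TAG_MASK	((1U << UNIQUE_TAG_BITS) - 1)

/* Combine the hw queue index and the tag into one 32-bit value. */
static uint32_t unique_tag(uint32_t hwq, uint32_t tag)
{
	return (hwq << UNIQUE_TAG_BITS) | (tag & UNIQUE_TAG_MASK);
}

/* Recover the hw queue index; the driver uses this as dq_index. */
static uint32_t unique_tag_to_hwq(uint32_t utag)
{
	return utag >> UNIQUE_TAG_BITS;
}

/* Recover the tag within the tag space. */
static uint32_t unique_tag_to_tag(uint32_t utag)
{
	return utag & UNIQUE_TAG_MASK;
}
```

Deriving dq_index from the request's hw queue, rather than from the
submitting CPU via reply_map, keeps the delivery queue in line with the
queue mapping that blk-mq built through .map_queues.
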
 drivers/scsi/hisi_sas/hisi_sas.h       |  3 +-
 drivers/scsi/hisi_sas/hisi_sas_main.c  | 36 ++++++-----
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 87 +++++++++++---------------
 3 files changed, 56 insertions(+), 70 deletions(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas.h b/drivers/scsi/hisi_sas/hisi_sas.h
index 2bdd64648ef0..e6acbf940712 100644
--- a/drivers/scsi/hisi_sas/hisi_sas.h
+++ b/drivers/scsi/hisi_sas/hisi_sas.h
@@ -8,6 +8,8 @@
 #define _HISI_SAS_H_
 
 #include <linux/acpi.h>
+#include <linux/blk-mq.h>
+#include <linux/blk-mq-pci.h>
 #include <linux/clk.h>
 #include <linux/debugfs.h>
 #include <linux/dmapool.h>
@@ -431,7 +433,6 @@ struct hisi_hba {
 	u32 intr_coal_count;	/* Interrupt count to coalesce */
 
 	int cq_nvecs;
-	unsigned int *reply_map;
 
 	/* bist */
 	enum sas_linkrate debugfs_bist_linkrate;
diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c
index 11caa4b0d797..7ed4eaedb7ca 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_main.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_main.c
@@ -417,6 +417,7 @@ static int hisi_sas_task_prep(struct sas_task *task,
 	struct device *dev = hisi_hba->dev;
 	int dlvry_queue_slot, dlvry_queue, rc, slot_idx;
 	int n_elem = 0, n_elem_dif = 0, n_elem_req = 0;
+	struct scsi_cmnd *scmd = NULL;
 	struct hisi_sas_dq *dq;
 	unsigned long flags;
 	int wr_q_index;
@@ -432,10 +433,23 @@ static int hisi_sas_task_prep(struct sas_task *task,
 		return -ECOMM;
 	}
 
-	if (hisi_hba->reply_map) {
-		int cpu = raw_smp_processor_id();
-		unsigned int dq_index = hisi_hba->reply_map[cpu];
+	if (task->uldd_task) {
+		struct ata_queued_cmd *qc;
 
+		if (dev_is_sata(device)) {
+			qc = task->uldd_task;
+			scmd = qc->scsicmd;
+		} else {
+			scmd = task->uldd_task;
+		}
+	}
+
+	if (scmd) {
+		unsigned int dq_index;
+		u32 blk_tag;
+
+		blk_tag = blk_mq_unique_tag(scmd->request);
+		dq_index = blk_mq_unique_tag_to_hwq(blk_tag);
 		*dq_pointer = dq = &hisi_hba->dq[dq_index];
 	} else {
 		*dq_pointer = dq = sas_dev->dq;
@@ -464,21 +478,9 @@ static int hisi_sas_task_prep(struct sas_task *task,
 
 	if (hisi_hba->hw->slot_index_alloc)
 		rc = hisi_hba->hw->slot_index_alloc(hisi_hba, device);
-	else {
-		struct scsi_cmnd *scsi_cmnd = NULL;
-
-		if (task->uldd_task) {
-			struct ata_queued_cmd *qc;
+	else
+		rc = hisi_sas_slot_index_alloc(hisi_hba, scmd);
 
-			if (dev_is_sata(device)) {
-				qc = task->uldd_task;
-				scsi_cmnd = qc->scsicmd;
-			} else {
-				scsi_cmnd = task->uldd_task;
-			}
-		}
-		rc  = hisi_sas_slot_index_alloc(hisi_hba, scsi_cmnd);
-	}
 	if (rc < 0)
 		goto err_out_dif_dma_unmap;
 
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index 3e6b78a1f993..e22231403bbb 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -2360,68 +2360,36 @@ static irqreturn_t cq_interrupt_v3_hw(int irq_no, void *p)
 	return IRQ_WAKE_THREAD;
 }
 
-static void setup_reply_map_v3_hw(struct hisi_hba *hisi_hba, int nvecs)
+static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
 {
-	const struct cpumask *mask;
-	int queue, cpu;
+	int vectors;
+	int max_msi = HISI_SAS_MSI_COUNT_V3_HW, min_msi;
+	struct Scsi_Host *shost = hisi_hba->shost;
+	struct irq_affinity desc = {
+		.pre_vectors = BASE_VECTORS_V3_HW,
+	};
 
-	for (queue = 0; queue < nvecs; queue++) {
-		struct hisi_sas_cq *cq = &hisi_hba->cq[queue];
+	min_msi = MIN_AFFINE_VECTORS_V3_HW;
+	vectors = pci_alloc_irq_vectors_affinity(hisi_hba->pci_dev,
+						 min_msi, max_msi,
+						 PCI_IRQ_MSI |
+						 PCI_IRQ_AFFINITY,
+						 &desc);
+	if (vectors < 0)
+		return -ENOENT;
 
-		mask = pci_irq_get_affinity(hisi_hba->pci_dev, queue +
-					    BASE_VECTORS_V3_HW);
-		if (!mask)
-			goto fallback;
-		cq->irq_mask = mask;
-		for_each_cpu(cpu, mask)
-			hisi_hba->reply_map[cpu] = queue;
-	}
-	return;
 
-fallback:
-	for_each_possible_cpu(cpu)
-		hisi_hba->reply_map[cpu] = cpu % hisi_hba->queue_count;
-	/* Don't clean all CQ masks */
+	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW;
+	shost->nr_hw_queues = hisi_hba->cq_nvecs;
+
+	return 0;
 }
 
 static int interrupt_init_v3_hw(struct hisi_hba *hisi_hba)
 {
 	struct device *dev = hisi_hba->dev;
 	struct pci_dev *pdev = hisi_hba->pci_dev;
-	int vectors, rc, i;
-	int max_msi = HISI_SAS_MSI_COUNT_V3_HW, min_msi;
-
-	if (auto_affine_msi_experimental) {
-		struct irq_affinity desc = {
-			.pre_vectors = BASE_VECTORS_V3_HW,
-		};
-
-		dev_info(dev, "Enable MSI auto-affinity\n");
-
-		min_msi = MIN_AFFINE_VECTORS_V3_HW;
-
-		hisi_hba->reply_map = devm_kcalloc(dev, nr_cpu_ids,
-						   sizeof(unsigned int),
-						   GFP_KERNEL);
-		if (!hisi_hba->reply_map)
-			return -ENOMEM;
-		vectors = pci_alloc_irq_vectors_affinity(hisi_hba->pci_dev,
-							 min_msi, max_msi,
-							 PCI_IRQ_MSI |
-							 PCI_IRQ_AFFINITY,
-							 &desc);
-		if (vectors < 0)
-			return -ENOENT;
-		setup_reply_map_v3_hw(hisi_hba, vectors - BASE_VECTORS_V3_HW);
-	} else {
-		min_msi = max_msi;
-		vectors = pci_alloc_irq_vectors(hisi_hba->pci_dev, min_msi,
-						max_msi, PCI_IRQ_MSI);
-		if (vectors < 0)
-			return vectors;
-	}
-
-	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW;
+	int rc, i;
 
 	rc = devm_request_irq(dev, pci_irq_vector(pdev, 1),
 			      int_phy_up_down_bcast_v3_hw, 0,
@@ -3070,6 +3038,15 @@ static int debugfs_set_bist_v3_hw(struct hisi_hba *hisi_hba, bool enable)
 	return 0;
 }
 
+static int hisi_sas_map_queues(struct Scsi_Host *shost)
+{
+	struct hisi_hba *hisi_hba = shost_priv(shost);
+	struct blk_mq_queue_map *qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
+
+	return blk_mq_pci_map_queues(qmap, hisi_hba->pci_dev,
+				     BASE_VECTORS_V3_HW);
+}
+
 static struct scsi_host_template sht_v3_hw = {
 	.name			= DRV_NAME,
 	.proc_name		= DRV_NAME,
@@ -3079,6 +3056,7 @@ static struct scsi_host_template sht_v3_hw = {
 	.slave_configure	= hisi_sas_slave_configure,
 	.scan_finished		= hisi_sas_scan_finished,
 	.scan_start		= hisi_sas_scan_start,
+	.map_queues		= hisi_sas_map_queues,
 	.change_queue_depth	= sas_change_queue_depth,
 	.bios_param		= sas_bios_param,
 	.this_id		= -1,
@@ -3095,6 +3073,7 @@ static struct scsi_host_template sht_v3_hw = {
 	.shost_attrs		= host_attrs_v3_hw,
 	.tag_alloc_policy	= BLK_TAG_ALLOC_RR,
 	.host_reset             = hisi_sas_host_reset,
+	.host_tagset		= 1,
 };
 
 static const struct hisi_sas_hw hisi_sas_v3_hw = {
@@ -3266,6 +3245,10 @@ hisi_sas_v3_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (hisi_sas_debugfs_enable)
 		hisi_sas_debugfs_init(hisi_hba);
 
+	rc = interrupt_preinit_v3_hw(hisi_hba);
+	if (rc)
+		goto err_out_ha;
+	dev_err(dev, "%d hw queues\n", shost->nr_hw_queues);
 	rc = scsi_add_host(shost, dev);
 	if (rc)
 		goto err_out_ha;
-- 
2.26.2
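
For readers following the switch from the hand-rolled reply_map to .map_queues above, here is a rough userspace model of the CPU-to-queue mapping that blk_mq_pci_map_queues() builds from the vector affinity masks. It is only a sketch: the fixed NR_CPUS, the bitmask affinity table and the map_queues() name are assumptions for illustration, and the kernel's fallback path for a missing mask is not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 8	/* illustrative CPU count, not taken from the patch */

/*
 * vec_affinity[v] is a hypothetical stand-in for pci_irq_get_affinity():
 * bit n set means CPU n is in the affinity mask of vector v.
 * 'offset' plays the role of the pre_vectors skipped before the
 * completion-queue vectors (BASE_VECTORS_V3_HW in the patch).
 */
static void map_queues(unsigned int *mq_map, const uint32_t *vec_affinity,
		       unsigned int nr_queues, unsigned int offset)
{
	assert(mq_map && vec_affinity && nr_queues > 0);

	for (unsigned int queue = 0; queue < nr_queues; queue++) {
		uint32_t mask = vec_affinity[queue + offset];

		/* Every CPU in this vector's mask submits on this hw queue. */
		for (unsigned int cpu = 0; cpu < NR_CPUS; cpu++)
			if (mask & (1u << cpu))
				mq_map[cpu] = queue;
	}
}
```

With two pre-vectors and four queue vectors that each cover two CPUs, CPUs 0-1 map to queue 0, CPUs 2-3 to queue 1, and so on, which is the same result the removed setup_reply_map_v3_hw() loop produced by hand.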



* [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (8 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 09/12] scsi: hisi_sas: Switch v3 hw to MQ John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-07-02 10:23   ` Kashyap Desai
  2020-06-10 17:29 ` [PATCH RFC v7 11/12] smartpqi: enable host tagset John Garry
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

From: Hannes Reinecke <hare@suse.com>

Fusion adapters can steer completions to individual queues, and
we now have support for shared host-wide tags.
So we can enable multiqueue support for fusion adapters and
drop the hand-crafted interrupt affinity settings.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
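
The tag handling in this patch relies on how blk_mq_unique_tag() packs the hw queue number and the request tag into one u32. A minimal userspace sketch of that encoding follows, assuming the usual 16/16 bit split; the function names and the msix_index() helper are illustrative stand-ins, not driver code.

```c
#include <assert.h>
#include <stdint.h>

/* Same split as the kernel's BLK_MQ_UNIQUE_TAG_BITS / BLK_MQ_UNIQUE_TAG_MASK. */
#define UNIQUE_TAG_BITS 16
#define UNIQUE_TAG_MASK ((1u << UNIQUE_TAG_BITS) - 1)

/* Stand-in for blk_mq_unique_tag(): hw queue number above the request tag. */
static uint32_t unique_tag(uint32_t hwq, uint32_t tag)
{
	assert(hwq <= UNIQUE_TAG_MASK && tag <= UNIQUE_TAG_MASK);
	return (hwq << UNIQUE_TAG_BITS) | (tag & UNIQUE_TAG_MASK);
}

/* Stand-ins for blk_mq_unique_tag_to_hwq() and blk_mq_unique_tag_to_tag(). */
static uint32_t unique_tag_to_hwq(uint32_t utag)
{
	return utag >> UNIQUE_TAG_BITS;
}

static uint32_t unique_tag_to_tag(uint32_t utag)
{
	return utag & UNIQUE_TAG_MASK;
}

/*
 * Hypothetical helper mirroring the final else branch of
 * megasas_get_msix_index(): the reply queue is the hw queue number
 * shifted past the low-latency vectors.
 */
static uint32_t msix_index(uint32_t utag, uint32_t low_latency_index_start)
{
	return unique_tag_to_hwq(utag) + low_latency_index_start;
}
```

With host_tagset set, the low 16 bits are unique across the whole host, which is why the patch can index the command pool with that value directly.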
 drivers/scsi/megaraid/megaraid_sas.h        |  1 -
 drivers/scsi/megaraid/megaraid_sas_base.c   | 59 +++++++--------------
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 24 +++++----
 3 files changed, 32 insertions(+), 52 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas.h b/drivers/scsi/megaraid/megaraid_sas.h
index af2c7a2a9565..b27a34a5f5de 100644
--- a/drivers/scsi/megaraid/megaraid_sas.h
+++ b/drivers/scsi/megaraid/megaraid_sas.h
@@ -2261,7 +2261,6 @@ enum MR_PERF_MODE {
 
 struct megasas_instance {
 
-	unsigned int *reply_map;
 	__le32 *producer;
 	dma_addr_t producer_h;
 	__le32 *consumer;
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 00668335c2af..e6bb2a64d51c 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -37,6 +37,7 @@
 #include <linux/poll.h>
 #include <linux/vmalloc.h>
 #include <linux/irq_poll.h>
+#include <linux/blk-mq-pci.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -3115,6 +3116,19 @@ megasas_bios_param(struct scsi_device *sdev, struct block_device *bdev,
 	return 0;
 }
 
+static int megasas_map_queues(struct Scsi_Host *shost)
+{
+	struct megasas_instance *instance;
+
+	instance = (struct megasas_instance *)shost->hostdata;
+
+	if (!instance->smp_affinity_enable)
+		return 0;
+
+	return blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
+			instance->pdev, instance->low_latency_index_start);
+}
+
 static void megasas_aen_polling(struct work_struct *work);
 
 /**
@@ -3423,8 +3437,10 @@ static struct scsi_host_template megasas_template = {
 	.eh_timed_out = megasas_reset_timer,
 	.shost_attrs = megaraid_host_attrs,
 	.bios_param = megasas_bios_param,
+	.map_queues = megasas_map_queues,
 	.change_queue_depth = scsi_change_queue_depth,
 	.max_segment_size = 0xffffffff,
+	.host_tagset = 1,
 };
 
 /**
@@ -5708,34 +5724,6 @@ megasas_setup_jbod_map(struct megasas_instance *instance)
 		instance->use_seqnum_jbod_fp = false;
 }
 
-static void megasas_setup_reply_map(struct megasas_instance *instance)
-{
-	const struct cpumask *mask;
-	unsigned int queue, cpu, low_latency_index_start;
-
-	low_latency_index_start = instance->low_latency_index_start;
-
-	for (queue = low_latency_index_start; queue < instance->msix_vectors; queue++) {
-		mask = pci_irq_get_affinity(instance->pdev, queue);
-		if (!mask)
-			goto fallback;
-
-		for_each_cpu(cpu, mask)
-			instance->reply_map[cpu] = queue;
-	}
-	return;
-
-fallback:
-	queue = low_latency_index_start;
-	for_each_possible_cpu(cpu) {
-		instance->reply_map[cpu] = queue;
-		if (queue == (instance->msix_vectors - 1))
-			queue = low_latency_index_start;
-		else
-			queue++;
-	}
-}
-
 /**
  * megasas_get_device_list -	Get the PD and LD device list from FW.
  * @instance:			Adapter soft state
@@ -6158,8 +6146,6 @@ static int megasas_init_fw(struct megasas_instance *instance)
 			goto fail_init_adapter;
 	}
 
-	megasas_setup_reply_map(instance);
-
 	dev_info(&instance->pdev->dev,
 		"current msix/online cpus\t: (%d/%d)\n",
 		instance->msix_vectors, (unsigned int)num_online_cpus());
@@ -6793,6 +6779,9 @@ static int megasas_io_attach(struct megasas_instance *instance)
 	host->max_id = MEGASAS_MAX_DEV_PER_CHANNEL;
 	host->max_lun = MEGASAS_MAX_LUN;
 	host->max_cmd_len = 16;
+	if (instance->adapter_type != MFI_SERIES && instance->msix_vectors > 0)
+		host->nr_hw_queues = instance->msix_vectors -
+			instance->low_latency_index_start;
 
 	/*
 	 * Notify the mid-layer about the new controller
@@ -6960,11 +6949,6 @@ static inline int megasas_alloc_mfi_ctrl_mem(struct megasas_instance *instance)
  */
 static int megasas_alloc_ctrl_mem(struct megasas_instance *instance)
 {
-	instance->reply_map = kcalloc(nr_cpu_ids, sizeof(unsigned int),
-				      GFP_KERNEL);
-	if (!instance->reply_map)
-		return -ENOMEM;
-
 	switch (instance->adapter_type) {
 	case MFI_SERIES:
 		if (megasas_alloc_mfi_ctrl_mem(instance))
@@ -6981,8 +6965,6 @@ static int megasas_alloc_ctrl_mem(struct megasas_instance *instance)
 
 	return 0;
  fail:
-	kfree(instance->reply_map);
-	instance->reply_map = NULL;
 	return -ENOMEM;
 }
 
@@ -6995,7 +6977,6 @@ static int megasas_alloc_ctrl_mem(struct megasas_instance *instance)
  */
 static inline void megasas_free_ctrl_mem(struct megasas_instance *instance)
 {
-	kfree(instance->reply_map);
 	if (instance->adapter_type == MFI_SERIES) {
 		if (instance->producer)
 			dma_free_coherent(&instance->pdev->dev, sizeof(u32),
@@ -7683,8 +7664,6 @@ megasas_resume(struct pci_dev *pdev)
 			goto fail_reenable_msix;
 	}
 
-	megasas_setup_reply_map(instance);
-
 	if (instance->adapter_type != MFI_SERIES) {
 		megasas_reset_reply_desc(instance);
 		if (megasas_ioc_init_fusion(instance)) {
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 319f241da4b6..8e25b700988e 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -373,24 +373,24 @@ megasas_get_msix_index(struct megasas_instance *instance,
 {
 	int sdev_busy;
 
-	/* nr_hw_queue = 1 for MegaRAID */
-	struct blk_mq_hw_ctx *hctx =
-		scmd->device->request_queue->queue_hw_ctx[0];
+	struct blk_mq_hw_ctx *hctx = scmd->request->mq_hctx;
 
 	sdev_busy = atomic_read(&hctx->nr_active);
 
 	if (instance->perf_mode == MR_BALANCED_PERF_MODE &&
-	    sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH))
+	    sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH)) {
 		cmd->request_desc->SCSIIO.MSIxIndex =
 			mega_mod64((atomic64_add_return(1, &instance->high_iops_outstanding) /
 					MR_HIGH_IOPS_BATCH_COUNT), instance->low_latency_index_start);
-	else if (instance->msix_load_balance)
+	} else if (instance->msix_load_balance) {
 		cmd->request_desc->SCSIIO.MSIxIndex =
 			(mega_mod64(atomic64_add_return(1, &instance->total_io_count),
 				instance->msix_vectors));
-	else
-		cmd->request_desc->SCSIIO.MSIxIndex =
-			instance->reply_map[raw_smp_processor_id()];
+	} else {
+		u32 tag = blk_mq_unique_tag(scmd->request);
+
+		cmd->request_desc->SCSIIO.MSIxIndex = blk_mq_unique_tag_to_hwq(tag) + instance->low_latency_index_start;
+	}
 }
 
 /**
@@ -3326,7 +3326,7 @@ megasas_build_and_issue_cmd_fusion(struct megasas_instance *instance,
 {
 	struct megasas_cmd_fusion *cmd, *r1_cmd = NULL;
 	union MEGASAS_REQUEST_DESCRIPTOR_UNION *req_desc;
-	u32 index;
+	u32 index, blk_tag, unique_tag;
 
 	if ((megasas_cmd_type(scmd) == READ_WRITE_LDIO) &&
 		instance->ldio_threshold &&
@@ -3342,7 +3342,9 @@ megasas_build_and_issue_cmd_fusion(struct megasas_instance *instance,
 		return SCSI_MLQUEUE_HOST_BUSY;
 	}
 
-	cmd = megasas_get_cmd_fusion(instance, scmd->request->tag);
+	unique_tag = blk_mq_unique_tag(scmd->request);
+	blk_tag = blk_mq_unique_tag_to_tag(unique_tag);
+	cmd = megasas_get_cmd_fusion(instance, blk_tag);
 
 	if (!cmd) {
 		atomic_dec(&instance->fw_outstanding);
@@ -3383,7 +3385,7 @@ megasas_build_and_issue_cmd_fusion(struct megasas_instance *instance,
 	 */
 	if (cmd->r1_alt_dev_handle != MR_DEVHANDLE_INVALID) {
 		r1_cmd = megasas_get_cmd_fusion(instance,
-				(scmd->request->tag + instance->max_fw_cmds));
+				(blk_tag + instance->max_fw_cmds));
 		megasas_prepare_secondRaid1_IO(instance, cmd, r1_cmd);
 	}
 
-- 
2.26.2



* [PATCH RFC v7 11/12] smartpqi: enable host tagset
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (9 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters " John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-07-14 13:16   ` John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ John Garry
  2020-06-11  3:07 ` [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs Ming Lei
  12 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, Hannes Reinecke

From: Hannes Reinecke <hare@suse.de>

Enable host tagset for smartpqi; with this we can use the request
tag to look up the command in the pool, avoiding the list iteration
in the hot path.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
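
The slot layout this patch moves to can be modelled in userspace roughly as below: SCSI commands index the pool directly by block-layer tag, offset past a reserved range, and internal requests keep scanning only the reserved range. This is a sketch under stated assumptions: the pool sizes, the struct and function names, and passing a negative tag to mean "no scmd" are all illustrative, and next_slot is kept inside the reserved range here as a simplification.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define RESERVED_IO_SLOTS 4	/* stands in for PQI_RESERVED_IO_SLOTS; value is illustrative */
#define CAN_QUEUE 8		/* hypothetical number of block-layer tags */
#define MAX_IO_SLOTS (RESERVED_IO_SLOTS + CAN_QUEUE)

struct io_request {
	atomic_int refcount;
	unsigned int index;
};

struct ctrl {
	struct io_request pool[MAX_IO_SLOTS];
	unsigned int next_slot;	/* only walks the reserved range in this model */
};

static void ctrl_init(struct ctrl *c)
{
	for (unsigned int i = 0; i < MAX_IO_SLOTS; i++) {
		atomic_init(&c->pool[i].refcount, 0);
		c->pool[i].index = i;
	}
	c->next_slot = 0;
}

/* tag >= 0 models a SCSI command with that tag; tag < 0 models scmd == NULL. */
static struct io_request *alloc_io_request(struct ctrl *c, int tag)
{
	struct io_request *req;
	unsigned int i;

	if (tag >= 0) {
		assert(tag < CAN_QUEUE);
		/* Direct lookup: the tag is unique host-wide, so no scan is needed. */
		req = &c->pool[tag + RESERVED_IO_SLOTS];
		if (atomic_fetch_add(&req->refcount, 1) + 1 > 1) {
			/* Slot already in use: should not happen with unique tags. */
			atomic_fetch_sub(&req->refcount, 1);
			return NULL;
		}
		return req;
	}

	/* Internal request: scan the reserved slots (spins if all are busy). */
	i = c->next_slot;
	for (;;) {
		req = &c->pool[i];
		if (atomic_fetch_add(&req->refcount, 1) + 1 == 1)
			break;
		atomic_fetch_sub(&req->refcount, 1);
		i = (i + 1) % RESERVED_IO_SLOTS;
	}
	c->next_slot = (i + 1) % RESERVED_IO_SLOTS;
	return req;
}

static void free_io_request(struct io_request *req)
{
	atomic_fetch_sub(&req->refcount, 1);
}
```

The NULL return on the tagged path is what the submit functions in the diff translate into SCSI_MLQUEUE_HOST_BUSY.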
 drivers/scsi/smartpqi/smartpqi_init.c | 38 ++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index cd157f11eb22..1f4de4c2d876 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -575,17 +575,29 @@ static inline void pqi_reinit_io_request(struct pqi_io_request *io_request)
 }
 
 static struct pqi_io_request *pqi_alloc_io_request(
-	struct pqi_ctrl_info *ctrl_info)
+	struct pqi_ctrl_info *ctrl_info, struct scsi_cmnd *scmd)
 {
 	struct pqi_io_request *io_request;
+	unsigned int limit = PQI_RESERVED_IO_SLOTS;
 	u16 i = ctrl_info->next_io_request_slot;	/* benignly racy */
 
-	while (1) {
+	if (scmd) {
+		u32 blk_tag = blk_mq_unique_tag(scmd->request);
+
+		i = blk_mq_unique_tag_to_tag(blk_tag) + limit;
 		io_request = &ctrl_info->io_request_pool[i];
-		if (atomic_inc_return(&io_request->refcount) == 1)
-			break;
-		atomic_dec(&io_request->refcount);
-		i = (i + 1) % ctrl_info->max_io_slots;
+		if (WARN_ON(atomic_inc_return(&io_request->refcount) > 1)) {
+			atomic_dec(&io_request->refcount);
+			return NULL;
+		}
+	} else {
+		while (1) {
+			io_request = &ctrl_info->io_request_pool[i];
+			if (atomic_inc_return(&io_request->refcount) == 1)
+				break;
+			atomic_dec(&io_request->refcount);
+			i = (i + 1) % limit;
+		}
 	}
 
 	/* benignly racy */
@@ -4075,7 +4087,7 @@ static int pqi_submit_raid_request_synchronous(struct pqi_ctrl_info *ctrl_info,
 
 	atomic_inc(&ctrl_info->sync_cmds_outstanding);
 
-	io_request = pqi_alloc_io_request(ctrl_info);
+	io_request = pqi_alloc_io_request(ctrl_info, NULL);
 
 	put_unaligned_le16(io_request->index,
 		&(((struct pqi_raid_path_request *)request)->request_id));
@@ -5032,7 +5044,9 @@ static inline int pqi_raid_submit_scsi_cmd(struct pqi_ctrl_info *ctrl_info,
 {
 	struct pqi_io_request *io_request;
 
-	io_request = pqi_alloc_io_request(ctrl_info);
+	io_request = pqi_alloc_io_request(ctrl_info, scmd);
+	if (!io_request)
+		return SCSI_MLQUEUE_HOST_BUSY;
 
 	return pqi_raid_submit_scsi_cmd_with_io_request(ctrl_info, io_request,
 		device, scmd, queue_group);
@@ -5230,7 +5244,10 @@ static int pqi_aio_submit_io(struct pqi_ctrl_info *ctrl_info,
 	struct pqi_io_request *io_request;
 	struct pqi_aio_path_request *request;
 
-	io_request = pqi_alloc_io_request(ctrl_info);
+	io_request = pqi_alloc_io_request(ctrl_info, scmd);
+	if (!io_request)
+		return SCSI_MLQUEUE_HOST_BUSY;
+
 	io_request->io_complete_callback = pqi_aio_io_complete;
 	io_request->scmd = scmd;
 	io_request->raid_bypass = raid_bypass;
@@ -5657,7 +5674,7 @@ static int pqi_lun_reset(struct pqi_ctrl_info *ctrl_info,
 	DECLARE_COMPLETION_ONSTACK(wait);
 	struct pqi_task_management_request *request;
 
-	io_request = pqi_alloc_io_request(ctrl_info);
+	io_request = pqi_alloc_io_request(ctrl_info, NULL);
 	io_request->io_complete_callback = pqi_lun_reset_complete;
 	io_request->context = &wait;
 
@@ -6504,6 +6521,7 @@ static struct scsi_host_template pqi_driver_template = {
 	.map_queues = pqi_map_queues,
 	.sdev_attrs = pqi_sdev_attrs,
 	.shost_attrs = pqi_shost_attrs,
+	.host_tagset = 1,
 };
 
 static int pqi_register_scsi(struct pqi_ctrl_info *ctrl_info)
-- 
2.26.2



* [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (10 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 11/12] smartpqi: enable host tagset John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-07-14  7:37   ` John Garry
  2020-06-11  3:07 ` [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs Ming Lei
  12 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, Hannes Reinecke

From: Hannes Reinecke <hare@suse.de>

The smart array HBAs can steer interrupt completion, so this
patch switches the implementation to use multiqueue and enables
'host_tagset' as the HBA has a shared host-wide tagset.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/scsi/hpsa.c | 44 +++++++-------------------------------------
 drivers/scsi/hpsa.h |  1 -
 2 files changed, 7 insertions(+), 38 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 1e9302e99d05..f807f9bdae85 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -980,6 +980,7 @@ static struct scsi_host_template hpsa_driver_template = {
 	.shost_attrs = hpsa_shost_attrs,
 	.max_sectors = 2048,
 	.no_write_same = 1,
+	.host_tagset = 1,
 };
 
 static inline u32 next_command(struct ctlr_info *h, u8 q)
@@ -1144,12 +1145,14 @@ static void dial_up_lockup_detection_on_fw_flash_complete(struct ctlr_info *h,
 static void __enqueue_cmd_and_start_io(struct ctlr_info *h,
 	struct CommandList *c, int reply_queue)
 {
+	u32 blk_tag = blk_mq_unique_tag(c->scsi_cmd->request);
+
 	dial_down_lockup_detection_during_fw_flash(h, c);
 	atomic_inc(&h->commands_outstanding);
 	if (c->device)
 		atomic_inc(&c->device->commands_outstanding);
 
-	reply_queue = h->reply_map[raw_smp_processor_id()];
+	reply_queue = blk_mq_unique_tag_to_hwq(blk_tag);
 	switch (c->cmd_type) {
 	case CMD_IOACCEL1:
 		set_ioaccel1_performant_mode(h, c, reply_queue);
@@ -5653,8 +5656,6 @@ static int hpsa_scsi_queue_command(struct Scsi_Host *sh, struct scsi_cmnd *cmd)
 	/* Get the ptr to our adapter structure out of cmd->host. */
 	h = sdev_to_hba(cmd->device);
 
-	BUG_ON(cmd->request->tag < 0);
-
 	dev = cmd->device->hostdata;
 	if (!dev) {
 		cmd->result = DID_NO_CONNECT << 16;
@@ -5830,7 +5831,7 @@ static int hpsa_scsi_host_alloc(struct ctlr_info *h)
 	sh->hostdata[0] = (unsigned long) h;
 	sh->irq = pci_irq_vector(h->pdev, 0);
 	sh->unique_id = sh->irq;
-
+	sh->nr_hw_queues = h->msix_vectors > 0 ? h->msix_vectors : 1;
 	h->scsi_host = sh;
 	return 0;
 }
@@ -5856,7 +5857,8 @@ static int hpsa_scsi_add_host(struct ctlr_info *h)
  */
 static int hpsa_get_cmd_index(struct scsi_cmnd *scmd)
 {
-	int idx = scmd->request->tag;
+	u32 blk_tag = blk_mq_unique_tag(scmd->request);
+	int idx = blk_mq_unique_tag_to_tag(blk_tag);
 
 	if (idx < 0)
 		return idx;
@@ -7456,26 +7458,6 @@ static void hpsa_disable_interrupt_mode(struct ctlr_info *h)
 	h->msix_vectors = 0;
 }
 
-static void hpsa_setup_reply_map(struct ctlr_info *h)
-{
-	const struct cpumask *mask;
-	unsigned int queue, cpu;
-
-	for (queue = 0; queue < h->msix_vectors; queue++) {
-		mask = pci_irq_get_affinity(h->pdev, queue);
-		if (!mask)
-			goto fallback;
-
-		for_each_cpu(cpu, mask)
-			h->reply_map[cpu] = queue;
-	}
-	return;
-
-fallback:
-	for_each_possible_cpu(cpu)
-		h->reply_map[cpu] = 0;
-}
-
 /* If MSI/MSI-X is supported by the kernel we will try to enable it on
  * controllers that are capable. If not, we use legacy INTx mode.
  */
@@ -7872,9 +7854,6 @@ static int hpsa_pci_init(struct ctlr_info *h)
 	if (err)
 		goto clean1;
 
-	/* setup mapping between CPU and reply queue */
-	hpsa_setup_reply_map(h);
-
 	err = hpsa_pci_find_memory_BAR(h->pdev, &h->paddr);
 	if (err)
 		goto clean2;	/* intmode+region, pci */
@@ -8613,7 +8592,6 @@ static struct workqueue_struct *hpsa_create_controller_wq(struct ctlr_info *h,
 
 static void hpda_free_ctlr_info(struct ctlr_info *h)
 {
-	kfree(h->reply_map);
 	kfree(h);
 }
 
@@ -8622,14 +8600,6 @@ static struct ctlr_info *hpda_alloc_ctlr_info(void)
 	struct ctlr_info *h;
 
 	h = kzalloc(sizeof(*h), GFP_KERNEL);
-	if (!h)
-		return NULL;
-
-	h->reply_map = kcalloc(nr_cpu_ids, sizeof(*h->reply_map), GFP_KERNEL);
-	if (!h->reply_map) {
-		kfree(h);
-		return NULL;
-	}
 	return h;
 }
 
diff --git a/drivers/scsi/hpsa.h b/drivers/scsi/hpsa.h
index f8c88fc7b80a..ea4a609e3eb7 100644
--- a/drivers/scsi/hpsa.h
+++ b/drivers/scsi/hpsa.h
@@ -161,7 +161,6 @@ struct bmic_controller_parameters {
 #pragma pack()
 
 struct ctlr_info {
-	unsigned int *reply_map;
 	int	ctlr;
 	char	devname[8];
 	char    *product_name;
-- 
2.26.2



* Re: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-10 17:29 ` [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth() John Garry
@ 2020-06-11  2:57   ` Ming Lei
  2020-06-11  8:26     ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-06-11  2:57 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, Hannes Reinecke

On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
> From: Hannes Reinecke <hare@suse.de>
> 
> The function does not set the depth, but rather transitions from
> shared to non-shared queues and vice versa.
> So rename it to blk_mq_update_tag_set_shared() to better reflect
> its purpose.

Renaming it is fine with me; however:

This patch claims only to rename the function to
blk_mq_update_tag_set_shared(), but it also changes the signature of
blk_mq_init_bitmap_tags().

So I suggest either splitting this patch in two or documenting the
blk_mq_init_bitmap_tags() change in the commit log.


Thanks, 
Ming



* Re: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (11 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ John Garry
@ 2020-06-11  3:07 ` Ming Lei
  2020-06-11  9:35   ` John Garry
  12 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-06-11  3:07 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On Thu, Jun 11, 2020 at 01:29:07AM +0800, John Garry wrote:
> Hi all,
> 
> Here is v7 of the patchset.
> 
> In this version of the series, we keep the shared sbitmap for driver tags,
> and introduce changes to fix up the tag budgeting across request queues (
> and associated debugfs changes).
> 
> Some performance figures:
> 
> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
> but it is not always an appropriate scheduler to use.
> 
> Tag depth 		4000 (default)			260**
> 
> Baseline:
> none sched:		2290K IOPS			894K
> mq-deadline sched:	2341K IOPS			2313K
> 
> Final, host_tagset=0 in LLDD*
> none sched:		2289K IOPS			703K
> mq-deadline sched:	2337K IOPS			2291K
> 
> Final:
> none sched:		2281K IOPS			1101K
> mq-deadline sched:	2322K IOPS			1278K
> 
> * this is relevant as this is the performance in supporting but not
>   enabling the feature
> ** depth=260 is relevant as a point where we are regularly waiting for
>    tags to be available. Figures were a bit unstable here during testing.
> 
> A copy of the patches can be found here:
> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7
> 
> And to progress this series, we need the following to go in first, when ready:
> https://lore.kernel.org/linux-scsi/20200430131904.5847-1-hare@suse.de/

I'd suggest adding options to enable shared tags for null_blk & scsi_debug in V8, so
that it is easier to verify the changes without real hardware.

Thanks,
Ming



* Re: [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset
  2020-06-10 17:29 ` [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset John Garry
@ 2020-06-11  3:37   ` Ming Lei
  2020-06-11 10:09     ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-06-11  3:37 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On Thu, Jun 11, 2020 at 01:29:11AM +0800, John Garry wrote:
> Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
> multiple reply queues with single hostwide tags.
> 
> In addition, these drivers want to use interrupt assignment in
> pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
> CPU hotplug may cause in-flight IO completion to not be serviced when an
> interrupt is shutdown. That problem is solved in commit bf0beec0607d
> ("blk-mq: drain I/O when all CPUs in a hctx are offline").
> 
> However, to take advantage of that blk-mq feature, the HBA HW queues are
> required to be mapped to the blk-mq hctx's; to do that, the HBA HW queues
> need to be exposed to the upper layer.
> 
> In making that transition, the per-SCSI command request tags are no
> longer unique per Scsi host - they are just unique per hctx. As such, the
> HBA LLDD would have to generate this tag internally, which has a certain
> performance overhead.
> 
> However another problem is that blk-mq assumes the host may accept
> (Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
>  core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
> counter was removed, which would stop the LLDD being sent more than
> .can_queue commands; however, it should still be ensured that the block
> layer does not issue more than .can_queue commands to the Scsi host.
> 
> To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
> which may be requested at init time.
> 
> New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
> tagset to indicate whether the shared sbitmap should be used.
> 
> Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
> are still allocated per hctx; the reason for this is that if tags and
> requests were only allocated for a single hctx - like hctx0 - it may break
> block drivers which expect a request be associated with a specific hctx,
> i.e. not always hctx0. This will introduce extra memory usage.
> 
> This change is based on work originally from Ming Lei in [1] and from
> Bart's suggestion in [2].
> 
> [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> [1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
> [2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
> 
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  block/blk-mq-tag.c     | 39 +++++++++++++++++++++++++++++++++++++--
>  block/blk-mq-tag.h     | 10 +++++++++-
>  block/blk-mq.c         | 24 +++++++++++++++++++++++-
>  block/blk-mq.h         |  5 +++++
>  include/linux/blk-mq.h |  6 ++++++
>  5 files changed, 80 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index be39db3c88d7..92843e3e1a2a 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -228,7 +228,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  	 * We can hit rq == NULL here, because the tagging functions
>  	 * test and set the bit before assigning ->rqs[].
>  	 */
> -	if (rq && rq->q == hctx->queue)
> +	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
>  		return iter_data->fn(hctx, rq, iter_data->data, reserved);
>  	return true;
>  }
> @@ -466,6 +466,7 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
>  		     round_robin, node))
>  		goto free_bitmap_tags;
>  
> +	/* We later overwrite these in case of per-set shared sbitmap */
>  	tags->bitmap_tags = &tags->__bitmap_tags;
>  	tags->breserved_tags = &tags->__breserved_tags;

You could skip allocating anything here when blk_mq_is_sbitmap_shared() is
true, with a similar change in blk_mq_free_tags().

>  
> @@ -475,7 +476,32 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
>  	return -ENOMEM;
>  }
>  
> -struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
> +bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set)
> +{
> +	unsigned int depth = tag_set->queue_depth - tag_set->reserved_tags;
> +	int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(tag_set->flags);
> +	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
> +	int node = tag_set->numa_node;
> +
> +	if (bt_alloc(&tag_set->__bitmap_tags, depth, round_robin, node))
> +		return false;
> +	if (bt_alloc(&tag_set->__breserved_tags, tag_set->reserved_tags,
> +		     round_robin, node))
> +		goto free_bitmap_tags;
> +	return true;
> +free_bitmap_tags:
> +	sbitmap_queue_free(&tag_set->__bitmap_tags);
> +	return false;
> +}
> +
> +void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *tag_set)
> +{
> +	sbitmap_queue_free(&tag_set->__bitmap_tags);
> +	sbitmap_queue_free(&tag_set->__breserved_tags);
> +}
> +
> +struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
> +				     unsigned int total_tags,
>  				     unsigned int reserved_tags,
>  				     int node, int alloc_policy)
>  {
> @@ -502,6 +528,10 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
>  
>  void blk_mq_free_tags(struct blk_mq_tags *tags)
>  {
> +	/*
> +	 * Do not free tags->{bitmap, breserved}_tags, as this may point to
> +	 * shared sbitmap
> +	 */
>  	sbitmap_queue_free(&tags->__bitmap_tags);
>  	sbitmap_queue_free(&tags->__breserved_tags);
>  	kfree(tags);
> @@ -560,6 +590,11 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>  	return 0;
>  }
>  
> +void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set, unsigned int size)
> +{
> +	sbitmap_queue_resize(&set->__bitmap_tags, size - set->reserved_tags);
> +}
> +
>  /**
>   * blk_mq_unique_tag() - return a tag that is unique queue-wide
>   * @rq: request for which to compute a unique tag
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index cebf7a4b280a..cf39dd13a24d 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -25,7 +25,12 @@ struct blk_mq_tags {
>  };
>  
>  
> -extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
> +extern bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set);
> +extern void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *tag_set);
> +extern struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *tag_set,
> +					    unsigned int nr_tags,
> +					    unsigned int reserved_tags,
> +					    int node, int alloc_policy);
>  extern void blk_mq_free_tags(struct blk_mq_tags *tags);
>  
>  extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
> @@ -34,6 +39,9 @@ extern void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
>  extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>  					struct blk_mq_tags **tags,
>  					unsigned int depth, bool can_grow);
> +extern void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set,
> +					     unsigned int size);
> +
>  extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
>  void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>  		void *priv);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 90b645c3092c..77120dd4e4d5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2229,7 +2229,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
>  	if (node == NUMA_NO_NODE)
>  		node = set->numa_node;
>  
> -	tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
> +	tags = blk_mq_init_tags(set, nr_tags, reserved_tags, node,
>  				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
>  	if (!tags)
>  		return NULL;
> @@ -3349,11 +3349,28 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>  	if (ret)
>  		goto out_free_mq_map;
>  
> +	if (blk_mq_is_sbitmap_shared(set)) {
> +		if (!blk_mq_init_shared_sbitmap(set)) {
> +			ret = -ENOMEM;
> +			goto out_free_mq_rq_maps;
> +		}
> +
> +		for (i = 0; i < set->nr_hw_queues; i++) {
> +			struct blk_mq_tags *tags = set->tags[i];
> +
> +			tags->bitmap_tags = &set->__bitmap_tags;
> +			tags->breserved_tags = &set->__breserved_tags;
> +		}

I am wondering why you don't put ->[bitmap|breserved]_tags initialization into
blk_mq_init_shared_sbitmap().
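
A minimal userspace sketch of that suggestion, assuming simplified stand-ins
for the kernel structures. The struct layouts, bt_alloc_stub() and
init_shared_sbitmap_sketch() are hypothetical and only model the control flow;
the real bt_alloc() can fail and the real code must free the first bitmap if
the second allocation fails.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; not the real definitions. */
struct sbitmap_queue { unsigned int depth; };

struct blk_mq_tags {
	struct sbitmap_queue *bitmap_tags;
	struct sbitmap_queue *breserved_tags;
};

struct blk_mq_tag_set {
	unsigned int nr_hw_queues;
	unsigned int queue_depth;
	unsigned int reserved_tags;
	struct blk_mq_tags **tags;
	struct sbitmap_queue __bitmap_tags;
	struct sbitmap_queue __breserved_tags;
};

/* Hypothetical allocator stub: always succeeds and records the depth. */
static int bt_alloc_stub(struct sbitmap_queue *bt, unsigned int depth)
{
	bt->depth = depth;
	return 0;
}

/*
 * Allocate the set-wide bitmaps and point every per-hctx tags structure
 * at them in one place, so the caller no longer needs its own loop.
 */
static bool init_shared_sbitmap_sketch(struct blk_mq_tag_set *set)
{
	unsigned int depth = set->queue_depth - set->reserved_tags;
	unsigned int i;

	if (bt_alloc_stub(&set->__bitmap_tags, depth))
		return false;
	if (bt_alloc_stub(&set->__breserved_tags, set->reserved_tags))
		return false;	/* real code would free __bitmap_tags here */

	for (i = 0; i < set->nr_hw_queues; i++) {
		set->tags[i]->bitmap_tags = &set->__bitmap_tags;
		set->tags[i]->breserved_tags = &set->__breserved_tags;
	}
	return true;
}
```

With this shape, blk_mq_alloc_tag_set() would only need to check the return
value and jump to its error label.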


Thanks, 
Ming



* Re: [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap
  2020-06-10 17:29 ` [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap John Garry
@ 2020-06-11  4:04   ` Ming Lei
  2020-06-11 10:22     ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-06-11  4:04 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On Thu, Jun 11, 2020 at 01:29:12AM +0800, John Garry wrote:
> The per-hctx nr_active value can no longer be used to fairly assign a share
> of tag depth per request queue for when using a shared sbitmap, as it does
> not consider that the tags are shared tags over all hctx's.
> 
> For this case, record the nr_active_requests per request_queue, and make
> the judgment based on that value.
> 
> Also introduce a debugfs version of per-hctx blk_mq_debugfs_attr, omitting
> hctx_active_show() (as blk_mq_hw_ctx.nr_active is no longer maintained for
> the case of shared sbitmap) and other entries which we can add which would
> be revised specifically for when using a shared sbitmap.
> 
> Co-developed-with: Kashyap Desai <kashyap.desai@broadcom.com>
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  block/blk-core.c       |  2 ++
>  block/blk-mq-debugfs.c | 23 ++++++++++++++++++++++-
>  block/blk-mq-tag.c     | 10 ++++++----
>  block/blk-mq.c         |  6 +++---
>  block/blk-mq.h         | 28 +++++++++++++++++++++++++++-
>  include/linux/blkdev.h |  2 ++
>  6 files changed, 62 insertions(+), 9 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 03252af8c82c..c622453c1363 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -529,6 +529,8 @@ struct request_queue *__blk_alloc_queue(int node_id)
>  	q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
>  	q->node = node_id;
>  
> +	atomic_set(&q->nr_active_requests_shared_sbitmap, 0);
> +
>  	timer_setup(&q->backing_dev_info->laptop_mode_wb_timer,
>  		    laptop_mode_timer_fn, 0);
>  	timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index a400b6698dff..0fa3af41ab65 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -796,6 +796,23 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
>  	{},
>  };
>  
> +static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs[] = {
> +	{"state", 0400, hctx_state_show},
> +	{"flags", 0400, hctx_flags_show},
> +	{"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
> +	{"busy", 0400, hctx_busy_show},
> +	{"ctx_map", 0400, hctx_ctx_map_show},
> +	{"sched_tags", 0400, hctx_sched_tags_show},
> +	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
> +	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
> +	{"dispatched", 0600, hctx_dispatched_show, hctx_dispatched_write},
> +	{"queued", 0600, hctx_queued_show, hctx_queued_write},
> +	{"run", 0600, hctx_run_show, hctx_run_write},
> +	{"active", 0400, hctx_active_show},
> +	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
> +	{}
> +};

You may use a macro or whatever to avoid the duplication.
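
One possible shape for this, as a userspace sketch only. The struct
debugfs_attr_sketch type, the HCTX_COMMON_ATTRS macro and count_attrs() are
hypothetical names, and the entries carry just a name and mode rather than the
real show/write callbacks. The idea is that the shared entries live in one
macro and each table appends only what differs.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down attribute entry: name and mode only. */
struct debugfs_attr_sketch {
	const char *name;
	unsigned int mode;
};

/* Entries common to both tables are listed exactly once. */
#define HCTX_COMMON_ATTRS	\
	{"state", 0400},	\
	{"flags", 0400},	\
	{"dispatch", 0400},	\
	{"busy", 0400},		\
	{"ctx_map", 0400}

static const struct debugfs_attr_sketch hctx_attrs[] = {
	HCTX_COMMON_ATTRS,
	{"active", 0400},	/* only meaningful with per-hctx accounting */
	{NULL, 0}
};

static const struct debugfs_attr_sketch hctx_shared_sbitmap_attrs[] = {
	HCTX_COMMON_ATTRS,
	{NULL, 0}
};

/* Count entries up to the NULL-name terminator. */
static size_t count_attrs(const struct debugfs_attr_sketch *attrs)
{
	size_t n = 0;

	while (attrs[n].name)
		n++;
	return n;
}
```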

> +
>  static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
>  	{"default_rq_list", 0400, .seq_ops = &ctx_default_rq_list_seq_ops},
>  	{"read_rq_list", 0400, .seq_ops = &ctx_read_rq_list_seq_ops},
> @@ -878,13 +895,17 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
>  				  struct blk_mq_hw_ctx *hctx)
>  {
>  	struct blk_mq_ctx *ctx;
> +	struct blk_mq_tag_set *set = q->tag_set;
>  	char name[20];
>  	int i;
>  
>  	snprintf(name, sizeof(name), "hctx%u", hctx->queue_num);
>  	hctx->debugfs_dir = debugfs_create_dir(name, q->debugfs_dir);
>  
> -	debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs);
> +	if (blk_mq_is_sbitmap_shared(set))
> +		debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_shared_sbitmap_attrs);
> +	else
> +		debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs);
>  
>  	hctx_for_each_ctx(hctx, ctx, i)
>  		blk_mq_debugfs_register_ctx(hctx, ctx);
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 92843e3e1a2a..7db16e49f6f6 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -60,9 +60,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>   * For shared tag users, we track the number of currently active users
>   * and attempt to provide a fair share of the tag depth for each of them.
>   */
> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> +static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
>  				  struct sbitmap_queue *bt)
>  {
> +	struct blk_mq_hw_ctx *hctx = data->hctx;
> +	struct request_queue *q = data->q;
>  	unsigned int depth, users;
>  
>  	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
> @@ -84,15 +86,15 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
>  	 * Allow at least some tags
>  	 */
>  	depth = max((bt->sb.depth + users - 1) / users, 4U);
> -	return atomic_read(&hctx->nr_active) < depth;
> +	return __blk_mq_active_requests(hctx, q) < depth;

There is a big change to 'users' too:

	users = atomic_read(&hctx->tags->active_queues);

Originally there was a single hctx->tags for these HBAs; now there are many
hctx->tags, so 'users' may become much smaller than before.

Maybe '->active_queues' can be moved to tag_set for blk_mq_is_sbitmap_shared().
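
To illustrate why the value of 'users' matters, here is a small standalone
sketch of the depth calculation quoted above. fair_share_depth() is a
hypothetical helper, not a kernel function; returning the full depth for
users == 0 is an assumption that mirrors the "no limit" early return in
hctx_may_queue().

```c
/*
 * Per-user share of the tag depth: ceil(total_depth / users), with a
 * floor of 4 so that every user can make some progress.
 */
static unsigned int fair_share_depth(unsigned int total_depth,
				     unsigned int users)
{
	unsigned int share;

	if (users == 0)
		return total_depth;	/* assumption: no active users, no limit */

	share = (total_depth + users - 1) / users;
	return share > 4U ? share : 4U;
}
```

With a depth of 4000 and 12 active queues the limit is 334 per queue, but if
a given hctx->tags only ever counts one active queue the limit stays at 4000,
so no throttling takes place.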

>  }
>  
>  static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
>  			    struct sbitmap_queue *bt)
>  {
>  	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
> -	    !hctx_may_queue(data->hctx, bt))
> -		return BLK_MQ_NO_TAG;
> +	    !hctx_may_queue(data, bt))
> +		return -1;

BLK_MQ_NO_TAG should have been returned.

>  	if (data->shallow_depth)
>  		return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
>  	else
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 77120dd4e4d5..0f7e062a1665 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -283,7 +283,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>  	} else {
>  		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
>  			rq_flags = RQF_MQ_INFLIGHT;
> -			atomic_inc(&data->hctx->nr_active);
> +			__blk_mq_inc_active_requests(data->hctx, data->q);
>  		}
>  		rq->tag = tag;
>  		rq->internal_tag = BLK_MQ_NO_TAG;
> @@ -527,7 +527,7 @@ void blk_mq_free_request(struct request *rq)
>  
>  	ctx->rq_completed[rq_is_sync(rq)]++;
>  	if (rq->rq_flags & RQF_MQ_INFLIGHT)
> -		atomic_dec(&hctx->nr_active);
> +		__blk_mq_dec_active_requests(hctx, q);
>  
>  	if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
>  		laptop_io_completion(q->backing_dev_info);
> @@ -1073,7 +1073,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
>  	if (rq->tag >= 0) {
>  		if (shared) {
>  			rq->rq_flags |= RQF_MQ_INFLIGHT;
> -			atomic_inc(&data.hctx->nr_active);
> +			__blk_mq_inc_active_requests(rq->mq_hctx, rq->q);
>  		}
>  		data.hctx->tags->rqs[rq->tag] = rq;
>  	}
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 1a283c707215..9c1e612c2298 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -202,6 +202,32 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
>  	return true;
>  }
>  
> +static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx,
> +						struct request_queue *q)
> +{
> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
> +		atomic_inc(&q->nr_active_requests_shared_sbitmap);
> +	else
> +		atomic_inc(&hctx->nr_active);
> +}
> +
> +static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx,
> +						struct request_queue *q)
> +{
> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
> +		atomic_dec(&q->nr_active_requests_shared_sbitmap);
> +	else
> +		atomic_dec(&hctx->nr_active);
> +}
> +
> +static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx,
> +					   struct request_queue *q)
> +{
> +	if (blk_mq_is_sbitmap_shared(q->tag_set))

I'd suggest adding an hctx version of blk_mq_is_sbitmap_shared(), since
q->tag_set is seldom used in the fast path, and hctx->flags is more
efficient than tag_set->flags.
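
A rough sketch of what such a helper could look like, assuming hctx->flags
mirrors the tag_set flags. The struct names, the flag name
SKETCH_F_TAG_HCTX_SHARED with its bit position, and both helper names are
hypothetical stand-ins for illustration, not the kernel definitions.

```c
#include <stdbool.h>

/* Assumed flag bit, for illustration only. */
#define SKETCH_F_TAG_HCTX_SHARED	(1U << 3)

/* Trimmed-down stand-ins: only the flags field matters here. */
struct tag_set_sketch { unsigned int flags; };
struct hw_ctx_sketch { unsigned int flags; };

/* Existing style of check: goes through the tag_set. */
static inline bool sketch_is_sbitmap_shared(const struct tag_set_sketch *set)
{
	return set->flags & SKETCH_F_TAG_HCTX_SHARED;
}

/*
 * Suggested variant: test the copy of the flags held in the hctx, which
 * the fast path already has at hand, avoiding the q->tag_set dereference.
 */
static inline bool sketch_hctx_is_sbitmap_shared(const struct hw_ctx_sketch *hctx)
{
	return hctx->flags & SKETCH_F_TAG_HCTX_SHARED;
}
```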


Thanks, 
Ming



* Re: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-11  2:57   ` Ming Lei
@ 2020-06-11  8:26     ` John Garry
  2020-06-23 11:25       ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-06-11  8:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, Hannes Reinecke

On 11/06/2020 03:57, Ming Lei wrote:
> On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
>> From: Hannes Reinecke <hare@suse.de>
>>
>> The function does not set the depth, but rather transitions from
>> shared to non-shared queues and vice versa.
>> So rename it to blk_mq_update_tag_set_shared() to better reflect
>> its purpose.
> 
> It is fine to rename it for me, however:
> 
> This patch claims to rename blk_mq_update_tag_set_shared(), but also
> change blk_mq_init_bitmap_tags's signature.

I was going to update the commit message here, but forgot again...

> 
> So suggest to split this patch into two or add comment log on changing
> blk_mq_init_bitmap_tags().

I think I'll just split into 2x commits.

Thanks,
John


* Re: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-11  3:07 ` [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs Ming Lei
@ 2020-06-11  9:35   ` John Garry
  2020-06-12 18:47     ` Kashyap Desai
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-06-11  9:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 11/06/2020 04:07, Ming Lei wrote:
>> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
>> but it is not always an appropriate scheduler to use.
>>
>> Tag depth 		4000 (default)			260**
>>
>> Baseline:
>> none sched:		2290K IOPS			894K
>> mq-deadline sched:	2341K IOPS			2313K
>>
>> Final, host_tagset=0 in LLDD*
>> none sched:		2289K IOPS			703K
>> mq-deadline sched:	2337K IOPS			2291K
>>
>> Final:
>> none sched:		2281K IOPS			1101K
>> mq-deadline sched:	2322K IOPS			1278K
>>
>> * this is relevant as this is the performance in supporting but not
>>    enabling the feature
>> ** depth=260 is relevant as a point where we are regularly waiting for
>>     tags to be available. Figures were a bit unstable here for testing.
>>
>> A copy of the patches can be found here:
>> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7
>>
>> And to progress this series, we need the following to go in first, when ready:
>> https://lore.kernel.org/linux-scsi/20200430131904.5847-1-hare@suse.de/
> I'd suggest to add options to enable shared tags for null_blk & scsi_debug in V8, so
> that it is easier to verify the changes without real hardware.
> 

ok, fine, I can look at including that. To stop the series getting too 
large, I might spin off the early patches, which are not strictly related.

Thanks,
John


* Re: [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset
  2020-06-11  3:37   ` Ming Lei
@ 2020-06-11 10:09     ` John Garry
  0 siblings, 0 replies; 123+ messages in thread
From: John Garry @ 2020-06-11 10:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 11/06/2020 04:37, Ming Lei wrote:

Hi Ming,

Thanks for checking this.

>> bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>   	 * We can hit rq == NULL here, because the tagging functions
>>   	 * test and set the bit before assigning ->rqs[].
>>   	 */
>> -	if (rq && rq->q == hctx->queue)
>> +	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
>>   		return iter_data->fn(hctx, rq, iter_data->data, reserved);
>>   	return true;
>>   }
>> @@ -466,6 +466,7 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
>>   		     round_robin, node))
>>   		goto free_bitmap_tags;
>>   
>> +	/* We later overwrite these in case of per-set shared sbitmap */
>>   	tags->bitmap_tags = &tags->__bitmap_tags;
>>   	tags->breserved_tags = &tags->__breserved_tags;
> You may skip to allocate anything for blk_mq_is_sbitmap_shared(), and
> similar change for blk_mq_free_tags().

I did try that, but it breaks scheduler tag allocation, since this is common
code. Maybe I can pass some flag to avoid the allocation for the case of a
shared sbitmap and !sched tags. Same for the free path.

BTW, if you check patch 7/12, I mentioned that we could use this sbitmap
for iterating to get the per-hctx bitmap, instead of allocating a temp
sbitmap. Maybe that's better.

> 
>>   
>> @@ -475,7 +476,32 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
>>   	return -ENOMEM;
>>   }
>>   
>> -struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
>> +bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set)
>> +{
>> +	unsigned int depth = tag_set->queue_depth - tag_set->reserved_tags;
>> +	int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(tag_set->flags);
>> +	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
>> +	int node = tag_set->numa_node;
>> +
>> +	if (bt_alloc(&tag_set->__bitmap_tags, depth, round_robin, node))
>> +		return false;
>> +	if (bt_alloc(&tag_set->__breserved_tags, tag_set->reserved_tags,
>> +		     round_robin, node))
>> +		goto free_bitmap_tags;
>> +	return true;
>> +free_bitmap_tags:
>> +	sbitmap_queue_free(&tag_set->__bitmap_tags);
>> +	return false;
>> +}
>> +

[...]

>> index 90b645c3092c..77120dd4e4d5 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2229,7 +2229,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
>>   	if (node == NUMA_NO_NODE)
>>   		node = set->numa_node;
>>   
>> -	tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
>> +	tags = blk_mq_init_tags(set, nr_tags, reserved_tags, node,
>>   				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
>>   	if (!tags)
>>   		return NULL;
>> @@ -3349,11 +3349,28 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>>   	if (ret)
>>   		goto out_free_mq_map;
>>   
>> +	if (blk_mq_is_sbitmap_shared(set)) {
>> +		if (!blk_mq_init_shared_sbitmap(set)) {
>> +			ret = -ENOMEM;
>> +			goto out_free_mq_rq_maps;
>> +		}
>> +
>> +		for (i = 0; i < set->nr_hw_queues; i++) {
>> +			struct blk_mq_tags *tags = set->tags[i];
>> +
>> +			tags->bitmap_tags = &set->__bitmap_tags;
>> +			tags->breserved_tags = &set->__breserved_tags;
>> +		}
> I am wondering why you don't put ->[bitmap|breserved]_tags initialization into
> blk_mq_init_shared_sbitmap().

I suppose I could.

Thanks,
John


* Re: [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap
  2020-06-11  4:04   ` Ming Lei
@ 2020-06-11 10:22     ` John Garry
  0 siblings, 0 replies; 123+ messages in thread
From: John Garry @ 2020-06-11 10:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

>> +static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs[] = {
>> +	{"state", 0400, hctx_state_show},
>> +	{"flags", 0400, hctx_flags_show},
>> +	{"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
>> +	{"busy", 0400, hctx_busy_show},
>> +	{"ctx_map", 0400, hctx_ctx_map_show},
>> +	{"sched_tags", 0400, hctx_sched_tags_show},
>> +	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
>> +	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
>> +	{"dispatched", 0600, hctx_dispatched_show, hctx_dispatched_write},
>> +	{"queued", 0600, hctx_queued_show, hctx_queued_write},
>> +	{"run", 0600, hctx_run_show, hctx_run_write},
>> +	{"active", 0400, hctx_active_show},
>> +	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
>> +	{}
>> +};
> 
> You may use macro or whatever to avoid so the duplication.

Let me check alternatives.

> 
>> +
>>   static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
>>   	{"default_rq_list", 0400, .seq_ops = &ctx_default_rq_list_seq_ops},
>>   	{"read_rq_list", 0400, .seq_ops = &ctx_read_rq_list_seq_ops},
>> @@ -878,13 +895,17 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
>>   				  struct blk_mq_hw_ctx *hctx)
>>   {
>>   	struct blk_mq_ctx *ctx;
>> +	struct blk_mq_tag_set *set = q->tag_set;
>>   	char name[20];
>>   	int i;
>>   
>>   	snprintf(name, sizeof(name), "hctx%u", hctx->queue_num);
>>   	hctx->debugfs_dir = debugfs_create_dir(name, q->debugfs_dir);
>>   
>> -	debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs);
>> +	if (blk_mq_is_sbitmap_shared(set))
>> +		debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_shared_sbitmap_attrs);
>> +	else
>> +		debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs);
>>   
>>   	hctx_for_each_ctx(hctx, ctx, i)
>>   		blk_mq_debugfs_register_ctx(hctx, ctx);
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 92843e3e1a2a..7db16e49f6f6 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -60,9 +60,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>>    * For shared tag users, we track the number of currently active users
>>    * and attempt to provide a fair share of the tag depth for each of them.
>>    */
>> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
>> +static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
>>   				  struct sbitmap_queue *bt)
>>   {
>> +	struct blk_mq_hw_ctx *hctx = data->hctx;
>> +	struct request_queue *q = data->q;
>>   	unsigned int depth, users;
>>   
>>   	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
>> @@ -84,15 +86,15 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
>>   	 * Allow at least some tags
>>   	 */
>>   	depth = max((bt->sb.depth + users - 1) / users, 4U);
>> -	return atomic_read(&hctx->nr_active) < depth;
>> +	return __blk_mq_active_requests(hctx, q) < depth;
> 
> There is big change on 'users' too:
> 
> 	users = atomic_read(&hctx->tags->active_queues);
> 
> Originally there is single hctx->tags for these HBAs, now there are many
> hctx->tags, so 'users' may become much smaller than before.

Can you please check how I handled that in the next patch? There we 
record the number of active request queues per set.

(I will note that I could have combined some of these patches, but I
liked the piecemeal approach, and none of these paths are enabled until
later).

> 
> Maybe '->active_queues' can be moved to tag_set for blk_mq_is_sbitmap_shared().
> 
>>   }
>>   
>>   static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
>>   			    struct sbitmap_queue *bt)
>>   {
>>   	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
>> -	    !hctx_may_queue(data->hctx, bt))
>> -		return BLK_MQ_NO_TAG;
>> +	    !hctx_may_queue(data, bt))
>> +		return -1;
> 
> BLK_MQ_NO_TAG should have been returned.

OK, I missed that in the rebase.

> 
>>   	if (data->shallow_depth)
>>   		return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
>>   	else
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 77120dd4e4d5..0f7e062a1665 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -283,7 +283,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>>   	} else {
>>   		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
>>   			rq_flags = RQF_MQ_INFLIGHT;
>> -			atomic_inc(&data->hctx->nr_active);
>> +			__blk_mq_inc_active_requests(data->hctx, data->q);
>>   		}
>>   		rq->tag = tag;
>>   		rq->internal_tag = BLK_MQ_NO_TAG;
>> @@ -527,7 +527,7 @@ void blk_mq_free_request(struct request *rq)
>>   
>>   	ctx->rq_completed[rq_is_sync(rq)]++;
>>   	if (rq->rq_flags & RQF_MQ_INFLIGHT)
>> -		atomic_dec(&hctx->nr_active);
>> +		__blk_mq_dec_active_requests(hctx, q);
>>   
>>   	if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
>>   		laptop_io_completion(q->backing_dev_info);
>> @@ -1073,7 +1073,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
>>   	if (rq->tag >= 0) {
>>   		if (shared) {
>>   			rq->rq_flags |= RQF_MQ_INFLIGHT;
>> -			atomic_inc(&data.hctx->nr_active);
>> +			__blk_mq_inc_active_requests(rq->mq_hctx, rq->q);
>>   		}
>>   		data.hctx->tags->rqs[rq->tag] = rq;
>>   	}
>> diff --git a/block/blk-mq.h b/block/blk-mq.h
>> index 1a283c707215..9c1e612c2298 100644
>> --- a/block/blk-mq.h
>> +++ b/block/blk-mq.h
>> @@ -202,6 +202,32 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
>>   	return true;
>>   }
>>   
>> +static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx,
>> +						struct request_queue *q)
>> +{
>> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
>> +		atomic_inc(&q->nr_active_requests_shared_sbitmap);
>> +	else
>> +		atomic_inc(&hctx->nr_active);
>> +}
>> +
>> +static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx,
>> +						struct request_queue *q)
>> +{
>> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
>> +		atomic_dec(&q->nr_active_requests_shared_sbitmap);
>> +	else
>> +		atomic_dec(&hctx->nr_active);
>> +}
>> +
>> +static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx,
>> +					   struct request_queue *q)
>> +{
>> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
> 
> I'd suggest to add one hctx version of blk_mq_is_sbitmap_shared() since
> q->tag_set is seldom used in fast path, and hctx->flags is more
> efficient than tag_set->flags.

OK

Thanks,
John


* Re: [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap
  2020-06-10 17:29 ` [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set " John Garry
@ 2020-06-11 13:16   ` Hannes Reinecke
  2020-06-11 14:22     ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Hannes Reinecke @ 2020-06-11 13:16 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	kashyap.desai, sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 6/10/20 7:29 PM, John Garry wrote:
> For when using a shared sbitmap, no longer should the number of active
> request queues per hctx be relied on for when judging how to share the tag
> bitmap.
> 
> Instead maintain the number of active request queues per tag_set, and make
> the judgment based on that.
> 
> And since the blk_mq_tags.active_queues is no longer maintained, do not
> show it in debugfs.
> 
> Originally-from: Kashyap Desai <kashyap.desai@broadcom.com>
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>   block/blk-mq-debugfs.c | 25 ++++++++++++++++++++--
>   block/blk-mq-tag.c     | 47 ++++++++++++++++++++++++++++++++----------
>   block/blk-mq.c         |  2 ++
>   include/linux/blk-mq.h |  1 +
>   include/linux/blkdev.h |  1 +
>   5 files changed, 63 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 0fa3af41ab65..05b4be0c03d9 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -458,17 +458,37 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
>   	}
>   }
>   
> +static void blk_mq_debugfs_tags_shared_sbitmap_show(struct seq_file *m,
> +				     struct blk_mq_tags *tags)
> +{
> +	seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
> +	seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
> +
> +	seq_puts(m, "\nbitmap_tags:\n");
> +	sbitmap_queue_show(tags->bitmap_tags, m);
> +
> +	if (tags->nr_reserved_tags) {
> +		seq_puts(m, "\nbreserved_tags:\n");
> +		sbitmap_queue_show(tags->breserved_tags, m);
> +	}
> +}
> +
>   static int hctx_tags_show(void *data, struct seq_file *m)
>   {
>   	struct blk_mq_hw_ctx *hctx = data;
>   	struct request_queue *q = hctx->queue;
> +	struct blk_mq_tag_set *set = q->tag_set;
>   	int res;
>   
>   	res = mutex_lock_interruptible(&q->sysfs_lock);
>   	if (res)
>   		goto out;
> -	if (hctx->tags)
> -		blk_mq_debugfs_tags_show(m, hctx->tags);
> +	if (hctx->tags) {
> +		if (blk_mq_is_sbitmap_shared(set))
> +			blk_mq_debugfs_tags_shared_sbitmap_show(m, hctx->tags);
> +		else
> +			blk_mq_debugfs_tags_show(m, hctx->tags);
> +	}
>   	mutex_unlock(&q->sysfs_lock);
>   
>   out:
> @@ -802,6 +822,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs
>   	{"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
>   	{"busy", 0400, hctx_busy_show},
>   	{"ctx_map", 0400, hctx_ctx_map_show},
> +	{"tags", 0400, hctx_tags_show},
>   	{"sched_tags", 0400, hctx_sched_tags_show},
>   	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
>   	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},

I had been pondering this, too, when creating v6. The problem is that it'll
show the tags per hctx, but as they are shared I guess the list looks
pretty much identical per hctx.
So I guess we should filter the tags per hctx to have only those active
on that hctx displayed. But when doing so we can only print the
in-flight tags; the others are not assigned to a hctx and as such we
can't make a decision on which hctx they'll end up on.
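
The filtering idea can be modelled in a few lines of standalone C. This is a
sketch only: struct inflight_rq_sketch and tags_for_hctx() are hypothetical,
the in-flight requests are given as a flat array rather than found by a busy
iterator, and the bitmap is limited to 64 tags for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one in-flight request. */
struct inflight_rq_sketch {
	unsigned int tag;	/* tag taken from the shared bitmap */
	unsigned int hctx_idx;	/* hardware queue the request was issued on */
};

/*
 * Build a per-hctx view of a shared tag space: set a bit for each
 * in-flight tag that belongs to hctx_idx. Free tags never appear,
 * because they are not associated with any hctx.
 */
static uint64_t tags_for_hctx(const struct inflight_rq_sketch *rqs, size_t n,
			      unsigned int hctx_idx)
{
	uint64_t mask = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (rqs[i].hctx_idx == hctx_idx && rqs[i].tag < 64)
			mask |= UINT64_C(1) << rqs[i].tag;
	}
	return mask;
}
```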

> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 7db16e49f6f6..6ca06b1c3a99 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -23,9 +23,19 @@
>    */
>   bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>   {
> -	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
> -	    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -		atomic_inc(&hctx->tags->active_queues);
> +	struct request_queue *q = hctx->queue;
> +	struct blk_mq_tag_set *set = q->tag_set;
> +
> +	if (blk_mq_is_sbitmap_shared(set)){
> +		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) &&
> +		    !test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
> +			atomic_inc(&set->active_queues_shared_sbitmap);
> +
> +	} else {
> +		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
> +		    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +			atomic_inc(&hctx->tags->active_queues);
> +	}
>   
>   	return true;
>   }
At some point someone will need to educate me on what this double
'test_bit' and 'test_and_set_bit' is supposed to achieve.
Other than deliberately injecting a race condition ...
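
One common reading of this idiom is that the plain test is a cheap read-only
fast path: when the bit is already set, which is the usual case, no atomic
read-modify-write is issued and the cache line is not pulled exclusive. The
atomic test-and-set that follows then settles any race, since only one caller
can observe the bit as previously clear. A userspace sketch with C11 atomics,
where mark_active_once() is a hypothetical name and not a kernel function:

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Set 'bit' in *state and bump *active_count exactly once, however many
 * callers race. Returns true only for the caller that did the transition.
 */
static bool mark_active_once(atomic_ulong *state, unsigned int bit,
			     atomic_int *active_count)
{
	unsigned long mask = 1UL << bit;

	/* Fast path: plain load, no write to the shared cache line. */
	if (atomic_load_explicit(state, memory_order_relaxed) & mask)
		return false;

	/* Slow path: atomic RMW; the old value tells us who won the race. */
	if (atomic_fetch_or(state, mask) & mask)
		return false;

	atomic_fetch_add(active_count, 1);
	return true;
}
```

Two callers may both pass the first check, but only one of them sees the bit
clear in the value returned by the fetch-or, so the counter is incremented
once.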

> @@ -47,11 +57,19 @@ void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
>   void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>   {
>   	struct blk_mq_tags *tags = hctx->tags;
> -
> -	if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -		return;
> -
> -	atomic_dec(&tags->active_queues);
> +	struct request_queue *q = hctx->queue;
> +	struct blk_mq_tag_set *set = q->tag_set;
> +
> +	if (blk_mq_is_sbitmap_shared(q->tag_set)){
> +		if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE,
> +					&q->queue_flags))
> +			return;
> +		atomic_dec(&set->active_queues_shared_sbitmap);
> +	} else {
> +		if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +			return;
> +		atomic_dec(&tags->active_queues);
> +	}
>   
>   	blk_mq_tag_wakeup_all(tags, false);
>   }
> @@ -65,12 +83,11 @@ static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
>   {
>   	struct blk_mq_hw_ctx *hctx = data->hctx;
>   	struct request_queue *q = data->q;
> +	struct blk_mq_tag_set *set = q->tag_set;
>   	unsigned int depth, users;
>   
>   	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
>   		return true;
> -	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -		return true;
>   
>   	/*
>   	 * Don't try dividing an ant
> @@ -78,7 +95,15 @@ static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
>   	if (bt->sb.depth == 1)
>   		return true;
>   
> -	users = atomic_read(&hctx->tags->active_queues);
> +	if (blk_mq_is_sbitmap_shared(q->tag_set)) {
> +		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &q->queue_flags))
> +			return true;
> +		users = atomic_read(&set->active_queues_shared_sbitmap);
> +	} else {
> +		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +			return true;
> +		users = atomic_read(&hctx->tags->active_queues);
> +	}
>   	if (!users)
>   		return true;
>   
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0f7e062a1665..f73a2f9c58bd 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3350,6 +3350,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>   		goto out_free_mq_map;
>   
>   	if (blk_mq_is_sbitmap_shared(set)) {
> +		atomic_set(&set->active_queues_shared_sbitmap, 0);
> +
>   		if (!blk_mq_init_shared_sbitmap(set)) {
>   			ret = -ENOMEM;
>   			goto out_free_mq_rq_maps;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 7b31cdb92a71..66711c7234db 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -252,6 +252,7 @@ struct blk_mq_tag_set {
>   	unsigned int		timeout;
>   	unsigned int		flags;
>   	void			*driver_data;
> +	atomic_t		active_queues_shared_sbitmap;
>   
>   	struct sbitmap_queue	__bitmap_tags;
>   	struct sbitmap_queue	__breserved_tags;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index c536278bec9e..1b0087e8d01a 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -619,6 +619,7 @@ struct request_queue {
>   #define QUEUE_FLAG_PCI_P2PDMA	25	/* device supports PCI p2p requests */
>   #define QUEUE_FLAG_ZONE_RESETALL 26	/* supports Zone Reset All */
>   #define QUEUE_FLAG_RQ_ALLOC_TIME 27	/* record rq->alloc_time_ns */
> +#define QUEUE_FLAG_HCTX_ACTIVE 28	/* at least one blk-mq hctx is active */
>   
>   #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
>   				 (1 << QUEUE_FLAG_SAME_COMP))
> 
Other than that it looks fine.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  2020-06-10 17:29 ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a " John Garry
@ 2020-06-11 13:19   ` Hannes Reinecke
  2020-06-11 14:33     ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Hannes Reinecke @ 2020-06-11 13:19 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	kashyap.desai, sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 6/10/20 7:29 PM, John Garry wrote:
> Since a set-wide shared tag sbitmap may be used, it is no longer valid to
> examine the per-hctx tagset for getting the active requests for a hctx
> (when a shared sbitmap is used).
> 
> As such, add support for the shared sbitmap by using an intermediate
> sbitmap per hctx, iterating all active tags for the specific hctx in the
> shared sbitmap.
> 
> Originally-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de> #earlier version
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>   block/blk-mq-debugfs.c | 62 ++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 62 insertions(+)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 05b4be0c03d9..4da7e54adf3b 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -495,6 +495,67 @@ static int hctx_tags_show(void *data, struct seq_file *m)
>   	return res;
>   }
>   
> +struct hctx_sb_data {
> +	struct sbitmap		*sb;	/* output bitmap */
> +	struct blk_mq_hw_ctx	*hctx;	/* input hctx */
> +};
> +
> +static bool hctx_filter_fn(struct blk_mq_hw_ctx *hctx, struct request *req,
> +			   void *priv, bool reserved)
> +{
> +	struct hctx_sb_data *hctx_sb_data = priv;
> +
> +	if (hctx == hctx_sb_data->hctx)
> +		sbitmap_set_bit(hctx_sb_data->sb, req->tag);
> +
> +	return true;
> +}
> +
> +static void hctx_filter_sb(struct sbitmap *sb, struct blk_mq_hw_ctx *hctx)
> +{
> +	struct hctx_sb_data hctx_sb_data = { .sb = sb, .hctx = hctx };
> +
> +	blk_mq_queue_tag_busy_iter(hctx->queue, hctx_filter_fn, &hctx_sb_data);
> +}
> +
> +static int hctx_tags_shared_sbitmap_bitmap_show(void *data, struct seq_file *m)
> +{
> +	struct blk_mq_hw_ctx *hctx = data;
> +	struct request_queue *q = hctx->queue;
> +	struct blk_mq_tag_set *set = q->tag_set;
> +	struct sbitmap shared_sb, *sb;
> +	int res;
> +
> +	if (!set)
> +		return 0;
> +
> +	/*
> +	 * We could use the allocated sbitmap for that hctx here, but
> +	 * that would mean that we would need to clean it prior to use.
> +	 */
> +	res = sbitmap_init_node(&shared_sb,
> +				set->__bitmap_tags.sb.depth,
> +				set->__bitmap_tags.sb.shift,
> +				GFP_KERNEL, NUMA_NO_NODE);
> +	if (res)
> +		return res;
> +	sb = &shared_sb;
> +
> +	res = mutex_lock_interruptible(&q->sysfs_lock);
> +	if (res)
> +		goto out;
> +	if (hctx->tags) {
> +		hctx_filter_sb(sb, hctx);
> +		sbitmap_bitmap_show(sb, m);
> +	}
> +
> +	mutex_unlock(&q->sysfs_lock);
> +
> +out:
> +	sbitmap_free(&shared_sb);
> +	return res;
> +}
> +
>   static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
>   {
>   	struct blk_mq_hw_ctx *hctx = data;
> @@ -823,6 +884,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs
>   	{"busy", 0400, hctx_busy_show},
>   	{"ctx_map", 0400, hctx_ctx_map_show},
>   	{"tags", 0400, hctx_tags_show},
> +	{"tags_bitmap", 0400, hctx_tags_shared_sbitmap_bitmap_show},
>   	{"sched_tags", 0400, hctx_sched_tags_show},
>   	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
>   	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
> 
Ah, right. Here it is.

But I don't get it; why do we have to allocate a temporary bitmap and 
can't walk the existing shared sbitmap?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

* Re: [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap
  2020-06-11 13:16   ` Hannes Reinecke
@ 2020-06-11 14:22     ` John Garry
  0 siblings, 0 replies; 123+ messages in thread
From: John Garry @ 2020-06-11 14:22 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, jejb, martin.petersen, don.brace,
	kashyap.desai, sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

>>   out:
>> @@ -802,6 +822,7 @@ static const struct blk_mq_debugfs_attr 
>> blk_mq_debugfs_hctx_shared_sbitmap_attrs
>>       {"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
>>       {"busy", 0400, hctx_busy_show},
>>       {"ctx_map", 0400, hctx_ctx_map_show},
>> +    {"tags", 0400, hctx_tags_show},
>>       {"sched_tags", 0400, hctx_sched_tags_show},
>>       {"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
>>       {"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
> 
> I had been pondering this, too, when creating v6. Problem is that it'll 
> show the tags per hctx, but as they are shared I guess the list looks 
> pretty identical per hctx.

Right, so my main concern in this series is debugfs, and how to present 
the tag/hctx info. We can hide things under the hood mostly, but not so 
much for debugfs.

> So I guess we should filter the tags per hctx to have only those active 
> on that hctx displayed. But when doing so we can only print the 
> in-flight tags, the others are not assigned to a hctx and as such we 
> can't make a decision on which hctx they'll end up.

So we filter the tags in 7/12, and that should be ok.

But a concern is that we present identical hctx_tags_show() -> 
blk_mq_debugfs_shared_sitmap_show() -> sbitmap_queue_show() per-hctx 
info, which may be inappropriate/misleading/wrong. I need to audit that 
more thoroughly.

> 
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 7db16e49f6f6..6ca06b1c3a99 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -23,9 +23,19 @@
>>    */
>>   bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>>   {
>> -    if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
>> -        !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>> -        atomic_inc(&hctx->tags->active_queues);
>> +    struct request_queue *q = hctx->queue;
>> +    struct blk_mq_tag_set *set = q->tag_set;
>> +
>> +    if (blk_mq_is_sbitmap_shared(set)){
>> +        if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) &&
>> +            !test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
>> +            atomic_inc(&set->active_queues_shared_sbitmap);
>> +
>> +    } else {
>> +        if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
>> +            !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>> +            atomic_inc(&hctx->tags->active_queues);
>> +    }
>>       return true;
>>   }
> At one point someone would need to educate me what this double 
> 'test_bit' and 'test_and_set_bit' is supposed to achieve.
> Other than deliberately injecting a race condition ...

As I see it, it's not a dangerous race, and it is meant as a performance 
optimization.

test_bit() should be quick compared to test_and_set_bit(), so it is nicer 
to avoid always doing a test_and_set_bit().

So if we race with failed test_bit() calls and have multiple attempts 
with the test_and_set_bit(), only 1 will succeed and do the increment.
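
To illustrate the pattern, here is a minimal userspace sketch using C11 
atomics. It is not the kernel code: struct fake_hctx, TAG_ACTIVE_BIT and 
fake_tag_busy() are hypothetical stand-ins, and unlike 
__blk_mq_tag_busy() the helper returns whether this caller did the 
increment, purely so that the behaviour can be checked.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TAG_ACTIVE_BIT 0x1u	/* hypothetical stand-in for BLK_MQ_S_TAG_ACTIVE */

struct fake_hctx {
	atomic_uint state;		/* flag word, playing the role of hctx->state */
	atomic_int *active_queues;	/* shared counter of active users */
};

/* Returns true only for the caller that actually flipped the bit. */
static bool fake_tag_busy(struct fake_hctx *h)
{
	/* Cheap plain read first: usually the bit is already set. */
	if (atomic_load_explicit(&h->state, memory_order_relaxed) & TAG_ACTIVE_BIT)
		return false;

	/*
	 * Atomic read-modify-write. Several callers may get here at once,
	 * but only one of them observes the bit as previously clear.
	 */
	if (atomic_fetch_or(&h->state, TAG_ACTIVE_BIT) & TAG_ACTIVE_BIT)
		return false;

	atomic_fetch_add(h->active_queues, 1);
	return true;
}
```

The first check only avoids the more expensive atomic operation in the 
common case; correctness of the count relies on the second one alone.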

> 
>> @@ -47,11 +57,19 @@ void blk_mq_tag_wakeup_all(struct blk_mq_tags 
>> *tags, bool include_reserve)
>>   void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>>   {
>>       struct blk_mq_tags *tags = hctx->tags;
>> -
>> -    if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>> -        return;
>> -
>> -    atomic_dec(&tags->active_queues);
>> +    struct request_queue *q = hctx->queue;
>> +    struct blk_mq_tag_set *set = q->tag_set;
>> +
>> +    if (blk_mq_is_sbitmap_shared(q->tag_set)){
>> +        if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE,
>> +                    &q->queue_flags))
>> +            return;
>> +        atomic_dec(&set->active_queues_shared_sbitmap);
>> +    } else {
>> +        if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>> +            return;
>> +        atomic_dec(&tags->active_queues);
>> +    }
>>       blk_mq_tag_wakeup_all(tags, false);
>>   }
>> @@ -65,12 +83,11 @@ static inline bool hctx_may_queue(struct 
>> blk_mq_alloc_data *data,
>>   {
>>       struct blk_mq_hw_ctx *hctx = data->hctx;
>>       struct request_queue *q = data->q;
>> +    struct blk_mq_tag_set *set = q->tag_set;
>>       unsigned int depth, users;
>>       if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
>>           return true;
>> -    if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>> -        return true;
>>       /*
>>        * Don't try dividing an ant
>> @@ -78,7 +95,15 @@ static inline bool hctx_may_queue(struct 
>> blk_mq_alloc_data *data,
>>       if (bt->sb.depth == 1)
>>           return true;
>> -    users = atomic_read(&hctx->tags->active_queues);
>> +    if (blk_mq_is_sbitmap_shared(q->tag_set)) {
>> +        if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &q->queue_flags))
>> +            return true;
>> +        users = atomic_read(&set->active_queues_shared_sbitmap);
>> +    } else {
>> +        if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>> +            return true;
>> +        users = atomic_read(&hctx->tags->active_queues);
>> +    }
>>       if (!users)
>>           return true;
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 0f7e062a1665..f73a2f9c58bd 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -3350,6 +3350,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set 
>> *set)
>>           goto out_free_mq_map;
>>       if (blk_mq_is_sbitmap_shared(set)) {
>> +        atomic_set(&set->active_queues_shared_sbitmap, 0);
>> +
>>           if (!blk_mq_init_shared_sbitmap(set)) {
>>               ret = -ENOMEM;
>>               goto out_free_mq_rq_maps;
>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>> index 7b31cdb92a71..66711c7234db 100644
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -252,6 +252,7 @@ struct blk_mq_tag_set {
>>       unsigned int        timeout;
>>       unsigned int        flags;
>>       void            *driver_data;
>> +    atomic_t        active_queues_shared_sbitmap;

I'm not sure if we should present this in debugfs. As it stands, this 
series does not.

>>       struct sbitmap_queue    __bitmap_tags;
>>       struct sbitmap_queue    __breserved_tags;
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index c536278bec9e..1b0087e8d01a 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -619,6 +619,7 @@ struct request_queue {
>>   #define QUEUE_FLAG_PCI_P2PDMA    25    /* device supports PCI p2p 
>> requests */
>>   #define QUEUE_FLAG_ZONE_RESETALL 26    /* supports Zone Reset All */
>>   #define QUEUE_FLAG_RQ_ALLOC_TIME 27    /* record rq->alloc_time_ns */
>> +#define QUEUE_FLAG_HCTX_ACTIVE 28    /* at least one blk-mq hctx is 
>> active */
>>   #define QUEUE_FLAG_MQ_DEFAULT    ((1 << QUEUE_FLAG_IO_STAT) |        \
>>                    (1 << QUEUE_FLAG_SAME_COMP))
>>
> Other than that it looks fine.

Cheers

* Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  2020-06-11 13:19   ` Hannes Reinecke
@ 2020-06-11 14:33     ` John Garry
  2020-06-12  6:06       ` Hannes Reinecke
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-06-11 14:33 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, jejb, martin.petersen, don.brace,
	kashyap.desai, sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

>> +static int hctx_tags_shared_sbitmap_bitmap_show(void *data, struct 
>> seq_file *m)
>> +{
>> +    struct blk_mq_hw_ctx *hctx = data;
>> +    struct request_queue *q = hctx->queue;
>> +    struct blk_mq_tag_set *set = q->tag_set;
>> +    struct sbitmap shared_sb, *sb;
>> +    int res;
>> +
>> +    if (!set)
>> +        return 0;
>> +
>> +    /*
>> +     * We could use the allocated sbitmap for that hctx here, but
>> +     * that would mean that we would need to clean it prior to use.
>> +     */

*

>> +    res = sbitmap_init_node(&shared_sb,
>> +                set->__bitmap_tags.sb.depth,
>> +                set->__bitmap_tags.sb.shift,
>> +                GFP_KERNEL, NUMA_NO_NODE);
>> +    if (res)
>> +        return res;
>> +    sb = &shared_sb;
>> +
>> +    res = mutex_lock_interruptible(&q->sysfs_lock);
>> +    if (res)
>> +        goto out;
>> +    if (hctx->tags) {
>> +        hctx_filter_sb(sb, hctx);
>> +        sbitmap_bitmap_show(sb, m);
>> +    }
>> +
>> +    mutex_unlock(&q->sysfs_lock);
>> +
>> +out:
>> +    sbitmap_free(&shared_sb);
>> +    return res;
>> +}
>> +
>>   static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
>>   {
>>       struct blk_mq_hw_ctx *hctx = data;
>> @@ -823,6 +884,7 @@ static const struct blk_mq_debugfs_attr 
>> blk_mq_debugfs_hctx_shared_sbitmap_attrs
>>       {"busy", 0400, hctx_busy_show},
>>       {"ctx_map", 0400, hctx_ctx_map_show},
>>       {"tags", 0400, hctx_tags_show},
>> +    {"tags_bitmap", 0400, hctx_tags_shared_sbitmap_bitmap_show},
>>       {"sched_tags", 0400, hctx_sched_tags_show},
>>       {"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
>>       {"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
>>
> Ah, right. Here it is.
> 
> But I don't get it; why do we have to allocate a temporary bitmap and 
> can't walk the existing shared sbitmap?

For the bitmap dump - sbitmap_bitmap_show() - it is passed a struct 
sbitmap *. So we have to filter into a temp per-hctx struct sbitmap. We 
could change sbitmap_bitmap_show() to accept a filter iterator - which I 
think you're getting at - but I am not sure it's worth the change. Or 
else use the allocated sbitmap for the hctx, as above*, but I may 
remove that (see 4/12 response).
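
As a rough userspace model of the "filter into a temporary bitmap" 
approach: the sketch below is not kernel code and does not use the 
sbitmap API. struct fake_tagset, NR_TAGS and filter_tags_for_hctx() are 
hypothetical names, and a single 32-bit word stands in for the shared 
sbitmap.

```c
#include <assert.h>
#include <stdint.h>

#define NR_TAGS 16	/* hypothetical depth; must be <= 32 in this model */

/* Hypothetical model of a tag set shared by several hctxs. */
struct fake_tagset {
	uint32_t shared_map;	/* bit n set => tag n is in flight */
	int owner[NR_TAGS];	/* hctx index owning in-flight tag n */
};

/* Build a private bitmap holding only the in-flight tags of hctx_idx. */
static uint32_t filter_tags_for_hctx(const struct fake_tagset *set,
				     int hctx_idx)
{
	uint32_t tmp = 0;	/* starts empty, so no cleaning is needed */

	for (int tag = 0; tag < NR_TAGS; tag++) {
		if (!(set->shared_map & (1u << tag)))
			continue;	/* tag is free, nothing to report */
		if (set->owner[tag] == hctx_idx)
			tmp |= 1u << tag;
	}
	return tmp;
}
```

The temporary bitmap can then be handed unchanged to whatever dump 
routine already takes a plain bitmap.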

Cheers,
John

* Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  2020-06-11 14:33     ` John Garry
@ 2020-06-12  6:06       ` Hannes Reinecke
  2020-06-29 15:32         ` About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap) John Garry
  2020-07-13  9:41         ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap John Garry
  0 siblings, 2 replies; 123+ messages in thread
From: Hannes Reinecke @ 2020-06-12  6:06 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	kashyap.desai, sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 6/11/20 4:33 PM, John Garry wrote:
>>> +static int hctx_tags_shared_sbitmap_bitmap_show(void *data, struct 
>>> seq_file *m)
>>> +{
>>> +    struct blk_mq_hw_ctx *hctx = data;
>>> +    struct request_queue *q = hctx->queue;
>>> +    struct blk_mq_tag_set *set = q->tag_set;
>>> +    struct sbitmap shared_sb, *sb;
>>> +    int res;
>>> +
>>> +    if (!set)
>>> +        return 0;
>>> +
>>> +    /*
>>> +     * We could use the allocated sbitmap for that hctx here, but
>>> +     * that would mean that we would need to clean it prior to use.
>>> +     */
> 
> *
> 
>>> +    res = sbitmap_init_node(&shared_sb,
>>> +                set->__bitmap_tags.sb.depth,
>>> +                set->__bitmap_tags.sb.shift,
>>> +                GFP_KERNEL, NUMA_NO_NODE);
>>> +    if (res)
>>> +        return res;
>>> +    sb = &shared_sb;
>>> +
>>> +    res = mutex_lock_interruptible(&q->sysfs_lock);
>>> +    if (res)
>>> +        goto out;
>>> +    if (hctx->tags) {
>>> +        hctx_filter_sb(sb, hctx);
>>> +        sbitmap_bitmap_show(sb, m);
>>> +    }
>>> +
>>> +    mutex_unlock(&q->sysfs_lock);
>>> +
>>> +out:
>>> +    sbitmap_free(&shared_sb);
>>> +    return res;
>>> +}
>>> +
>>>   static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
>>>   {
>>>       struct blk_mq_hw_ctx *hctx = data;
>>> @@ -823,6 +884,7 @@ static const struct blk_mq_debugfs_attr 
>>> blk_mq_debugfs_hctx_shared_sbitmap_attrs
>>>       {"busy", 0400, hctx_busy_show},
>>>       {"ctx_map", 0400, hctx_ctx_map_show},
>>>       {"tags", 0400, hctx_tags_show},
>>> +    {"tags_bitmap", 0400, hctx_tags_shared_sbitmap_bitmap_show},
>>>       {"sched_tags", 0400, hctx_sched_tags_show},
>>>       {"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
>>>       {"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
>>>
>> Ah, right. Here it is.
>>
>> But I don't get it; why do we have to allocate a temporary bitmap and 
>> can't walk the existing shared sbitmap?
> 
> For the bitmap dump - sbitmap_bitmap_show() - it is passed a struct 
> sbitmap *. So we have to filter into a temp per-hctx struct sbitmap. We 
> could change sbitmap_bitmap_show() to accept a filter iterator - which I 
> think you're getting at - but I am not sure it's worth the change. Or 
> else use the allocated sbitmap for the hctx, as above*, but I may be 
> remove that (see 4/12 response).
> 
Yes, I do think I would prefer updating sbitmap_bitmap_show() to accept 
a filter. Especially as Ming Lei has now updated the tag iterators to 
accept a filter, too, so it should be an easy change.
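
For comparison, a dump routine that takes a filter could look roughly 
like the userspace sketch below. This is only an assumption about the 
shape of such a change, not the sbitmap_bitmap_show() interface: 
tag_filter_fn, bitmap_show_filtered(), struct owner_filter and 
owner_matches() are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical filter callback: return true to report this tag. */
typedef bool (*tag_filter_fn)(unsigned int tag, void *priv);

/*
 * Write one '0' or '1' per tag into buf and NUL-terminate it. A set bit
 * is shown as '1' only if filter is NULL or accepts the tag. This model
 * handles at most 32 tags. Returns the number of characters written.
 */
static size_t bitmap_show_filtered(uint32_t map, unsigned int depth,
				   tag_filter_fn filter, void *priv,
				   char *buf, size_t len)
{
	size_t n = 0;

	for (unsigned int tag = 0; tag < depth && tag < 32 && n + 1 < len; tag++) {
		bool set = map & (1u << tag);

		if (set && filter && !filter(tag, priv))
			set = false;
		buf[n++] = set ? '1' : '0';
	}
	if (len)
		buf[n] = '\0';
	return n;
}

/* Example filter: accept only tags owned by one hctx index. */
struct owner_filter {
	const int *owner;	/* owner[tag] = hctx index */
	int hctx_idx;
};

static bool owner_matches(unsigned int tag, void *priv)
{
	const struct owner_filter *f = priv;

	return f->owner[tag] == f->hctx_idx;
}
```

With this shape no temporary bitmap is allocated; the cost is one 
callback per set bit while walking the shared map.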

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

* RE: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-11  9:35   ` John Garry
@ 2020-06-12 18:47     ` Kashyap Desai
  2020-06-15  2:13       ` Ming Lei
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-06-12 18:47 UTC (permalink / raw)
  To: John Garry, Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, Sumit Saxena,
	bvanassche, hare, hch, Shivasharan Srikanteshwara, linux-block,
	linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

> On 11/06/2020 04:07, Ming Lei wrote:
> >> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are
> >> included, but it is not always an appropriate scheduler to use.
> >>
> >> Tag depth 		4000 (default)			260**
> >>
> >> Baseline:
> >> none sched:		2290K IOPS			894K
> >> mq-deadline sched:	2341K IOPS			2313K
> >>
> >> Final, host_tagset=0 in LLDD*
> >> none sched:		2289K IOPS			703K
> >> mq-deadline sched:	2337K IOPS			2291K
> >>
> >> Final:
> >> none sched:		2281K IOPS			1101K
> >> mq-deadline sched:	2322K IOPS			1278K
> >>
> >> * this is relevant as this is the performance in supporting but not
> >>    enabling the feature
> >> ** depth=260 is relevant as some point where we are regularly waiting
> >> for
> >>     tags to be available. Figures were are a bit unstable here for
> >> testing.

John -

I tried the V7 series and debugged further on the mq-deadline interface.
This time I used another setup, since an HDD based setup is not readily
available to me.
In fact, I was able to reproduce the issue very easily using a single
scsi_device as well. BTW, this is not an issue with this RFC, but a generic
issue. Since I have converted the Broadcom product to nr_hw_queue > 1 using
this RFC, it becomes noticeable now.

Problem - Using the command below, I see heavy CPU utilization in
"native_queued_spin_lock_slowpath". This is because the kblockd work queue
is submitting IO from all the CPUs even though fio is bound to a single CPU.
Lock contention from "dd_dispatch_request" is causing this issue.

numactl -C 13 fio single.fio --iodepth=32 --bs=4k --rw=randread \
    --ioscheduler=none --numjobs=1 --cpus_allowed_policy=split \
    --ioscheduler=mq-deadline --group_reporting --filename=/dev/sdd

While running the above command, ideally we expect only kworker/13 to be
active. But as you can see below, all the CPUs are attempting submission,
and a lot of CPU time is consumed by lock contention.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2726 root       0 -20       0      0      0 R  56.5  0.0   0:53.20 kworker/13:1H-k
 7815 root      20   0  712404  15536   2228 R  43.2  0.0   0:05.03 fio
 2792 root       0 -20       0      0      0 I  26.6  0.0   0:22.19 kworker/18:1H-k
 2791 root       0 -20       0      0      0 I  19.9  0.0   0:17.17 kworker/19:1H-k
 1419 root       0 -20       0      0      0 I  19.6  0.0   0:17.03 kworker/20:1H-k
 2793 root       0 -20       0      0      0 I  18.3  0.0   0:15.64 kworker/21:1H-k
 1424 root       0 -20       0      0      0 I  17.3  0.0   0:14.99 kworker/22:1H-k
 2626 root       0 -20       0      0      0 I  16.9  0.0   0:14.68 kworker/26:1H-k
 2794 root       0 -20       0      0      0 I  16.9  0.0   0:14.87 kworker/23:1H-k
 2795 root       0 -20       0      0      0 I  16.9  0.0   0:14.81 kworker/24:1H-k
 2797 root       0 -20       0      0      0 I  16.9  0.0   0:14.62 kworker/27:1H-k
 1415 root       0 -20       0      0      0 I  16.6  0.0   0:14.44 kworker/30:1H-k
 2669 root       0 -20       0      0      0 I  16.6  0.0   0:14.38 kworker/31:1H-k
 2796 root       0 -20       0      0      0 I  16.6  0.0   0:14.74 kworker/25:1H-k
 2799 root       0 -20       0      0      0 I  16.6  0.0   0:14.56 kworker/28:1H-k
 1425 root       0 -20       0      0      0 I  16.3  0.0   0:14.21 kworker/34:1H-k
 2746 root       0 -20       0      0      0 I  16.3  0.0   0:14.33 kworker/32:1H-k
 2798 root       0 -20       0      0      0 I  16.3  0.0   0:14.50 kworker/29:1H-k
 2800 root       0 -20       0      0      0 I  16.3  0.0   0:14.27 kworker/33:1H-k
 1423 root       0 -20       0      0      0 I  15.9  0.0   0:14.10 kworker/54:1H-k
 1784 root       0 -20       0      0      0 I  15.9  0.0   0:14.03 kworker/55:1H-k
 2801 root       0 -20       0      0      0 I  15.9  0.0   0:14.15 kworker/35:1H-k
 2815 root       0 -20       0      0      0 I  15.9  0.0   0:13.97 kworker/56:1H-k
 1484 root       0 -20       0      0      0 I  15.6  0.0   0:13.90 kworker/57:1H-k
 1485 root       0 -20       0      0      0 I  15.6  0.0   0:13.82 kworker/59:1H-k
 1519 root       0 -20       0      0      0 I  15.6  0.0   0:13.64 kworker/62:1H-k
 2315 root       0 -20       0      0      0 I  15.6  0.0   0:13.87 kworker/58:1H-k
 2627 root       0 -20       0      0      0 I  15.6  0.0   0:13.69 kworker/61:1H-k
 2816 root       0 -20       0      0      0 I  15.6  0.0   0:13.75 kworker/60:1H-k


I root-caused this issue:

The block layer always queues IO on the hctx mapped to CPU 13, but the hw
queues are run from all the hctxs.
In my test, hctx48 has queued all the IOs; no other hctx has queued any
IO. But every hctx is counting "run".

# cat hctx48/queued
2087058

#cat hctx*/run
151318
30038
83110
50680
69907
60391
111239
18036
33935
91648
34582
22853
61286
19489

The patch below has a fix: "Run the hctx queue for which the request was
completed instead of running all the hardware queues."
If this looks like a valid fix, please include it in V8, or I can post a
separate patch for it. I just want some level of review from this discussion.

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0652acd..f52118f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -554,6 +554,7 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
        struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
        struct scsi_device *sdev = cmd->device;
        struct request_queue *q = sdev->request_queue;
+       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;

        if (blk_update_request(req, error, bytes))
                return true;
@@ -595,7 +596,8 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
            !list_empty(&sdev->host->starved_list))
                kblockd_schedule_work(&sdev->requeue_work);
        else
-               blk_mq_run_hw_queues(q, true);
+               blk_mq_run_hw_queue(mq_hctx, true);
+               //blk_mq_run_hw_queues(q, true);

        percpu_ref_put(&q->q_usage_counter);
        return false;


After the above patch, only kworker/13 is actively doing submission.

 3858 root       0 -20       0      0      0 I  22.9  0.0   3:24.04 kworker/13:1H-k
16768 root      20   0  712008  14968   2180 R  21.6  0.0   0:03.27 fio
16769 root      20   0  712012  14968   2180 R  21.6  0.0   0:03.27 fio

Without the above patch, 24 SSD drives give 1.5M IOPS; with the patch,
they give 3.2M IOPS.

I will continue my testing.

Thanks, Kashyap

> >>
> >> A copy of the patches can be found here:
> >> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-
> >> shared-tags-rfc-v7
> >>
> >> And to progress this series, we the the following to go in first, when
> >> ready:
> >> https://lore.kernel.org/linux-scsi/20200430131904.5847-1-hare@suse.de
> >> /
> > I'd suggest to add options to enable shared tags for null_blk &
> > scsi_debug in V8, so that it is easier to verify the changes without
> > real
> hardware.
> >
>
> ok, fine, I can look at including that. To stop the series getting too
> large, I
> might spin off the early patches, which are not strictly related.
>
> Thanks,
> John

* Re: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-12 18:47     ` Kashyap Desai
@ 2020-06-15  2:13       ` Ming Lei
  2020-06-15  6:57         ` Kashyap Desai
  0 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-06-15  2:13 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Sat, Jun 13, 2020 at 12:17:37AM +0530, Kashyap Desai wrote:
> > On 11/06/2020 04:07, Ming Lei wrote:
> > >> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are
> > >> included, but it is not always an appropriate scheduler to use.
> > >>
> > >> Tag depth 		4000 (default)			260**
> > >>
> > >> Baseline:
> > >> none sched:		2290K IOPS			894K
> > >> mq-deadline sched:	2341K IOPS			2313K
> > >>
> > >> Final, host_tagset=0 in LLDD*
> > >> none sched:		2289K IOPS			703K
> > >> mq-deadline sched:	2337K IOPS			2291K
> > >>
> > >> Final:
> > >> none sched:		2281K IOPS			1101K
> > >> mq-deadline sched:	2322K IOPS			1278K
> > >>
> > >> * this is relevant as this is the performance in supporting but not
> > >>    enabling the feature
> > >> ** depth=260 is relevant as some point where we are regularly waiting
> > >> for
> > >>     tags to be available. Figures were are a bit unstable here for
> > >> testing.
> 
> John -
> 
> I tried V7 series and debug further on mq-deadline interface. This time I
> have used another setup since HDD based setup is not readily available for
> me.
> In fact, I was able to simulate issue very easily using single scsi_device
> as well. BTW, this is not an issue with this RFC, but generic issue.
> Since I have converted nr_hw_queue > 1 for Broadcom product using this RFC,
> It becomes noticeable now.
> 
> Problem - Using below command  I see heavy CPU utilization on "
> native_queued_spin_lock_slowpath". This is because kblockd work queue is
> submitting IO from all the CPUs even though fio is bound to single CPU.
> Lock contention from " dd_dispatch_request" is causing this issue.
> 
> numactl -C 13  fio
> single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none --numjobs=1
>  --cpus_allowed_policy=split --ioscheduler=mq-deadline
> --group_reporting --filename=/dev/sdd
> 
> While running above command, ideally we expect only kworker/13 to be active.
> But you can see below - All the CPU is attempting submission and lots of CPU
> consumption is due to lock contention.
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  2726 root       0 -20       0      0      0 R  56.5  0.0   0:53.20
> kworker/13:1H-k
>  7815 root      20   0  712404  15536   2228 R  43.2  0.0   0:05.03 fio
>  2792 root       0 -20       0      0      0 I  26.6  0.0   0:22.19
> kworker/18:1H-k
>  2791 root       0 -20       0      0      0 I  19.9  0.0   0:17.17
> kworker/19:1H-k
>  1419 root       0 -20       0      0      0 I  19.6  0.0   0:17.03
> kworker/20:1H-k
>  2793 root       0 -20       0      0      0 I  18.3  0.0   0:15.64
> kworker/21:1H-k
>  1424 root       0 -20       0      0      0 I  17.3  0.0   0:14.99
> kworker/22:1H-k
>  2626 root       0 -20       0      0      0 I  16.9  0.0   0:14.68
> kworker/26:1H-k
>  2794 root       0 -20       0      0      0 I  16.9  0.0   0:14.87
> kworker/23:1H-k
>  2795 root       0 -20       0      0      0 I  16.9  0.0   0:14.81
> kworker/24:1H-k
>  2797 root       0 -20       0      0      0 I  16.9  0.0   0:14.62
> kworker/27:1H-k
>  1415 root       0 -20       0      0      0 I  16.6  0.0   0:14.44
> kworker/30:1H-k
>  2669 root       0 -20       0      0      0 I  16.6  0.0   0:14.38
> kworker/31:1H-k
>  2796 root       0 -20       0      0      0 I  16.6  0.0   0:14.74
> kworker/25:1H-k
>  2799 root       0 -20       0      0      0 I  16.6  0.0   0:14.56
> kworker/28:1H-k
>  1425 root       0 -20       0      0      0 I  16.3  0.0   0:14.21
> kworker/34:1H-k
>  2746 root       0 -20       0      0      0 I  16.3  0.0   0:14.33
> kworker/32:1H-k
>  2798 root       0 -20       0      0      0 I  16.3  0.0   0:14.50
> kworker/29:1H-k
>  2800 root       0 -20       0      0      0 I  16.3  0.0   0:14.27
> kworker/33:1H-k
>  1423 root       0 -20       0      0      0 I  15.9  0.0   0:14.10
> kworker/54:1H-k
>  1784 root       0 -20       0      0      0 I  15.9  0.0   0:14.03
> kworker/55:1H-k
>  2801 root       0 -20       0      0      0 I  15.9  0.0   0:14.15
> kworker/35:1H-k
>  2815 root       0 -20       0      0      0 I  15.9  0.0   0:13.97
> kworker/56:1H-k
>  1484 root       0 -20       0      0      0 I  15.6  0.0   0:13.90
> kworker/57:1H-k
>  1485 root       0 -20       0      0      0 I  15.6  0.0   0:13.82
> kworker/59:1H-k
>  1519 root       0 -20       0      0      0 I  15.6  0.0   0:13.64
> kworker/62:1H-k
>  2315 root       0 -20       0      0      0 I  15.6  0.0   0:13.87
> kworker/58:1H-k
>  2627 root       0 -20       0      0      0 I  15.6  0.0   0:13.69
> kworker/61:1H-k
>  2816 root       0 -20       0      0      0 I  15.6  0.0   0:13.75
> kworker/60:1H-k
> 
> 
> I root cause this issue -
> 
> Block layer always queue IO on hctx context mapped to CPU-13, but hw queue
> run from all the hctx context.
> I noticed in my test hctx48 has queued all the IOs. No other hctx has queued
> IO. But all the hctx is counting for "run".
> 
> # cat hctx48/queued
> 2087058
> 
> #cat hctx*/run
> 151318
> 30038
> 83110
> 50680
> 69907
> 60391
> 111239
> 18036
> 33935
> 91648
> 34582
> 22853
> 61286
> 19489
> 
> Below patch has fix - "Run the hctx queue for which request was completed
> instead of running all the hardware queue."
> If this looks valid fix, please include in V8 OR I can post separate patch
> for this. Just want to have some level of review from this discussion.
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 0652acd..f52118f 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -554,6 +554,7 @@ static bool scsi_end_request(struct request *req,
> blk_status_t error,
>         struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
>         struct scsi_device *sdev = cmd->device;
>         struct request_queue *q = sdev->request_queue;
> +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> 
>         if (blk_update_request(req, error, bytes))
>                 return true;
> @@ -595,7 +596,8 @@ static bool scsi_end_request(struct request *req,
> blk_status_t error,
>             !list_empty(&sdev->host->starved_list))
>                 kblockd_schedule_work(&sdev->requeue_work);
>         else
> -               blk_mq_run_hw_queues(q, true);
> +               blk_mq_run_hw_queue(mq_hctx, true);
> +               //blk_mq_run_hw_queues(q, true);

This may cause an IO hang, because ->device_busy is shared by all hctxs.
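
A toy userspace model of one way that could happen is sketched below. It 
is only an illustration under simplifying assumptions: struct fake_dev 
and the helpers are hypothetical, a single counter stands in for the 
per-device budget, and any other mechanism in blk-mq that might rerun a 
stalled queue is ignored.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_HCTX 2

/* Hypothetical model: one device budget shared by all hctxs. */
struct fake_dev {
	int device_busy;	/* in-flight commands for the whole device */
	int queue_depth;	/* limit on device_busy */
	int pending[NR_HCTX];	/* requests waiting on each hctx */
};

/* Dispatch from one hctx for as long as the shared budget allows. */
static void run_hw_queue(struct fake_dev *d, int hctx)
{
	while (d->pending[hctx] > 0 && d->device_busy < d->queue_depth) {
		d->pending[hctx]--;
		d->device_busy++;
	}
}

static void run_hw_queues(struct fake_dev *d)
{
	for (int i = 0; i < NR_HCTX; i++)
		run_hw_queue(d, i);
}

/* Complete one command issued via hctx, then rerun one or all queues. */
static void complete_one(struct fake_dev *d, int hctx, bool rerun_all)
{
	d->device_busy--;
	if (rerun_all)
		run_hw_queues(d);
	else
		run_hw_queue(d, hctx);
}

/*
 * hctx 0 holds the only budget slot while hctx 1 has a request waiting.
 * Returns how many requests are still stuck on hctx 1 after the
 * completion on hctx 0.
 */
static int stranded_after_completion(bool rerun_all)
{
	struct fake_dev d = {
		.device_busy = 1,
		.queue_depth = 1,
		.pending = { 0, 1 },
	};

	complete_one(&d, 0, rerun_all);
	return d.pending[1];
}
```

In this model, rerunning only the completing hctx frees the budget but 
leaves the request on the other hctx waiting with nothing to kick it.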

Thanks,
Ming


* RE: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-15  2:13       ` Ming Lei
@ 2020-06-15  6:57         ` Kashyap Desai
  2020-06-16  1:00           ` Ming Lei
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-06-15  6:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> >
> > John -
> >
> > I tried V7 series and debug further on mq-deadline interface. This
> > time I have used another setup since HDD based setup is not readily
> > available for me.
> > In fact, I was able to simulate issue very easily using single
> > scsi_device as well. BTW, this is not an issue with this RFC, but
generic issue.
> > Since I have converted nr_hw_queue > 1 for Broadcom product using this
> > RFC, It becomes noticeable now.
> >
> > Problem - Using below command  I see heavy CPU utilization on "
> > native_queued_spin_lock_slowpath". This is because kblockd work queue
> > is submitting IO from all the CPUs even though fio is bound to single
CPU.
> > Lock contention from " dd_dispatch_request" is causing this issue.
> >
> > numactl -C 13  fio
> > single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none
> > --numjobs=1  --cpus_allowed_policy=split --ioscheduler=mq-deadline
> > --group_reporting --filename=/dev/sdd
> >
> > While running above command, ideally we expect only kworker/13 to be
> active.
> > But you can see below - All the CPU is attempting submission and lots
> > of CPU consumption is due to lock contention.
> >
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
COMMAND
> >  2726 root       0 -20       0      0      0 R  56.5  0.0   0:53.20
> > kworker/13:1H-k
> >  7815 root      20   0  712404  15536   2228 R  43.2  0.0   0:05.03
fio
> >  2792 root       0 -20       0      0      0 I  26.6  0.0   0:22.19
> > kworker/18:1H-k
> >  2791 root       0 -20       0      0      0 I  19.9  0.0   0:17.17
> > kworker/19:1H-k
> >  1419 root       0 -20       0      0      0 I  19.6  0.0   0:17.03
> > kworker/20:1H-k
> >  2793 root       0 -20       0      0      0 I  18.3  0.0   0:15.64
> > kworker/21:1H-k
> >  1424 root       0 -20       0      0      0 I  17.3  0.0   0:14.99
> > kworker/22:1H-k
> >  2626 root       0 -20       0      0      0 I  16.9  0.0   0:14.68
> > kworker/26:1H-k
> >  2794 root       0 -20       0      0      0 I  16.9  0.0   0:14.87
> > kworker/23:1H-k
> >  2795 root       0 -20       0      0      0 I  16.9  0.0   0:14.81
> > kworker/24:1H-k
> >  2797 root       0 -20       0      0      0 I  16.9  0.0   0:14.62
> > kworker/27:1H-k
> >  1415 root       0 -20       0      0      0 I  16.6  0.0   0:14.44
> > kworker/30:1H-k
> >  2669 root       0 -20       0      0      0 I  16.6  0.0   0:14.38
> > kworker/31:1H-k
> >  2796 root       0 -20       0      0      0 I  16.6  0.0   0:14.74
> > kworker/25:1H-k
> >  2799 root       0 -20       0      0      0 I  16.6  0.0   0:14.56
> > kworker/28:1H-k
> >  1425 root       0 -20       0      0      0 I  16.3  0.0   0:14.21
> > kworker/34:1H-k
> >  2746 root       0 -20       0      0      0 I  16.3  0.0   0:14.33
> > kworker/32:1H-k
> >  2798 root       0 -20       0      0      0 I  16.3  0.0   0:14.50
> > kworker/29:1H-k
> >  2800 root       0 -20       0      0      0 I  16.3  0.0   0:14.27
> > kworker/33:1H-k
> >  1423 root       0 -20       0      0      0 I  15.9  0.0   0:14.10
> > kworker/54:1H-k
> >  1784 root       0 -20       0      0      0 I  15.9  0.0   0:14.03
> > kworker/55:1H-k
> >  2801 root       0 -20       0      0      0 I  15.9  0.0   0:14.15
> > kworker/35:1H-k
> >  2815 root       0 -20       0      0      0 I  15.9  0.0   0:13.97
> > kworker/56:1H-k
> >  1484 root       0 -20       0      0      0 I  15.6  0.0   0:13.90
> > kworker/57:1H-k
> >  1485 root       0 -20       0      0      0 I  15.6  0.0   0:13.82
> > kworker/59:1H-k
> >  1519 root       0 -20       0      0      0 I  15.6  0.0   0:13.64
> > kworker/62:1H-k
> >  2315 root       0 -20       0      0      0 I  15.6  0.0   0:13.87
> > kworker/58:1H-k
> >  2627 root       0 -20       0      0      0 I  15.6  0.0   0:13.69
> > kworker/61:1H-k
> >  2816 root       0 -20       0      0      0 I  15.6  0.0   0:13.75
> > kworker/60:1H-k
> >
> >
> > I root cause this issue -
> >
> > Block layer always queue IO on hctx context mapped to CPU-13, but hw
> > queue run from all the hctx context.
> > I noticed in my test hctx48 has queued all the IOs. No other hctx has
> > queued IO. But all the hctx is counting for "run".
> >
> > # cat hctx48/queued
> > 2087058
> >
> > #cat hctx*/run
> > 151318
> > 30038
> > 83110
> > 50680
> > 69907
> > 60391
> > 111239
> > 18036
> > 33935
> > 91648
> > 34582
> > 22853
> > 61286
> > 19489
> >
> > Below patch has fix - "Run the hctx queue for which request was
> > completed instead of running all the hardware queue."
> > If this looks valid fix, please include in V8 OR I can post separate
> > patch for this. Just want to have some level of review from this
discussion.
> >
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index
> > 0652acd..f52118f 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -554,6 +554,7 @@ static bool scsi_end_request(struct request *req,
> > blk_status_t error,
> >         struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> >         struct scsi_device *sdev = cmd->device;
> >         struct request_queue *q = sdev->request_queue;
> > +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> >
> >         if (blk_update_request(req, error, bytes))
> >                 return true;
> > @@ -595,7 +596,8 @@ static bool scsi_end_request(struct request *req,
> > blk_status_t error,
> >             !list_empty(&sdev->host->starved_list))
> >                 kblockd_schedule_work(&sdev->requeue_work);
> >         else
> > -               blk_mq_run_hw_queues(q, true);
> > +               blk_mq_run_hw_queue(mq_hctx, true);
> > +               //blk_mq_run_hw_queues(q, true);
>
> This way may cause IO hang because ->device_busy is shared by all hctxs.

From the SCSI stack, if we attempt to run all h/w queues, is it possible that
the block layer actually runs a hw_queue which has not queued any IO?
Currently, in the case of mq-deadline, IOs are inserted using
"dd_insert_request". This function adds IOs to the elevator data, which is
per request queue and not per hctx.
When there is an attempt to run an hctx, "blk_mq_sched_has_work" checks for
pending work, which is also per request queue and not per hctx.
Because of this, IOs queued on only one hctx will be run from all the hctxs,
and this creates unnecessary lock contention.
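
A rough user-space sketch of that behaviour follows, assuming the scheduler
keeps one pending count per request queue. It is not the mq-deadline code;
struct toy_elevator, struct toy_queue, the toy_* helpers and ELV_NR_HCTX are
hypothetical names used only here.

```c
#include <assert.h>
#include <stdbool.h>

#define ELV_NR_HCTX 4

/* Hypothetical stand-in for per-request-queue elevator data. */
struct toy_elevator {
	int nr_queued;  /* requests held by the scheduler, for any hctx */
};

struct toy_queue {
	struct toy_elevator elv;
};

/* Insert on behalf of one hctx; the hctx index is not recorded anywhere. */
static void toy_insert(struct toy_queue *q, int hctx)
{
	(void)hctx;
	q->elv.nr_queued++;
}

/* Per-queue answer: the result is the same whichever hctx asks. */
static bool toy_has_work(const struct toy_queue *q, int hctx)
{
	(void)hctx;
	return q->elv.nr_queued > 0;
}

/* Count the hctxs that a "run all hardware queues" pass would enter. */
static int toy_count_runnable(const struct toy_queue *q)
{
	int n = 0;

	for (int i = 0; i < ELV_NR_HCTX; i++)
		if (toy_has_work(q, i))
			n++;
	return n;
}
```

After a single toy_insert() through one hctx, toy_count_runnable() reports
every hctx as having work, which is the pattern the "run" counters suggest.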

How about below patch - ?

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021f..1d30bd3 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct elevator_queue *e = hctx->queue->elevator;

+       /* If current hctx has not queued any request, there is no need to run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!hctx->queued)
+               return false;
+
        if (e && e->type->ops.has_work)
                return e->type->ops.has_work(hctx);

Kashyap

>
> Thanks,
> Ming


* Re: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-15  6:57         ` Kashyap Desai
@ 2020-06-16  1:00           ` Ming Lei
  2020-06-17 11:26             ` Kashyap Desai
  0 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-06-16  1:00 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Mon, Jun 15, 2020 at 12:27:30PM +0530, Kashyap Desai wrote:
> > >
> > > John -
> > >
> > > I tried V7 series and debug further on mq-deadline interface. This
> > > time I have used another setup since HDD based setup is not readily
> > > available for me.
> > > In fact, I was able to simulate issue very easily using single
> > > scsi_device as well. BTW, this is not an issue with this RFC, but
> generic issue.
> > > Since I have converted nr_hw_queue > 1 for Broadcom product using this
> > > RFC, It becomes noticeable now.
> > >
> > > Problem - Using below command  I see heavy CPU utilization on "
> > > native_queued_spin_lock_slowpath". This is because kblockd work queue
> > > is submitting IO from all the CPUs even though fio is bound to single
> CPU.
> > > Lock contention from " dd_dispatch_request" is causing this issue.
> > >
> > > numactl -C 13  fio
> > > single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none
> > > --numjobs=1  --cpus_allowed_policy=split --ioscheduler=mq-deadline
> > > --group_reporting --filename=/dev/sdd
> > >
> > > While running above command, ideally we expect only kworker/13 to be
> > active.
> > > But you can see below - All the CPU is attempting submission and lots
> > > of CPU consumption is due to lock contention.
> > >
> > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
> COMMAND
> > >  2726 root       0 -20       0      0      0 R  56.5  0.0   0:53.20
> > > kworker/13:1H-k
> > >  7815 root      20   0  712404  15536   2228 R  43.2  0.0   0:05.03
> fio
> > >  2792 root       0 -20       0      0      0 I  26.6  0.0   0:22.19
> > > kworker/18:1H-k
> > >  2791 root       0 -20       0      0      0 I  19.9  0.0   0:17.17
> > > kworker/19:1H-k
> > >  1419 root       0 -20       0      0      0 I  19.6  0.0   0:17.03
> > > kworker/20:1H-k
> > >  2793 root       0 -20       0      0      0 I  18.3  0.0   0:15.64
> > > kworker/21:1H-k
> > >  1424 root       0 -20       0      0      0 I  17.3  0.0   0:14.99
> > > kworker/22:1H-k
> > >  2626 root       0 -20       0      0      0 I  16.9  0.0   0:14.68
> > > kworker/26:1H-k
> > >  2794 root       0 -20       0      0      0 I  16.9  0.0   0:14.87
> > > kworker/23:1H-k
> > >  2795 root       0 -20       0      0      0 I  16.9  0.0   0:14.81
> > > kworker/24:1H-k
> > >  2797 root       0 -20       0      0      0 I  16.9  0.0   0:14.62
> > > kworker/27:1H-k
> > >  1415 root       0 -20       0      0      0 I  16.6  0.0   0:14.44
> > > kworker/30:1H-k
> > >  2669 root       0 -20       0      0      0 I  16.6  0.0   0:14.38
> > > kworker/31:1H-k
> > >  2796 root       0 -20       0      0      0 I  16.6  0.0   0:14.74
> > > kworker/25:1H-k
> > >  2799 root       0 -20       0      0      0 I  16.6  0.0   0:14.56
> > > kworker/28:1H-k
> > >  1425 root       0 -20       0      0      0 I  16.3  0.0   0:14.21
> > > kworker/34:1H-k
> > >  2746 root       0 -20       0      0      0 I  16.3  0.0   0:14.33
> > > kworker/32:1H-k
> > >  2798 root       0 -20       0      0      0 I  16.3  0.0   0:14.50
> > > kworker/29:1H-k
> > >  2800 root       0 -20       0      0      0 I  16.3  0.0   0:14.27
> > > kworker/33:1H-k
> > >  1423 root       0 -20       0      0      0 I  15.9  0.0   0:14.10
> > > kworker/54:1H-k
> > >  1784 root       0 -20       0      0      0 I  15.9  0.0   0:14.03
> > > kworker/55:1H-k
> > >  2801 root       0 -20       0      0      0 I  15.9  0.0   0:14.15
> > > kworker/35:1H-k
> > >  2815 root       0 -20       0      0      0 I  15.9  0.0   0:13.97
> > > kworker/56:1H-k
> > >  1484 root       0 -20       0      0      0 I  15.6  0.0   0:13.90
> > > kworker/57:1H-k
> > >  1485 root       0 -20       0      0      0 I  15.6  0.0   0:13.82
> > > kworker/59:1H-k
> > >  1519 root       0 -20       0      0      0 I  15.6  0.0   0:13.64
> > > kworker/62:1H-k
> > >  2315 root       0 -20       0      0      0 I  15.6  0.0   0:13.87
> > > kworker/58:1H-k
> > >  2627 root       0 -20       0      0      0 I  15.6  0.0   0:13.69
> > > kworker/61:1H-k
> > >  2816 root       0 -20       0      0      0 I  15.6  0.0   0:13.75
> > > kworker/60:1H-k
> > >
> > >
> > > I root cause this issue -
> > >
> > > Block layer always queue IO on hctx context mapped to CPU-13, but hw
> > > queue run from all the hctx context.
> > > I noticed in my test hctx48 has queued all the IOs. No other hctx has
> > > queued IO. But all the hctx is counting for "run".
> > >
> > > # cat hctx48/queued
> > > 2087058
> > >
> > > #cat hctx*/run
> > > 151318
> > > 30038
> > > 83110
> > > 50680
> > > 69907
> > > 60391
> > > 111239
> > > 18036
> > > 33935
> > > 91648
> > > 34582
> > > 22853
> > > 61286
> > > 19489
> > >
> > > Below patch has fix - "Run the hctx queue for which request was
> > > completed instead of running all the hardware queue."
> > > If this looks valid fix, please include in V8 OR I can post separate
> > > patch for this. Just want to have some level of review from this
> discussion.
> > >
> > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index
> > > 0652acd..f52118f 100644
> > > --- a/drivers/scsi/scsi_lib.c
> > > +++ b/drivers/scsi/scsi_lib.c
> > > @@ -554,6 +554,7 @@ static bool scsi_end_request(struct request *req,
> > > blk_status_t error,
> > >         struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> > >         struct scsi_device *sdev = cmd->device;
> > >         struct request_queue *q = sdev->request_queue;
> > > +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> > >
> > >         if (blk_update_request(req, error, bytes))
> > >                 return true;
> > > @@ -595,7 +596,8 @@ static bool scsi_end_request(struct request *req,
> > > blk_status_t error,
> > >             !list_empty(&sdev->host->starved_list))
> > >                 kblockd_schedule_work(&sdev->requeue_work);
> > >         else
> > > -               blk_mq_run_hw_queues(q, true);
> > > +               blk_mq_run_hw_queue(mq_hctx, true);
> > > +               //blk_mq_run_hw_queues(q, true);
> >
> > This way may cause IO hang because ->device_busy is shared by all hctxs.
> 
> From SCSI stack, if we attempt to run all h/w queue, is it possible that
> block layer actually run hw_queue which has really not queued any IO.
> Currently, in case of mq-deadline, IOS are inserted using
> "dd_insert_request". This function will add IOs on elevator data which is
> per request queue and not per hctx.
> When there is an attempt to run hctx, "blk_mq_sched_has_work" will check
> pending work which is per request queue and not per hctx.
> Because of this, IOs queued on only one hctx will be run from all the hctx
> and this will create unnecessary lock contention.

Deadline is intended for HDDs and other slow disks, so the lock contention
shouldn't have been a problem, given that there were seldom any MQ HDDs
before this patchset.

Another related issue is the default scheduler: I guess deadline should still
be the default io sched for HDDs attached to this kind of HBA with multiple
reply queues and a single submission queue.

> 
> How about below patch - ?
> 
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 126021f..1d30bd3 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
> blk_mq_hw_ctx *hctx)
>  {
>         struct elevator_queue *e = hctx->queue->elevator;
> 
> +       /* If current hctx has not queued any request, there is no need to
> run.
> +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> +        * running specific hctx.
> +        */
> +       if (!hctx->queued)
> +               return false;
> +
>         if (e && e->type->ops.has_work)
>                 return e->type->ops.has_work(hctx);

->queued is only ever increased, never decreased, and so far it exists just
for debug purposes, so it can't be relied on for this.

One approach is to add a similar counter and maintain it in the scheduler's
insert/dispatch callbacks.
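
As a minimal sketch of that idea, assuming a per-hctx counter that goes up on
insert and down on dispatch, something like the following user-space C11
model could serve as a starting point. struct toy_hctx, sched_queued and the
toy_* helpers are hypothetical and are not existing blk-mq names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-hctx state; sched_queued is not a real blk-mq field. */
struct toy_hctx {
	atomic_int sched_queued;  /* inserted but not yet dispatched */
};

static void toy_hctx_init(struct toy_hctx *h)
{
	atomic_init(&h->sched_queued, 0);
}

/* Would be called from the scheduler's insert path. */
static void toy_sched_insert(struct toy_hctx *h)
{
	atomic_fetch_add(&h->sched_queued, 1);
}

/* Would be called from the dispatch path; never drops below zero. */
static bool toy_sched_dispatch(struct toy_hctx *h)
{
	int old = atomic_load(&h->sched_queued);

	while (old > 0) {
		/* On failure, old is reloaded and the loop rechecks it. */
		if (atomic_compare_exchange_weak(&h->sched_queued, &old, old - 1))
			return true;
	}
	return false;
}

/* Cheap per-hctx check that could gate running this hctx. */
static bool toy_hctx_has_work(struct toy_hctx *h)
{
	return atomic_load(&h->sched_queued) > 0;
}
```

Unlike ->queued, this count returns to zero once everything inserted has been
dispatched. Which hctx the decrement should be charged to, when a shared
scheduler hands out a request through a different hctx than the one that
inserted it, is a design question this sketch does not settle.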

Thanks,
Ming



* RE: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-16  1:00           ` Ming Lei
@ 2020-06-17 11:26             ` Kashyap Desai
  2020-06-22  6:24               ` Hannes Reinecke
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-06-17 11:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> On Mon, Jun 15, 2020 at 12:27:30PM +0530, Kashyap Desai wrote:
> > > >
> > > > John -
> > > >
> > > > I tried V7 series and debug further on mq-deadline interface. This
> > > > time I have used another setup since HDD based setup is not
> > > > readily available for me.
> > > > In fact, I was able to simulate issue very easily using single
> > > > scsi_device as well. BTW, this is not an issue with this RFC, but
> > generic issue.
> > > > Since I have converted nr_hw_queue > 1 for Broadcom product using
> > > > this RFC, It becomes noticeable now.
> > > >
> > > > Problem - Using below command  I see heavy CPU utilization on "
> > > > native_queued_spin_lock_slowpath". This is because kblockd work
> > > > queue is submitting IO from all the CPUs even though fio is bound
> > > > to single
> > CPU.
> > > > Lock contention from " dd_dispatch_request" is causing this issue.
> > > >
> > > > numactl -C 13  fio
> > > > single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none
> > > > --numjobs=1  --cpus_allowed_policy=split --ioscheduler=mq-deadline
> > > > --group_reporting --filename=/dev/sdd
> > > >
> > > > While running above command, ideally we expect only kworker/13 to
> > > > be
> > > active.
> > > > But you can see below - All the CPU is attempting submission and
> > > > lots of CPU consumption is due to lock contention.
> > > >
> > > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> > > >  2726 root       0 -20       0      0      0 R  56.5  0.0   0:53.20 kworker/13:1H-k
> > > >  7815 root      20   0  712404  15536   2228 R  43.2  0.0   0:05.03 fio
> > > >  2792 root       0 -20       0      0      0 I  26.6  0.0   0:22.19 kworker/18:1H-k
> > > >  2791 root       0 -20       0      0      0 I  19.9  0.0   0:17.17 kworker/19:1H-k
> > > >  1419 root       0 -20       0      0      0 I  19.6  0.0   0:17.03 kworker/20:1H-k
> > > >  2793 root       0 -20       0      0      0 I  18.3  0.0   0:15.64 kworker/21:1H-k
> > > >  1424 root       0 -20       0      0      0 I  17.3  0.0   0:14.99 kworker/22:1H-k
> > > >  2626 root       0 -20       0      0      0 I  16.9  0.0   0:14.68 kworker/26:1H-k
> > > >  2794 root       0 -20       0      0      0 I  16.9  0.0   0:14.87 kworker/23:1H-k
> > > >  2795 root       0 -20       0      0      0 I  16.9  0.0   0:14.81 kworker/24:1H-k
> > > >  2797 root       0 -20       0      0      0 I  16.9  0.0   0:14.62 kworker/27:1H-k
> > > >  1415 root       0 -20       0      0      0 I  16.6  0.0   0:14.44 kworker/30:1H-k
> > > >  2669 root       0 -20       0      0      0 I  16.6  0.0   0:14.38 kworker/31:1H-k
> > > >  2796 root       0 -20       0      0      0 I  16.6  0.0   0:14.74 kworker/25:1H-k
> > > >  2799 root       0 -20       0      0      0 I  16.6  0.0   0:14.56 kworker/28:1H-k
> > > >  1425 root       0 -20       0      0      0 I  16.3  0.0   0:14.21 kworker/34:1H-k
> > > >  2746 root       0 -20       0      0      0 I  16.3  0.0   0:14.33 kworker/32:1H-k
> > > >  2798 root       0 -20       0      0      0 I  16.3  0.0   0:14.50 kworker/29:1H-k
> > > >  2800 root       0 -20       0      0      0 I  16.3  0.0   0:14.27 kworker/33:1H-k
> > > >  1423 root       0 -20       0      0      0 I  15.9  0.0   0:14.10 kworker/54:1H-k
> > > >  1784 root       0 -20       0      0      0 I  15.9  0.0   0:14.03 kworker/55:1H-k
> > > >  2801 root       0 -20       0      0      0 I  15.9  0.0   0:14.15 kworker/35:1H-k
> > > >  2815 root       0 -20       0      0      0 I  15.9  0.0   0:13.97 kworker/56:1H-k
> > > >  1484 root       0 -20       0      0      0 I  15.6  0.0   0:13.90 kworker/57:1H-k
> > > >  1485 root       0 -20       0      0      0 I  15.6  0.0   0:13.82 kworker/59:1H-k
> > > >  1519 root       0 -20       0      0      0 I  15.6  0.0   0:13.64 kworker/62:1H-k
> > > >  2315 root       0 -20       0      0      0 I  15.6  0.0   0:13.87 kworker/58:1H-k
> > > >  2627 root       0 -20       0      0      0 I  15.6  0.0   0:13.69 kworker/61:1H-k
> > > >  2816 root       0 -20       0      0      0 I  15.6  0.0   0:13.75 kworker/60:1H-k
> > > >
> > > >
> > > > I root cause this issue -
> > > >
> > > > Block layer always queue IO on hctx context mapped to CPU-13, but
> > > > hw queue run from all the hctx context.
> > > > I noticed in my test hctx48 has queued all the IOs. No other hctx
> > > > has queued IO. But all the hctx is counting for "run".
> > > >
> > > > # cat hctx48/queued
> > > > 2087058
> > > >
> > > > #cat hctx*/run
> > > > 151318
> > > > 30038
> > > > 83110
> > > > 50680
> > > > 69907
> > > > 60391
> > > > 111239
> > > > 18036
> > > > 33935
> > > > 91648
> > > > 34582
> > > > 22853
> > > > 61286
> > > > 19489
> > > >
> > > > Below patch has fix - "Run the hctx queue for which request was
> > > > completed instead of running all the hardware queue."
> > > > If this looks valid fix, please include in V8 OR I can post
> > > > separate patch for this. Just want to have some level of review
> > > > from this
> > discussion.
> > > >
> > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > > index 0652acd..f52118f 100644
> > > > --- a/drivers/scsi/scsi_lib.c
> > > > +++ b/drivers/scsi/scsi_lib.c
> > > > @@ -554,6 +554,7 @@ static bool scsi_end_request(struct request
> > > > *req, blk_status_t error,
> > > >         struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> > > >         struct scsi_device *sdev = cmd->device;
> > > >         struct request_queue *q = sdev->request_queue;
> > > > +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> > > >
> > > >         if (blk_update_request(req, error, bytes))
> > > >                 return true;
> > > > @@ -595,7 +596,8 @@ static bool scsi_end_request(struct request
> > > > *req, blk_status_t error,
> > > >             !list_empty(&sdev->host->starved_list))
> > > >                 kblockd_schedule_work(&sdev->requeue_work);
> > > >         else
> > > > -               blk_mq_run_hw_queues(q, true);
> > > > +               blk_mq_run_hw_queue(mq_hctx, true);
> > > > +               //blk_mq_run_hw_queues(q, true);
> > >
> > > This way may cause IO hang because ->device_busy is shared by all hctxs.
> >
> > From SCSI stack, if we attempt to run all h/w queue, is it possible
> > that block layer actually run hw_queue which has really not queued any IO.
> > Currently, in case of mq-deadline, IOS are inserted using
> > "dd_insert_request". This function will add IOs on elevator data which
> > is per request queue and not per hctx.
> > When there is an attempt to run hctx, "blk_mq_sched_has_work" will
> > check pending work which is per request queue and not per hctx.
> > Because of this, IOs queued on only one hctx will be run from all the
> > hctx and this will create unnecessary lock contention.
>
> Deadline is supposed for HDD. slow disks, so the lock contention shouldn't
> have been one problem given there is seldom MQ HDD. before this patchset.
>
> Another related issue is default scheduler, I guess deadline still should
> have been the default io sched for HDDs. attached to this kind HBA with
> multiple reply queues and single submission queue.
>
> >
> > How about below patch - ?
> >
> > diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> > index 126021f..1d30bd3 100644
> > --- a/block/blk-mq-sched.h
> > +++ b/block/blk-mq-sched.h
> > @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
> >  {
> >         struct elevator_queue *e = hctx->queue->elevator;
> >
> > +       /* If current hctx has not queued any request, there is no need to run.
> > +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> > +        * running specific hctx.
> > +        */
> > +       if (!hctx->queued)
> > +               return false;
> > +
> >         if (e && e->type->ops.has_work)
> >                 return e->type->ops.has_work(hctx);
>
> ->queued is increased only and not decreased just for debug purpose so far,
> so it can't be relied for this purpose.

Thanks. I overlooked that it is an increment-only counter.

>
> One approach is to add one similar counter, and maintain it by scheduler's
> insert/dispatch callback.

I tried the patch below and I see performance in the expected range.

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index fdcc2c1..ea201d0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -485,6 +485,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,

                list_add(&rq->queuelist, &list);
                e->type->ops.insert_requests(hctx, &list, at_head);
+               atomic_inc(&hctx->elevator_queued);
        } else {
                spin_lock(&ctx->lock);
                __blk_mq_insert_request(hctx, rq, at_head);
@@ -511,8 +512,10 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
        percpu_ref_get(&q->q_usage_counter);

        e = hctx->queue->elevator;
-       if (e && e->type->ops.insert_requests)
+       if (e && e->type->ops.insert_requests) {
                e->type->ops.insert_requests(hctx, list, false);
+               atomic_inc(&hctx->elevator_queued);
+       }
        else {
                /*
                 * try to issue requests directly if the hw queue isn't
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021f..946b47a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct elevator_queue *e = hctx->queue->elevator;

+       /* If current hctx has not queued any request, there is no need to run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
+
        if (e && e->type->ops.has_work)
                return e->type->ops.has_work(hctx);

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f73a2f9..48f1824 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -517,8 +517,10 @@ void blk_mq_free_request(struct request *rq)
        struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

        if (rq->rq_flags & RQF_ELVPRIV) {
-               if (e && e->type->ops.finish_request)
+               if (e && e->type->ops.finish_request) {
                        e->type->ops.finish_request(rq);
+                       atomic_dec(&hctx->elevator_queued);
+               }
                if (rq->elv.icq) {
                        put_io_context(rq->elv.icq->ioc);
                        rq->elv.icq = NULL;
@@ -2571,6 +2573,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
                goto free_hctx;

        atomic_set(&hctx->nr_active, 0);
+       atomic_set(&hctx->elevator_queued, 0);
        if (node == NUMA_NO_NODE)
                node = set->numa_node;
        hctx->numa_node = node;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 66711c7..ea1ddb1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
         * shared across request queues.
         */
        atomic_t                nr_active;
+       /**
+        * @elevator_queued: Number of queued requests on hctx.
+        */
+       atomic_t                elevator_queued;

        /** @cpuhp_online: List to store request if CPU is going to die */
        struct hlist_node       cpuhp_online;



>
> Thanks,
> Ming


* Re: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-17 11:26             ` Kashyap Desai
@ 2020-06-22  6:24               ` Hannes Reinecke
  2020-06-23  0:55                 ` Ming Lei
  0 siblings, 1 reply; 123+ messages in thread
From: Hannes Reinecke @ 2020-06-22  6:24 UTC (permalink / raw)
  To: Kashyap Desai, Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On 6/17/20 1:26 PM, Kashyap Desai wrote:
>>
>> ->queued is increased only and not decreased just for debug purpose so far,
>> so it can't be relied for this purpose.
> 
> Thanks. I overlooked that that it is only incremental counter.
> 
>>
>> One approach is to add one similar counter, and maintain it by scheduler's
>> insert/dispatch callback.
> 
> I tried below  and I see performance is on expected range.
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index fdcc2c1..ea201d0 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -485,6 +485,7 @@ void blk_mq_sched_insert_request(struct request *rq,
> bool at_head,
> 
>                  list_add(&rq->queuelist, &list);
>                  e->type->ops.insert_requests(hctx, &list, at_head);
> +               atomic_inc(&hctx->elevator_queued);
>          } else {
>                  spin_lock(&ctx->lock);
>                  __blk_mq_insert_request(hctx, rq, at_head);
> @@ -511,8 +512,10 @@ void blk_mq_sched_insert_requests(struct
> blk_mq_hw_ctx *hctx,
>          percpu_ref_get(&q->q_usage_counter);
> 
>          e = hctx->queue->elevator;
> -       if (e && e->type->ops.insert_requests)
> +       if (e && e->type->ops.insert_requests) {
>                  e->type->ops.insert_requests(hctx, list, false);
> +               atomic_inc(&hctx->elevator_queued);
> +       }
>          else {
>                  /*
>                   * try to issue requests directly if the hw queue isn't
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 126021f..946b47a 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
> blk_mq_hw_ctx *hctx)
>   {
>          struct elevator_queue *e = hctx->queue->elevator;
> 
> +       /* If current hctx has not queued any request, there is no need to
> run.
> +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> +        * running specific hctx.
> +        */
> +       if (!atomic_read(&hctx->elevator_queued))
> +               return false;
> +
>          if (e && e->type->ops.has_work)
>                  return e->type->ops.has_work(hctx);
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f73a2f9..48f1824 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -517,8 +517,10 @@ void blk_mq_free_request(struct request *rq)
>          struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> 
>          if (rq->rq_flags & RQF_ELVPRIV) {
> -               if (e && e->type->ops.finish_request)
> +               if (e && e->type->ops.finish_request) {
>                          e->type->ops.finish_request(rq);
> +                       atomic_dec(&hctx->elevator_queued);
> +               }
>                  if (rq->elv.icq) {
>                          put_io_context(rq->elv.icq->ioc);
>                          rq->elv.icq = NULL;
> @@ -2571,6 +2573,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct
> blk_mq_tag_set *set,
>                  goto free_hctx;
> 
>          atomic_set(&hctx->nr_active, 0);
> +       atomic_set(&hctx->elevator_queued, 0);
>          if (node == NUMA_NO_NODE)
>                  node = set->numa_node;
>          hctx->numa_node = node;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 66711c7..ea1ddb1 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
>           * shared across request queues.
>           */
>          atomic_t                nr_active;
> +       /**
> +        * @elevator_queued: Number of queued requests on hctx.
> +        */
> +       atomic_t                elevator_queued;
> 
>          /** @cpuhp_online: List to store request if CPU is going to die */
>          struct hlist_node       cpuhp_online;
> 
> 
> 
Would it make sense to move it into the elevator itself?
It's a bit odd having a value 'elevator_queued' with no direct reference 
to the elevator...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-22  6:24               ` Hannes Reinecke
@ 2020-06-23  0:55                 ` Ming Lei
  2020-06-23 11:50                   ` Kashyap Desai
  2020-06-23 12:11                   ` Kashyap Desai
  0 siblings, 2 replies; 123+ messages in thread
From: Ming Lei @ 2020-06-23  0:55 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Kashyap Desai, John Garry, axboe, jejb, martin.petersen,
	don.brace, Sumit Saxena, bvanassche, hare, hch,
	Shivasharan Srikanteshwara, linux-block, linux-scsi,
	esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

On Mon, Jun 22, 2020 at 08:24:39AM +0200, Hannes Reinecke wrote:
> On 6/17/20 1:26 PM, Kashyap Desai wrote:
> > > 
> > > ->queued is increased only and not decreased just for debug purpose so
> > > ->far, so
> > > it can't be relied for this purpose.
> > 
> > > Thanks. I overlooked that it is only an incremental counter.
> > 
> > > 
> > > One approach is to add one similar counter, and maintain it by
> > scheduler's
> > > insert/dispatch callback.
> > 
> > I tried below  and I see performance is on expected range.
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index fdcc2c1..ea201d0 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -485,6 +485,7 @@ void blk_mq_sched_insert_request(struct request *rq,
> > bool at_head,
> > 
> >                  list_add(&rq->queuelist, &list);
> >                  e->type->ops.insert_requests(hctx, &list, at_head);
> > +               atomic_inc(&hctx->elevator_queued);
> >          } else {
> >                  spin_lock(&ctx->lock);
> >                  __blk_mq_insert_request(hctx, rq, at_head);
> > @@ -511,8 +512,10 @@ void blk_mq_sched_insert_requests(struct
> > blk_mq_hw_ctx *hctx,
> >          percpu_ref_get(&q->q_usage_counter);
> > 
> >          e = hctx->queue->elevator;
> > -       if (e && e->type->ops.insert_requests)
> > +       if (e && e->type->ops.insert_requests) {
> >                  e->type->ops.insert_requests(hctx, list, false);
> > +               atomic_inc(&hctx->elevator_queued);
> > +       }
> >          else {
> >                  /*
> >                   * try to issue requests directly if the hw queue isn't
> > diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> > index 126021f..946b47a 100644
> > --- a/block/blk-mq-sched.h
> > +++ b/block/blk-mq-sched.h
> > @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
> > blk_mq_hw_ctx *hctx)
> >   {
> >          struct elevator_queue *e = hctx->queue->elevator;
> > 
> > +       /* If current hctx has not queued any request, there is no need to
> > run.
> > +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> > +        * running specific hctx.
> > +        */
> > +       if (!atomic_read(&hctx->elevator_queued))
> > +               return false;
> > +
> >          if (e && e->type->ops.has_work)
> >                  return e->type->ops.has_work(hctx);
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index f73a2f9..48f1824 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -517,8 +517,10 @@ void blk_mq_free_request(struct request *rq)
> >          struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> > 
> >          if (rq->rq_flags & RQF_ELVPRIV) {
> > -               if (e && e->type->ops.finish_request)
> > +               if (e && e->type->ops.finish_request) {
> >                          e->type->ops.finish_request(rq);
> > +                       atomic_dec(&hctx->elevator_queued);
> > +               }
> >                  if (rq->elv.icq) {
> >                          put_io_context(rq->elv.icq->ioc);
> >                          rq->elv.icq = NULL;
> > @@ -2571,6 +2573,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct
> > blk_mq_tag_set *set,
> >                  goto free_hctx;
> > 
> >          atomic_set(&hctx->nr_active, 0);
> > +       atomic_set(&hctx->elevator_queued, 0);
> >          if (node == NUMA_NO_NODE)
> >                  node = set->numa_node;
> >          hctx->numa_node = node;
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index 66711c7..ea1ddb1 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
> >           * shared across request queues.
> >           */
> >          atomic_t                nr_active;
> > +       /**
> > +        * @elevator_queued: Number of queued requests on hctx.
> > +        */
> > +       atomic_t                elevator_queued;
> > 
> >          /** @cpuhp_online: List to store request if CPU is going to die */
> >          struct hlist_node       cpuhp_online;
> > 
> > 
> > 
> Would it make sense to move it into the elevator itself?

That was my initial suggestion: the counter is maintained only for bfq &
mq-deadline, so we needn't pay the cost for the others.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-11  8:26     ` John Garry
@ 2020-06-23 11:25       ` John Garry
  2020-06-23 14:23         ` Hannes Reinecke
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-06-23 11:25 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 11/06/2020 09:26, John Garry wrote:
> On 11/06/2020 03:57, Ming Lei wrote:
>> On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
>>> From: Hannes Reinecke <hare@suse.de>
>>>
>>> The function does not set the depth, but rather transitions from
>>> shared to non-shared queues and vice versa.
>>> So rename it to blk_mq_update_tag_set_shared() to better reflect
>>> its purpose.
>>
>> It is fine to rename it for me, however:
>>
>> This patch claims to rename blk_mq_update_tag_set_shared(), but also
>> change blk_mq_init_bitmap_tags's signature.
> 
> I was going to update the commit message here, but forgot again...
> 
>>
>> So suggest to split this patch into two or add comment log on changing
>> blk_mq_init_bitmap_tags().
> 
> I think I'll just split into 2x commits.

Hi Hannes,

Do you have any issue with splitting the undocumented changes into 
another patch as so:

-------------------->8---------------------

 From db3f8ec1295efbf53273ffc137d348857cbd411e Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@suse.de>
Date: Tue, 23 Jun 2020 12:07:33 +0100
Subject: [PATCH] blk-mq: Free tags in blk_mq_init_tags() upon error

Since the tags are allocated in blk_mq_init_tags(), it's better practice
to free them in that same function upon error, rather than in a callee whose
job is to init the bitmap tags - blk_mq_init_bitmap_tags().

Signed-off-by: Hannes Reinecke <hare@suse.de>
[jpg: split from an earlier patch with a new commit message, minor mod 
to return NULL directly for error]
Signed-off-by: John Garry <john.garry@huawei.com>

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 1085dc9848f3..b8972014cd90 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -487,24 +487,22 @@ static int bt_alloc(struct sbitmap_queue *bt, 
unsigned int depth,
  				       node);
  }

-static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct blk_mq_tags 
*tags,
-						   int node, int alloc_policy)
+static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
+				   int node, int alloc_policy)
  {
  	unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
  	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;

  	if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node))
-		goto free_tags;
+		return -ENOMEM;
  	if (bt_alloc(&tags->breserved_tags, tags->nr_reserved_tags, round_robin,
  		     node))
  		goto free_bitmap_tags;

-	return tags;
+	return 0;
  free_bitmap_tags:
  	sbitmap_queue_free(&tags->bitmap_tags);
-free_tags:
-	kfree(tags);
-	return NULL;
+	return -ENOMEM;
  }

  struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
@@ -525,7 +523,11 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int 
total_tags,
  	tags->nr_tags = total_tags;
  	tags->nr_reserved_tags = reserved_tags;

-	return blk_mq_init_bitmap_tags(tags, node, alloc_policy);
+	if (blk_mq_init_bitmap_tags(tags, node, alloc_policy) < 0) {
+		kfree(tags);
+		return NULL;
+	}
+	return tags;
  }

  void blk_mq_free_tags(struct blk_mq_tags *tags)

--------------------8<---------------------
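For illustration only (this is not part of the patch above): the ownership
rule from the commit message, where the function that allocates an object is
also the one that frees it on error, can be sketched in plain userspace C.
The names tags_model, init_bitmap(), init_tags(), free_tags() and the "fail"
fault-injection argument are all hypothetical stand-ins, not kernel symbols.

```c
#include <stdlib.h>

/* Hypothetical stand-in for struct blk_mq_tags. */
struct tags_model {
	unsigned int nr_tags;
	unsigned long *bitmap;
};

/* Callee: initialises members only; returns 0 or -1 and never frees @tags. */
static int init_bitmap(struct tags_model *tags, int fail)
{
	if (fail)	/* hypothetical fault injection, for the example only */
		return -1;
	tags->bitmap = calloc(tags->nr_tags, sizeof(*tags->bitmap));
	return tags->bitmap ? 0 : -1;
}

/* Caller: allocates @tags, so it is also the one to free it on error. */
static struct tags_model *init_tags(unsigned int nr_tags, int fail)
{
	struct tags_model *tags = calloc(1, sizeof(*tags));

	if (!tags)
		return NULL;
	tags->nr_tags = nr_tags;
	if (init_bitmap(tags, fail) < 0) {
		free(tags);
		return NULL;
	}
	return tags;
}

/* Tear-down counterpart of init_tags(). */
static void free_tags(struct tags_model *tags)
{
	if (!tags)
		return;
	free(tags->bitmap);
	free(tags);
}
```

With this split, the callee can return a plain error code and has no need to
know how its argument was allocated.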

Thanks

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-23  0:55                 ` Ming Lei
@ 2020-06-23 11:50                   ` Kashyap Desai
  2020-06-23 12:11                   ` Kashyap Desai
  1 sibling, 0 replies; 123+ messages in thread
From: Kashyap Desai @ 2020-06-23 11:50 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

>
> On Mon, Jun 22, 2020 at 08:24:39AM +0200, Hannes Reinecke wrote:
> > On 6/17/20 1:26 PM, Kashyap Desai wrote:
> > > >
> > > > ->queued is increased only and not decreased just for debug
> > > > ->purpose so far, so
> > > > it can't be relied for this purpose.
> > >
> > > > Thanks. I overlooked that it is only an incremental counter.
> > >
> > > >
> > > > One approach is to add one similar counter, and maintain it by
> > > scheduler's
> > > > insert/dispatch callback.
> > >
> > > I tried below  and I see performance is on expected range.
> > >
> > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > index fdcc2c1..ea201d0 100644
> > > > --- a/block/blk-mq-sched.c
> > > > +++ b/block/blk-mq-sched.c
> > > > @@ -485,6 +485,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
> > > >
> > > >                  list_add(&rq->queuelist, &list);
> > > >                  e->type->ops.insert_requests(hctx, &list, at_head);
> > > > +               atomic_inc(&hctx->elevator_queued);
> > > >          } else {
> > > >                  spin_lock(&ctx->lock);
> > > >                  __blk_mq_insert_request(hctx, rq, at_head);
> > > > @@ -511,8 +512,10 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
> > > >          percpu_ref_get(&q->q_usage_counter);
> > > >
> > > >          e = hctx->queue->elevator;
> > > > -       if (e && e->type->ops.insert_requests)
> > > > +       if (e && e->type->ops.insert_requests) {
> > > >                  e->type->ops.insert_requests(hctx, list, false);
> > > > +               atomic_inc(&hctx->elevator_queued);
> > > > +       }
> > > >          else {
> > > >                  /*
> > > >                   * try to issue requests directly if the hw queue isn't
> > > > diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> > > > index 126021f..946b47a 100644
> > > > --- a/block/blk-mq-sched.h
> > > > +++ b/block/blk-mq-sched.h
> > > > @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
> > > >   {
> > > >          struct elevator_queue *e = hctx->queue->elevator;
> > > >
> > > > +       /* If current hctx has not queued any request, there is no need to run.
> > > > +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> > > > +        * running specific hctx.
> > > > +        */
> > > > +       if (!atomic_read(&hctx->elevator_queued))
> > > > +               return false;
> > > > +
> > > >          if (e && e->type->ops.has_work)
> > > >                  return e->type->ops.has_work(hctx);
> > > >
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index f73a2f9..48f1824 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -517,8 +517,10 @@ void blk_mq_free_request(struct request *rq)
> > > >          struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> > > >
> > > >          if (rq->rq_flags & RQF_ELVPRIV) {
> > > > -               if (e && e->type->ops.finish_request)
> > > > +               if (e && e->type->ops.finish_request) {
> > > >                          e->type->ops.finish_request(rq);
> > > > +                       atomic_dec(&hctx->elevator_queued);
> > > > +               }
> > > >                  if (rq->elv.icq) {
> > > >                          put_io_context(rq->elv.icq->ioc);
> > > >                          rq->elv.icq = NULL;
> > > > @@ -2571,6 +2573,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
> > > >                  goto free_hctx;
> > > >
> > > >          atomic_set(&hctx->nr_active, 0);
> > > > +       atomic_set(&hctx->elevator_queued, 0);
> > > >          if (node == NUMA_NO_NODE)
> > > >                  node = set->numa_node;
> > > >          hctx->numa_node = node;
> > > > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > > > index 66711c7..ea1ddb1 100644
> > > > --- a/include/linux/blk-mq.h
> > > > +++ b/include/linux/blk-mq.h
> > > > @@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
> > > >           * shared across request queues.
> > > >           */
> > > >          atomic_t                nr_active;
> > > > +       /**
> > > > +        * @elevator_queued: Number of queued requests on hctx.
> > > > +        */
> > > > +       atomic_t                elevator_queued;
> > > >
> > > >          /** @cpuhp_online: List to store request if CPU is going to die */
> > > >          struct hlist_node       cpuhp_online;
> > > >
> > > >
> > > >
> > Would it make sense to move it into the elevator itself?

I am not sure where exactly I should add this counter, since I need a counter
per hctx. Elevator data is per request object.
Please suggest.

>
> That is my initial suggestion, and the counter is just done for bfq &
> mq-deadline, then we needn't to pay the cost for others.

I have updated the patch -

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a1123d4..3e0005c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4640,6 +4640,12 @@ static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;

+       /* If current hctx has not queued any request, there is no need to run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
        /*
         * Avoiding lock: a race on bfqd->busy_queues should cause at
         * most a call to dispatch for nothing
@@ -5554,6 +5561,7 @@ static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
                rq = list_first_entry(list, struct request, queuelist);
                list_del_init(&rq->queuelist);
                bfq_insert_request(hctx, rq, at_head);
+              atomic_inc(&hctx->elevator_queued);
        }
 }

@@ -5925,6 +5933,7 @@ static void bfq_finish_requeue_request(struct request *rq)

        if (likely(rq->rq_flags & RQF_STARTED)) {
                unsigned long flags;
+              struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

                spin_lock_irqsave(&bfqd->lock, flags);

@@ -5934,6 +5943,7 @@ static void bfq_finish_requeue_request(struct request *rq)
                bfq_completed_request(bfqq, bfqd);
                bfq_finish_requeue_request_body(bfqq);

+              atomic_dec(&hctx->elevator_queued);
                spin_unlock_irqrestore(&bfqd->lock, flags);
        } else {
                /*
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021f..946b47a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct elevator_queue *e = hctx->queue->elevator;

+       /* If current hctx has not queued any request, there is no need to run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
+
        if (e && e->type->ops.has_work)
                return e->type->ops.has_work(hctx);

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f73a2f9..82dd152 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2571,6 +2571,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
                goto free_hctx;

        atomic_set(&hctx->nr_active, 0);
+      atomic_set(&hctx->elevator_queued, 0);
        if (node == NUMA_NO_NODE)
                node = set->numa_node;
        hctx->numa_node = node;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b57470e..703ac55 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -533,6 +533,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
                rq = list_first_entry(list, struct request, queuelist);
                list_del_init(&rq->queuelist);
                dd_insert_request(hctx, rq, at_head);
+              atomic_inc(&hctx->elevator_queued);
        }
        spin_unlock(&dd->lock);
 }
@@ -562,6 +563,7 @@ static void dd_prepare_request(struct request *rq)
 static void dd_finish_request(struct request *rq)
 {
        struct request_queue *q = rq->q;
+      struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

        if (blk_queue_is_zoned(q)) {
                struct deadline_data *dd = q->elevator->elevator_data;
@@ -570,15 +572,23 @@ static void dd_finish_request(struct request *rq)
                spin_lock_irqsave(&dd->zone_lock, flags);
                blk_req_zone_write_unlock(rq);
                if (!list_empty(&dd->fifo_list[WRITE]))
-                       blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
+                       blk_mq_sched_mark_restart_hctx(hctx);
                spin_unlock_irqrestore(&dd->zone_lock, flags);
        }
+       atomic_dec(&hctx->elevator_queued);
 }

 static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct deadline_data *dd = hctx->queue->elevator->elevator_data;

+       /* If current hctx has not queued any request, there is no need to run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
+
        return !list_empty_careful(&dd->dispatch) ||
                !list_empty_careful(&dd->fifo_list[0]) ||
                !list_empty_careful(&dd->fifo_list[1]);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 66711c7..ea1ddb1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
         * shared across request queues.
         */
        atomic_t                nr_active;
+       /**
+        * @elevator_queued: Number of queued requests on hctx.
+        */
+       atomic_t                elevator_queued;

        /** @cpuhp_online: List to store request if CPU is going to die */
        struct hlist_node       cpuhp_online;


>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-23  0:55                 ` Ming Lei
  2020-06-23 11:50                   ` Kashyap Desai
@ 2020-06-23 12:11                   ` Kashyap Desai
  1 sibling, 0 replies; 123+ messages in thread
From: Kashyap Desai @ 2020-06-23 12:11 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> > > >
> > > Would it make sense to move it into the elevator itself?
>
> I am not sure where exactly I should add this counter since I need counter
> per hctx. Elevator data is per request object.
> Please suggest.
>
> >
> > That is my initial suggestion, and the counter is just done for bfq &
> > mq-deadline, then we needn't to pay the cost for others.
>
> I have updated patch -
>
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index a1123d4..3e0005c 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -4640,6 +4640,12 @@ static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
>  {
>         struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
>
> +       /* If current hctx has not queued any request, there is no need to run.
> +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> +        * running specific hctx.
> +        */
> +       if (!atomic_read(&hctx->elevator_queued))
> +               return false;
>         /*
>          * Avoiding lock: a race on bfqd->busy_queues should cause at
>          * most a call to dispatch for nothing
> @@ -5554,6 +5561,7 @@ static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
>                 rq = list_first_entry(list, struct request, queuelist);
>                 list_del_init(&rq->queuelist);
>                 bfq_insert_request(hctx, rq, at_head);
> +              atomic_inc(&hctx->elevator_queued);
>         }
>  }
>
> @@ -5925,6 +5933,7 @@ static void bfq_finish_requeue_request(struct request *rq)
>
>         if (likely(rq->rq_flags & RQF_STARTED)) {
>                 unsigned long flags;
> +              struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
>
>                 spin_lock_irqsave(&bfqd->lock, flags);
>
> @@ -5934,6 +5943,7 @@ static void bfq_finish_requeue_request(struct request *rq)
>                 bfq_completed_request(bfqq, bfqd);
>                 bfq_finish_requeue_request_body(bfqq);
>
> +              atomic_dec(&hctx->elevator_queued);
>                 spin_unlock_irqrestore(&bfqd->lock, flags);
>         } else {
>                 /*
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 126021f..946b47a 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
>  {
>         struct elevator_queue *e = hctx->queue->elevator;
>
> +       /* If current hctx has not queued any request, there is no need to run.
> +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> +        * running specific hctx.
> +        */
> +       if (!atomic_read(&hctx->elevator_queued))
> +               return false;
> +

I missed this. I will remove the above code, since the check is now handled
within the mq-deadline and bfq-iosched *has_work* callbacks.

>         if (e && e->type->ops.has_work)
>                 return e->type->ops.has_work(hctx);
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-23 11:25       ` John Garry
@ 2020-06-23 14:23         ` Hannes Reinecke
  2020-06-24  8:13           ` Kashyap Desai
  2020-08-10 16:51           ` Kashyap Desai
  0 siblings, 2 replies; 123+ messages in thread
From: Hannes Reinecke @ 2020-06-23 14:23 UTC (permalink / raw)
  To: John Garry, Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	Kashyap Desai


[-- Attachment #1: Type: text/plain, Size: 1385 bytes --]

On 6/23/20 1:25 PM, John Garry wrote:
> On 11/06/2020 09:26, John Garry wrote:
>> On 11/06/2020 03:57, Ming Lei wrote:
>>> On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
>>>> From: Hannes Reinecke <hare@suse.de>
>>>>
>>>> The function does not set the depth, but rather transitions from
>>>> shared to non-shared queues and vice versa.
>>>> So rename it to blk_mq_update_tag_set_shared() to better reflect
>>>> its purpose.
>>>
>>> It is fine to rename it for me, however:
>>>
>>> This patch claims to rename blk_mq_update_tag_set_shared(), but also
>>> change blk_mq_init_bitmap_tags's signature.
>>
>> I was going to update the commit message here, but forgot again...
>>
>>>
>>> So suggest to split this patch into two or add comment log on changing
>>> blk_mq_init_bitmap_tags().
>>
>> I think I'll just split into 2x commits.
> 
> Hi Hannes,
> 
> Do you have any issue with splitting the undocumented changes into 
> another patch as so:
> 
No, that's perfectly fine.

Kashyap, I've also attached an updated version of the elevator_queued 
patch; if you agree, John can include it in the next version.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

[-- Attachment #2: 0001-elevator-count-requests-per-hctx-to-improve-performa.patch --]
[-- Type: text/x-patch, Size: 3484 bytes --]

From d50b5f773713070208c405f7c7056eb1afed896a Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@suse.de>
Date: Tue, 23 Jun 2020 16:18:40 +0200
Subject: [PATCH] elevator: count requests per hctx to improve performance

Add an 'elevator_queued' count to the hctx to avoid triggering
the elevator when there are no requests queued.

Suggested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 block/bfq-iosched.c    | 5 +++++
 block/blk-mq.c         | 1 +
 block/mq-deadline.c    | 5 +++++
 include/linux/blk-mq.h | 4 ++++
 4 files changed, 15 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a1123d4d586d..3d63b35f6121 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4640,6 +4640,9 @@ static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
 {
 	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
 
+	if (!atomic_read(&hctx->elevator_queued))
+		return false;
+
 	/*
 	 * Avoiding lock: a race on bfqd->busy_queues should cause at
 	 * most a call to dispatch for nothing
@@ -5554,6 +5557,7 @@ static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
 		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		bfq_insert_request(hctx, rq, at_head);
+		atomic_inc(&hctx->elevator_queued);
 	}
 }
 
@@ -5933,6 +5937,7 @@ static void bfq_finish_requeue_request(struct request *rq)
 
 		bfq_completed_request(bfqq, bfqd);
 		bfq_finish_requeue_request_body(bfqq);
+		atomic_dec(&rq->mq_hctx->elevator_queued);
 
 		spin_unlock_irqrestore(&bfqd->lock, flags);
 	} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e06e8c9f326f..f5403fc97572 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2542,6 +2542,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 		goto free_hctx;
 
 	atomic_set(&hctx->nr_active, 0);
+	atomic_set(&hctx->elevator_queued, 0);
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 	hctx->numa_node = node;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b57470e154c8..9d753745e6be 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -533,6 +533,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		dd_insert_request(hctx, rq, at_head);
+		atomic_inc(&hctx->elevator_queued);
 	}
 	spin_unlock(&dd->lock);
 }
@@ -573,12 +574,16 @@ static void dd_finish_request(struct request *rq)
 			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
 		spin_unlock_irqrestore(&dd->zone_lock, flags);
 	}
+	atomic_dec(&rq->mq_hctx->elevator_queued);
 }
 
 static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
 
+	if (!atomic_read(&hctx->elevator_queued))
+		return false;
+
 	return !list_empty_careful(&dd->dispatch) ||
 		!list_empty_careful(&dd->fifo_list[0]) ||
 		!list_empty_careful(&dd->fifo_list[1]);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 66711c7234db..a18c506b14e7 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
 	 * shared across request queues.
 	 */
 	atomic_t		nr_active;
+	/**
+	 * @elevator_queued: Number of queued requests on hctx.
+	 */
+	atomic_t                elevator_queued;
 
 	/** @cpuhp_online: List to store request if CPU is going to die */
 	struct hlist_node	cpuhp_online;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-23 14:23         ` Hannes Reinecke
@ 2020-06-24  8:13           ` Kashyap Desai
  2020-06-29 16:18             ` John Garry
  2020-08-10 16:51           ` Kashyap Desai
  1 sibling, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-06-24  8:13 UTC (permalink / raw)
  To: Hannes Reinecke, John Garry, Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, Sumit Saxena,
	bvanassche, hare, hch, Shivasharan Srikanteshwara, linux-block,
	linux-scsi, esc.storagedev, chenxiang66

>
> On 6/23/20 1:25 PM, John Garry wrote:
> > On 11/06/2020 09:26, John Garry wrote:
> >> On 11/06/2020 03:57, Ming Lei wrote:
> >>> On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
> >>>> From: Hannes Reinecke <hare@suse.de>
> >>>>
> >>>> The function does not set the depth, but rather transitions from
> >>>> shared to non-shared queues and vice versa.
> >>>> So rename it to blk_mq_update_tag_set_shared() to better reflect
> >>>> its purpose.
> >>>
> >>> It is fine to rename it for me, however:
> >>>
> >>> This patch claims to rename blk_mq_update_tag_set_shared(), but also
> >>> change blk_mq_init_bitmap_tags's signature.
> >>
> >> I was going to update the commit message here, but forgot again...
> >>
> >>>
> >>> So suggest to split this patch into two or add comment log on
> >>> changing blk_mq_init_bitmap_tags().
> >>
> >> I think I'll just split into 2x commits.
> >
> > Hi Hannes,
> >
> > Do you have any issue with splitting the undocumented changes into
> > another patch as so:
> >
> No, that's perfectly fine.
>
> Kashyap, I've also attached an updated patch for the elevator_count patch;
> if
> you agree John can include it in the next version.

Hannes - Patch looks good, but the header does not include a problem
statement. How about adding the below to the header?

High CPU utilization in "native_queued_spin_lock_slowpath" due to lock
contention is possible with the mq-deadline and bfq I/O schedulers when
nr_hw_queues is more than one.
This is because the kblockd work queue can submit IO from all online CPUs
(through blk_mq_run_hw_queues()) even though only one hctx has pending
commands.
The elevator callback "has_work" for the mq-deadline and bfq schedulers
reports pending work if there are any IOs on the request queue, without
taking the hctx context into account.
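
To make the problem statement above concrete, here is a small userspace
model (not kernel code). It assumes a hypothetical queue with four hctxs and
counts how many of them a "run all hw queues" pass would try to dispatch
from, first with a queue-wide has_work check and then with a per-hctx one.
NR_HW_QUEUES, queue_model and the helper names are made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_HW_QUEUES 4	/* hypothetical number of hardware queues */

/* Hypothetical model of a request queue: one queued-request count per hctx. */
struct queue_model {
	int queued[NR_HW_QUEUES];
};

/* Queue-wide check: true if any hctx has a request, whichever hctx asks. */
static bool has_work_queue_wide(const struct queue_model *q, size_t hctx)
{
	(void)hctx;
	for (size_t i = 0; i < NR_HW_QUEUES; i++)
		if (q->queued[i])
			return true;
	return false;
}

/* Per-hctx check: only an hctx that actually queued requests reports work. */
static bool has_work_per_hctx(const struct queue_model *q, size_t hctx)
{
	return q->queued[hctx] != 0;
}

/* Count the hctxs that one "run all hw queues" pass would dispatch from. */
static int count_dispatch_attempts(const struct queue_model *q,
				   bool (*has_work)(const struct queue_model *, size_t))
{
	int attempts = 0;

	for (size_t i = 0; i < NR_HW_QUEUES; i++)
		if (has_work(q, i))
			attempts++;
	return attempts;
}
```

With requests queued on a single hctx, the queue-wide check makes all four
hctxs attempt a dispatch (each of which would then contend on the scheduler
lock), while the per-hctx check limits the attempt to the one hctx with work.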

I have not seen any performance drop with this patch, but I will continue
further testing.

John - One more thing, I am working on the megaraid_sas driver to provide both
host_tagset = 1 and 0 options through a module parameter.

Kashyap

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke            Teamlead Storage & Networking
> hare@suse.de                               +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809
> (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 123+ messages in thread

* About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap)
  2020-06-12  6:06       ` Hannes Reinecke
@ 2020-06-29 15:32         ` John Garry
  2020-06-30  6:33           ` Hannes Reinecke
  2020-06-30 14:55           ` Bart Van Assche
  2020-07-13  9:41         ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap John Garry
  1 sibling, 2 replies; 123+ messages in thread
From: John Garry @ 2020-06-29 15:32 UTC (permalink / raw)
  To: axboe, linux-block
  Cc: Hannes Reinecke, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara, linux-scsi, esc.storagedev,
	chenxiang66, megaraidlinux.pdl

Hi all,

I noticed that sbitmap_bitmap_show() only shows set bits and does not 
consider cleared bits. Is that the proper thing to do?

I ask because, in trying to support sbitmap_bitmap_show() for the hostwide
shared tags feature, we currently use blk_mq_queue_tag_busy_iter() to
find active requests (and associated tags/bits) for a particular hctx.
So, AFAICT, this would give a change in behavior for sbitmap_bitmap_show(),
in that it would effectively show only bits that are set and not cleared.

Any thoughts on this?

Thanks!

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-24  8:13           ` Kashyap Desai
@ 2020-06-29 16:18             ` John Garry
  0 siblings, 0 replies; 123+ messages in thread
From: John Garry @ 2020-06-29 16:18 UTC (permalink / raw)
  To: Kashyap Desai, Hannes Reinecke, Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, Sumit Saxena,
	bvanassche, hare, hch, Shivasharan Srikanteshwara, linux-block,
	linux-scsi, esc.storagedev, chenxiang66

On 24/06/2020 09:13, Kashyap Desai wrote:
>>> Hi Hannes,
>>>
>>> Do you have any issue with splitting the undocumented changes into
>>> another patch as so:
>>>
>> No, that's perfectly fine.
>>
>> Kashyap, I've also attached an updated patch for the elevator_count patch;
>> if
>> you agree John can include it in the next version.

ok, but I need to check it myself.

> Hannes - Patch looks good.   Header does not include problem statement. How
> about adding below in header ?
> 
> High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
> contention is possible in mq-deadline and bfq io scheduler when nr_hw_queues
> is more than one.
> It is because kblockd work queue can submit IO from all online CPUs (through
> blk_mq_run_hw_queues) even though only one hctx has pending commands.
> Elevator callback "has_work" for mq-deadline and bfq scheduler consider
> pending work if there are any IOs on request queue and it does not account
> hctx context.
> 
> I have not seen performance drop after this patch, but I will continue
> further testing.
> 
> John - One more thing, I am working on megaraid_sas driver to provide both
> host_tagset = 1 and 0 option through module parameter.

I was hoping that we wouldn't have this, and have host_tagset = 1 
always. Or maybe host_tagset = 1 by default, and allow module param to 
set = 0. Your choice, though.

Thanks,
John


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap)
  2020-06-29 15:32         ` About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap) John Garry
@ 2020-06-30  6:33           ` Hannes Reinecke
  2020-06-30  7:30             ` John Garry
  2020-06-30 14:55           ` Bart Van Assche
  1 sibling, 1 reply; 123+ messages in thread
From: Hannes Reinecke @ 2020-06-30  6:33 UTC (permalink / raw)
  To: John Garry, axboe, linux-block
  Cc: jejb, martin.petersen, don.brace, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 6/29/20 5:32 PM, John Garry wrote:
> Hi all,
> 
> I noticed that sbitmap_bitmap_show() only shows set bits and does not 
> consider cleared bits. Is that the proper thing to do?
> 
> I ask, as from trying to support sbitmap_bitmap_show() for hostwide 
> shared tags feature, we currently use blk_mq_queue_tag_busy_iter() to 
> find active requests (and associated tags/bits) for a particular hctx. 
> So, AFAICT, would give a change in behavior for sbitmap_bitmap_show(), 
> in that it would effectively show set and not cleared bits.
> 
Why would you need to do this?
Where would be the point traversing cleared bits?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap)
  2020-06-30  6:33           ` Hannes Reinecke
@ 2020-06-30  7:30             ` John Garry
  2020-06-30 11:36               ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-06-30  7:30 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, linux-block
  Cc: jejb, martin.petersen, don.brace, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 30/06/2020 07:33, Hannes Reinecke wrote:
> On 6/29/20 5:32 PM, John Garry wrote:
>> Hi all,
>>
>> I noticed that sbitmap_bitmap_show() only shows set bits and does not 
>> consider cleared bits. Is that the proper thing to do?
>>
>> I ask, as from trying to support sbitmap_bitmap_show() for hostwide 
>> shared tags feature, we currently use blk_mq_queue_tag_busy_iter() to 
>> find active requests (and associated tags/bits) for a particular hctx. 
>> So, AFAICT, would give a change in behavior for sbitmap_bitmap_show(), 
>> in that it would effectively show set and not cleared bits.
>>
> Why would you need to do this?
> Where would be the point traversing cleared bits?

I'm not talking about traversing cleared bits specifically. I just think
that sbitmap_bitmap_show() today showing only the bits in
sbitmap_word.word may not be useful, or may even be misleading, as in
reality the "set" bits are sbitmap_word.word & ~sbitmap_word.cleared.

And for the hostwide shared tags feature, iterating the busy rqs to find the
per-hctx tags/bits would effectively give us the "set" bits, above, so
there would be a difference.

Thanks,
John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap)
  2020-06-30  7:30             ` John Garry
@ 2020-06-30 11:36               ` John Garry
  0 siblings, 0 replies; 123+ messages in thread
From: John Garry @ 2020-06-30 11:36 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, linux-block
  Cc: jejb, martin.petersen, don.brace, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 30/06/2020 08:30, John Garry wrote:
> On 30/06/2020 07:33, Hannes Reinecke wrote:
>> On 6/29/20 5:32 PM, John Garry wrote:
>>> Hi all,
>>>
>>> I noticed that sbitmap_bitmap_show() only shows set bits and does not 
>>> consider cleared bits. Is that the proper thing to do?
>>>
>>> I ask, as from trying to support sbitmap_bitmap_show() for hostwide 
>>> shared tags feature, we currently use blk_mq_queue_tag_busy_iter() to 
>>> find active requests (and associated tags/bits) for a particular 
>>> hctx. So, AFAICT, would give a change in behavior for 
>>> sbitmap_bitmap_show(), in that it would effectively show set and not 
>>> cleared bits.
>>>
>> Why would you need to do this?
>> Where would be the point traversing cleared bits?
> 
> I'm not talking about traversing cleared bits specifically. I just think 
> that today sbitmap_bitmap_show() only showing the bits in 
> sbitmap_word.word may not be useful or even misleading, as in reality 
> the "set" bits are sbitmap_word.word & ~sbitmap_word.cleared.
> 
> And for hostwide shared tags feature, iterating the busy rqs to find the 
> per-hctx tags/bits would effectively give us the "set" bits, above, so 
> there would be a difference.
> 

As an example, here's a sample tags_bitmap output:

00000000: 00f0 0fff 03c0 0000 0000 0000 efff fdff
00000010: 0000 c0f7 7fff ffff 0000 00e0 fef7 ffff
00000020: 0000 0000 f0ff ffff 0000 ffff 01d0 ffff
00000030: 0f80

And here's what we would have taking cleared bits into account:

00000000: 00f0 0fff 03c0 0000 0000 0000 0000 0000 (20 bits set)
00000010: 0000 0000 0000 0000 0000 0000 0000 0000
00000020: 0000 0000 0000 0000 0000 f8ff 0110 8000 (16 bits set)
00000030: 0f00					  (1 bit set)

Here's tags file output:

nr_tags=400
nr_reserved_tags=0
active_queues=2

bitmap_tags:
depth=400
busy=40
cleared=182
bits_per_word=64
map_nr=7
alloc_hint={22, 0, 0, 0, 103, 389, 231, 57, 377, 167, 0, 0, 69, 24, 44, 
50, 54,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
wake_batch=8
wake_index=0

[snip]

20+16+1=39 more closely matches with busy=40.

So it seems sensible to go this way whether or not hostwide tags are
used.

thanks,
John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap)
  2020-06-29 15:32         ` About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap) John Garry
  2020-06-30  6:33           ` Hannes Reinecke
@ 2020-06-30 14:55           ` Bart Van Assche
  1 sibling, 0 replies; 123+ messages in thread
From: Bart Van Assche @ 2020-06-30 14:55 UTC (permalink / raw)
  To: John Garry, axboe, linux-block
  Cc: Hannes Reinecke, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, hare, hch, shivasharan.srikanteshwara,
	linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 2020-06-29 08:32, John Garry wrote:
> I noticed that sbitmap_bitmap_show() only shows set bits and does not
> consider cleared bits. Is that the proper thing to do?
> 
> I ask, as from trying to support sbitmap_bitmap_show() for hostwide
> shared tags feature, we currently use blk_mq_queue_tag_busy_iter() to
> find active requests (and associated tags/bits) for a particular hctx.
> So, AFAICT, would give a change in behavior for sbitmap_bitmap_show(),
> in that it would effectively show set and not cleared bits.
> 
> Any thoughts on this?

Probably this is something that got overlooked when the cleared bits
were introduced? See e.g. 8c2def893afc ("sbitmap: fix
sbitmap_for_each_set()").

Bart.


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-06-10 17:29 ` [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters " John Garry
@ 2020-07-02 10:23   ` Kashyap Desai
  2020-07-06  8:23     ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-07-02 10:23 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

>
> From: Hannes Reinecke <hare@suse.com>
>
> Fusion adapters can steer completions to individual queues, and we now have
> support for shared host-wide tags.
> So we can enable multiqueue support for fusion adapters and drop the
> hand-crafted interrupt affinity settings.

The shared host tagset was primarily introduced to complete CPU hotplug
support, as discussed earlier:
https://lwn.net/Articles/819419/

How should I test CPU hotplug with the megaraid_sas driver? My understanding
is that this RFC plus the patch set from the above link is required for it. I
could not see that the above series has been committed.
Am I missing anything?

We do not want to move completely to the shared host tagset. Shared host
tagset support will be the default, but the user should have the option of
going back to the legacy path.
We will move completely to the shared host tagset path once it is stable and
no more field issues have been observed over a period of time.

The updated megaraid_sas patch will look like this:

diff --git a/megaraid_sas_base.c b/megaraid_sas_base.c
index 0066833..3b503cb 100644
--- a/megaraid_sas_base.c
+++ b/megaraid_sas_base.c
@@ -37,6 +37,7 @@
 #include <linux/poll.h>
 #include <linux/vmalloc.h>
 #include <linux/irq_poll.h>
+#include <linux/blk-mq-pci.h>

 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -113,6 +114,10 @@ unsigned int enable_sdev_max_qd;
 module_param(enable_sdev_max_qd, int, 0444);
 MODULE_PARM_DESC(enable_sdev_max_qd, "Enable sdev max qd as can_queue. Default: 0");

+int host_tagset_disabled = 0;
+module_param(host_tagset_disabled, int, 0444);
+MODULE_PARM_DESC(host_tagset_disabled, "Shared host tagset enable/disable Default: enable(1)");
+
 MODULE_LICENSE("GPL");
 MODULE_VERSION(MEGASAS_VERSION);
 MODULE_AUTHOR("megaraidlinux.pdl@broadcom.com");
@@ -3115,6 +3120,18 @@ megasas_bios_param(struct scsi_device *sdev, struct block_device *bdev,
        return 0;
 }

+static int megasas_map_queues(struct Scsi_Host *shost)
+{
+       struct megasas_instance *instance;
+       instance = (struct megasas_instance *)shost->hostdata;
+
+       if (instance->host->nr_hw_queues == 1)
+               return 0;
+
+       return blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
+                       instance->pdev, instance->low_latency_index_start);
+}
+
 static void megasas_aen_polling(struct work_struct *work);

 /**
@@ -3423,8 +3440,10 @@ static struct scsi_host_template megasas_template = {
        .eh_timed_out = megasas_reset_timer,
        .shost_attrs = megaraid_host_attrs,
        .bios_param = megasas_bios_param,
+       .map_queues = megasas_map_queues,
        .change_queue_depth = scsi_change_queue_depth,
        .max_segment_size = 0xffffffff,
+       .host_tagset = 1,
 };

 /**
@@ -6793,7 +6812,21 @@ static int megasas_io_attach(struct megasas_instance *instance)
        host->max_id = MEGASAS_MAX_DEV_PER_CHANNEL;
        host->max_lun = MEGASAS_MAX_LUN;
        host->max_cmd_len = 16;
+       host->nr_hw_queues = 1;

+       /* Use shared host tagset only for fusion adaptors
+        * if there are more than one managed interrupts.
+        */
+       if ((instance->adapter_type != MFI_SERIES) &&
+               (instance->msix_vectors > 0) &&
+               !host_tagset_disabled &&
+               instance->smp_affinity_enable)
+               host->nr_hw_queues = instance->msix_vectors -
+                       instance->low_latency_index_start;
+
+       dev_info(&instance->pdev->dev, "Max firmware commands: %d"
+               " for nr_hw_queues = %d\n", instance->max_fw_cmds,
+               host->nr_hw_queues);
        /*
         * Notify the mid-layer about the new controller
         */
@@ -8842,6 +8875,7 @@ static int __init megasas_init(void)
                msix_vectors = 1;
                rdpq_enable = 0;
                dual_qdepth_disable = 1;
+               host_tagset_disabled = 1;
        }

        /*
diff --git a/megaraid_sas_fusion.c b/megaraid_sas_fusion.c
index 319f241..14d4f35 100755
--- a/megaraid_sas_fusion.c
+++ b/megaraid_sas_fusion.c
@@ -373,24 +373,28 @@ megasas_get_msix_index(struct megasas_instance *instance,
 {
        int sdev_busy;

-       /* nr_hw_queue = 1 for MegaRAID */
-       struct blk_mq_hw_ctx *hctx =
-               scmd->device->request_queue->queue_hw_ctx[0];
-
-       sdev_busy = atomic_read(&hctx->nr_active);
+       /* TBD - if sml remove device_busy in future, driver
+        * should track counter in internal structure.
+        */
+       sdev_busy = atomic_read(&scmd->device->device_busy);

        if (instance->perf_mode == MR_BALANCED_PERF_MODE &&
-           sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH))
+           sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH)) {
                cmd->request_desc->SCSIIO.MSIxIndex =
                        mega_mod64((atomic64_add_return(1, &instance->high_iops_outstanding) /
                                        MR_HIGH_IOPS_BATCH_COUNT), instance->low_latency_index_start);
-       else if (instance->msix_load_balance)
+       } else if (instance->msix_load_balance) {
                cmd->request_desc->SCSIIO.MSIxIndex =
                        (mega_mod64(atomic64_add_return(1, &instance->total_io_count),
                                instance->msix_vectors));
-       else
+       } else if (instance->host->nr_hw_queues > 1) {
+               u32 tag = blk_mq_unique_tag(scmd->request);
+               cmd->request_desc->SCSIIO.MSIxIndex = blk_mq_unique_tag_to_hwq(tag) +
+                       instance->low_latency_index_start;
+       } else {
                cmd->request_desc->SCSIIO.MSIxIndex =
                        instance->reply_map[raw_smp_processor_id()];
+       }
 }

 /**
@@ -970,9 +974,6 @@ megasas_alloc_cmds_fusion(struct megasas_instance *instance)
        if (megasas_alloc_cmdlist_fusion(instance))
                goto fail_exit;

-       dev_info(&instance->pdev->dev, "Configured max firmware commands: %d\n",
-                instance->max_fw_cmds);
-
        /* The first 256 bytes (SMID 0) is not used. Don't add to the cmd list */
        io_req_base = fusion->io_request_frames + MEGA_MPI2_RAID_DEFAULT_IO_FRAME_SIZE;
        io_req_base_phys = fusion->io_request_frames_phys + MEGA_MPI2_RAID_DEFAULT_IO_FRAME_SIZE;

Kashyap

>
> Signed-off-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  drivers/scsi/megaraid/megaraid_sas.h        |  1 -
>  drivers/scsi/megaraid/megaraid_sas_base.c   | 59 +++++++--------------
>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 24 +++++----
>  3 files changed, 32 insertions(+), 52 deletions(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas.h
> b/drivers/scsi/megaraid/megaraid_sas.h
> index af2c7a2a9565..b27a34a5f5de 100644
> --- a/drivers/scsi/megaraid/megaraid_sas.h
> +++ b/drivers/scsi/megaraid/megaraid_sas.h
> @@ -2261,7 +2261,6 @@ enum MR_PERF_MODE {
>
>  struct megasas_instance {
>
> -	unsigned int *reply_map;
>  	__le32 *producer;
>  	dma_addr_t producer_h;
>  	__le32 *consumer;
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> index 00668335c2af..e6bb2a64d51c 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -37,6 +37,7 @@
>  #include <linux/poll.h>
>  #include <linux/vmalloc.h>
>  #include <linux/irq_poll.h>
> +#include <linux/blk-mq-pci.h>
>
>  #include <scsi/scsi.h>
>  #include <scsi/scsi_cmnd.h>
> @@ -3115,6 +3116,19 @@ megasas_bios_param(struct scsi_device *sdev,
> struct block_device *bdev,
>  	return 0;
>  }
>
> +static int megasas_map_queues(struct Scsi_Host *shost) {
> +	struct megasas_instance *instance;
> +
> +	instance = (struct megasas_instance *)shost->hostdata;
> +
> +	if (!instance->smp_affinity_enable)
> +		return 0;
> +
> +	return blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
> +			instance->pdev, instance->low_latency_index_start);
> +}
> +
>  static void megasas_aen_polling(struct work_struct *work);
>
>  /**
> @@ -3423,8 +3437,10 @@ static struct scsi_host_template
> megasas_template = {
>  	.eh_timed_out = megasas_reset_timer,
>  	.shost_attrs = megaraid_host_attrs,
>  	.bios_param = megasas_bios_param,
> +	.map_queues = megasas_map_queues,
>  	.change_queue_depth = scsi_change_queue_depth,
>  	.max_segment_size = 0xffffffff,
> +	.host_tagset = 1,
>  };
>
>  /**
> @@ -5708,34 +5724,6 @@ megasas_setup_jbod_map(struct
> megasas_instance *instance)
>  		instance->use_seqnum_jbod_fp = false;  }
>
> -static void megasas_setup_reply_map(struct megasas_instance *instance)
> -{
> -	const struct cpumask *mask;
> -	unsigned int queue, cpu, low_latency_index_start;
> -
> -	low_latency_index_start = instance->low_latency_index_start;
> -
> -	for (queue = low_latency_index_start; queue < instance->msix_vectors; queue++) {
> -		mask = pci_irq_get_affinity(instance->pdev, queue);
> -		if (!mask)
> -			goto fallback;
> -
> -		for_each_cpu(cpu, mask)
> -			instance->reply_map[cpu] = queue;
> -	}
> -	return;
> -
> -fallback:
> -	queue = low_latency_index_start;
> -	for_each_possible_cpu(cpu) {
> -		instance->reply_map[cpu] = queue;
> -		if (queue == (instance->msix_vectors - 1))
> -			queue = low_latency_index_start;
> -		else
> -			queue++;
> -	}
> -}
> -
>  /**
>   * megasas_get_device_list -	Get the PD and LD device list from FW.
>   * @instance:			Adapter soft state
> @@ -6158,8 +6146,6 @@ static int megasas_init_fw(struct megasas_instance
> *instance)
>  			goto fail_init_adapter;
>  	}
>
> -	megasas_setup_reply_map(instance);
> -
>  	dev_info(&instance->pdev->dev,
>  		"current msix/online cpus\t: (%d/%d)\n",
>  		instance->msix_vectors, (unsigned int)num_online_cpus());
> @@ -6793,6 +6779,9 @@ static int megasas_io_attach(struct
> megasas_instance *instance)
>  	host->max_id = MEGASAS_MAX_DEV_PER_CHANNEL;
>  	host->max_lun = MEGASAS_MAX_LUN;
>  	host->max_cmd_len = 16;
> +	if (instance->adapter_type != MFI_SERIES && instance->msix_vectors
> > 0)
> +		host->nr_hw_queues = instance->msix_vectors -
> +			instance->low_latency_index_start;
>
>  	/*
>  	 * Notify the mid-layer about the new controller @@ -6960,11
> +6949,6 @@ static inline int megasas_alloc_mfi_ctrl_mem(struct
> megasas_instance *instance)
>   */
>  static int megasas_alloc_ctrl_mem(struct megasas_instance *instance)  {
> -	instance->reply_map = kcalloc(nr_cpu_ids, sizeof(unsigned int),
> -				      GFP_KERNEL);
> -	if (!instance->reply_map)
> -		return -ENOMEM;
> -
>  	switch (instance->adapter_type) {
>  	case MFI_SERIES:
>  		if (megasas_alloc_mfi_ctrl_mem(instance))
> @@ -6981,8 +6965,6 @@ static int megasas_alloc_ctrl_mem(struct
> megasas_instance *instance)
>
>  	return 0;
>   fail:
> -	kfree(instance->reply_map);
> -	instance->reply_map = NULL;
>  	return -ENOMEM;
>  }
>
> @@ -6995,7 +6977,6 @@ static int megasas_alloc_ctrl_mem(struct
> megasas_instance *instance)
>   */
>  static inline void megasas_free_ctrl_mem(struct megasas_instance
> *instance)  {
> -	kfree(instance->reply_map);
>  	if (instance->adapter_type == MFI_SERIES) {
>  		if (instance->producer)
>  			dma_free_coherent(&instance->pdev->dev,
> sizeof(u32), @@ -7683,8 +7664,6 @@ megasas_resume(struct pci_dev
> *pdev)
>  			goto fail_reenable_msix;
>  	}
>
> -	megasas_setup_reply_map(instance);
> -
>  	if (instance->adapter_type != MFI_SERIES) {
>  		megasas_reset_reply_desc(instance);
>  		if (megasas_ioc_init_fusion(instance)) { diff --git
> a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> index 319f241da4b6..8e25b700988e 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> @@ -373,24 +373,24 @@ megasas_get_msix_index(struct megasas_instance
> *instance,  {
>  	int sdev_busy;
>
> -	/* nr_hw_queue = 1 for MegaRAID */
> -	struct blk_mq_hw_ctx *hctx =
> -		scmd->device->request_queue->queue_hw_ctx[0];
> +	struct blk_mq_hw_ctx *hctx = scmd->request->mq_hctx;
>
>  	sdev_busy = atomic_read(&hctx->nr_active);
>
>  	if (instance->perf_mode == MR_BALANCED_PERF_MODE &&
> -	    sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH))
> +	    sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH)) {
>  		cmd->request_desc->SCSIIO.MSIxIndex =
>  			mega_mod64((atomic64_add_return(1, &instance-
> >high_iops_outstanding) /
>  					MR_HIGH_IOPS_BATCH_COUNT),
> instance->low_latency_index_start);
> -	else if (instance->msix_load_balance)
> +	} else if (instance->msix_load_balance) {
>  		cmd->request_desc->SCSIIO.MSIxIndex =
>  			(mega_mod64(atomic64_add_return(1, &instance-
> >total_io_count),
>  				instance->msix_vectors));
> -	else
> -		cmd->request_desc->SCSIIO.MSIxIndex =
> -			instance->reply_map[raw_smp_processor_id()];
> +	} else {
> +		u32 tag = blk_mq_unique_tag(scmd->request);
> +
> +		cmd->request_desc->SCSIIO.MSIxIndex = blk_mq_unique_tag_to_hwq(tag) + instance->low_latency_index_start;
> +	}
>  }
>
>  /**
> @@ -3326,7 +3326,7 @@ megasas_build_and_issue_cmd_fusion(struct
> megasas_instance *instance,  {
>  	struct megasas_cmd_fusion *cmd, *r1_cmd = NULL;
>  	union MEGASAS_REQUEST_DESCRIPTOR_UNION *req_desc;
> -	u32 index;
> +	u32 index, blk_tag, unique_tag;
>
>  	if ((megasas_cmd_type(scmd) == READ_WRITE_LDIO) &&
>  		instance->ldio_threshold &&
> @@ -3342,7 +3342,9 @@ megasas_build_and_issue_cmd_fusion(struct
> megasas_instance *instance,
>  		return SCSI_MLQUEUE_HOST_BUSY;
>  	}
>
> -	cmd = megasas_get_cmd_fusion(instance, scmd->request->tag);
> +	unique_tag = blk_mq_unique_tag(scmd->request);
> +	blk_tag = blk_mq_unique_tag_to_tag(unique_tag);
> +	cmd = megasas_get_cmd_fusion(instance, blk_tag);
>
>  	if (!cmd) {
>  		atomic_dec(&instance->fw_outstanding);
> @@ -3383,7 +3385,7 @@ megasas_build_and_issue_cmd_fusion(struct
> megasas_instance *instance,
>  	 */
>  	if (cmd->r1_alt_dev_handle != MR_DEVHANDLE_INVALID) {
>  		r1_cmd = megasas_get_cmd_fusion(instance,
> -				(scmd->request->tag + instance->max_fw_cmds));
> +				(blk_tag + instance->max_fw_cmds));
>  		megasas_prepare_secondRaid1_IO(instance, cmd, r1_cmd);
>  	}
>
> --
> 2.26.2

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-02 10:23   ` Kashyap Desai
@ 2020-07-06  8:23     ` John Garry
  2020-07-06  8:45       ` Hannes Reinecke
  2020-07-06 19:19       ` Kashyap Desai
  0 siblings, 2 replies; 123+ messages in thread
From: John Garry @ 2020-07-06  8:23 UTC (permalink / raw)
  To: Kashyap Desai, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

On 02/07/2020 11:23, Kashyap Desai wrote:
>>
>> From: Hannes Reinecke <hare@suse.com>
>>
>> Fusion adapters can steer completions to individual queues, and we now
> have
>> support for shared host-wide tags.
>> So we can enable multiqueue support for fusion adapters and drop the
> hand-
>> crafted interrupt affinity settings.
> 
> Shared host tag is primarily introduced for completeness of CPU hotplug as
> discussed earlier -
> https://lwn.net/Articles/819419/
> 
> How shall I test CPU hotplug on megaraid_sas driver ?

I have scripts like this:

----8<-----

# hotplug.sh
# enable all cpus in the system
./enable_all.sh

for((i = 0; i < 50 ; i++))
do
echo "Looping ... number $i"
# run fio on all cpus with 40 second runtime
./create_fio_task_cpu.sh 4k read 2048 1&
echo "short sleep, then disable"
sleep 5
# disable some set of cpus which means managed interrupts get shutdown
# like cpu1-50 from 0-63
./disable_all.sh
echo "long sleep $i"
sleep 50
echo "long sleep over number $i"
./enable_all.sh
sleep 3
done

----->8-----

# enable_all.sh
for((i=0; i<63; i++))
do
echo 1 > /sys/devices/system/cpu/cpu$i/online
done

--->8----

I hope to add such a test to blktests when I get a chance.

> My understanding is
> - This RFC + patch set from above link is required for it. I could not see
> above series is committed.

It is committed and part of 5.8-rc1

The latest rc should have some scheduler fixes also.

I also note that there has been much churn on blk-mq tag code lately, 
and something may be broken, so I plan to verify latest rc myself soon.

> Am I missing anything. ?

You could also add this from Hannes (and add megaraid sas support):

https://lore.kernel.org/linux-scsi/20200629072021.9864-1-hare@suse.de/T/#t

That is, if it is required. I am not sure if megaraid sas uses
"internal" commands which need to be guarded against cpu hotplug. Nor
would any of these commands be used during a test. For hisi_sas testing,
I did not bother adding support, and I guess that you don't need to either.

> 
> We do not want to completely move to shared host tag. It will be shared
> host tag support by default, but user should have choice to go back to
> legacy path.
> We will completely move to shared host tag path once it is stable and no
> more field issue observed over a period of time. -
> 
> Updated <megaraid_sas> patch will looks like this -
> 
> diff --git a/megaraid_sas_base.c b/megaraid_sas_base.c
> index 0066833..3b503cb 100644
> --- a/megaraid_sas_base.c
> +++ b/megaraid_sas_base.c
> @@ -37,6 +37,7 @@
>   #include <linux/poll.h>
>   #include <linux/vmalloc.h>
>   #include <linux/irq_poll.h>
> +#include <linux/blk-mq-pci.h>
> 
>   #include <scsi/scsi.h>
>   #include <scsi/scsi_cmnd.h>
> @@ -113,6 +114,10 @@ unsigned int enable_sdev_max_qd;
>   module_param(enable_sdev_max_qd, int, 0444);
>   MODULE_PARM_DESC(enable_sdev_max_qd, "Enable sdev max qd as can_queue.
> Default: 0");
> 
> +int host_tagset_disabled = 0;
> +module_param(host_tagset_disabled, int, 0444);
> +MODULE_PARM_DESC(host_tagset_disabled, "Shared host tagset enable/disable
> Default: enable(1)");

The logic seems inverted here: for passing 1, I would expect Shared host 
tagset enabled, while it actually means to disable, right?

> +
>   MODULE_LICENSE("GPL");
>   MODULE_VERSION(MEGASAS_VERSION);
>   MODULE_AUTHOR("megaraidlinux.pdl@broadcom.com");
> @@ -3115,6 +3120,18 @@ megasas_bios_param(struct scsi_device *sdev, struct
> block_device *bdev,
>          return 0;
>   }
> 
> +static int megasas_map_queues(struct Scsi_Host *shost)
> +{
> +       struct megasas_instance *instance;
> +       instance = (struct megasas_instance *)shost->hostdata;
> +
> +       if (instance->host->nr_hw_queues == 1)
> +               return 0;
> +
> +       return
> blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
> +                       instance->pdev,
> instance->low_latency_index_start);
> +}
> +
>   static void megasas_aen_polling(struct work_struct *work);
> 
>   /**
> @@ -3423,8 +3440,10 @@ static struct scsi_host_template megasas_template =
> {
>          .eh_timed_out = megasas_reset_timer,
>          .shost_attrs = megaraid_host_attrs,
>          .bios_param = megasas_bios_param,
> +       .map_queues = megasas_map_queues,
>          .change_queue_depth = scsi_change_queue_depth,
>          .max_segment_size = 0xffffffff,
> +       .host_tagset = 1,

Is your intention to always have this set for Scsi_Host, and just change 
nr_hw_queues?

>   };
> 
>   /**
> @@ -6793,7 +6812,21 @@ static int megasas_io_attach(struct
> megasas_instance *instance)
>          host->max_id = MEGASAS_MAX_DEV_PER_CHANNEL;
>          host->max_lun = MEGASAS_MAX_LUN;
>          host->max_cmd_len = 16;
> +       host->nr_hw_queues = 1;
> 
> +       /* Use shared host tagset only for fusion adaptors
> +        * if there are more than one managed interrupts.
> +        */
> +       if ((instance->adapter_type != MFI_SERIES) &&
> +               (instance->msix_vectors > 0) &&
> +               !host_tagset_disabled &&
> +               instance->smp_affinity_enable)
> +               host->nr_hw_queues = instance->msix_vectors -
> +                       instance->low_latency_index_start;
> +
> +       dev_info(&instance->pdev->dev, "Max firmware commands: %d"
> +               " for nr_hw_queues = %d\n", instance->max_fw_cmds,
> +               host->nr_hw_queues);

note: it may be good for us to add a nr_hw_queues file to scsi host 
sysfs folder

>          /*
>           * Notify the mid-layer about the new controller
>           */
> @@ -8842,6 +8875,7 @@ static int __init megasas_init(void)
>                  msix_vectors = 1;
>                  rdpq_enable = 0;
>                  dual_qdepth_disable = 1;
> +               host_tagset_disabled = 1;
>          }
> 

Thanks,
John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-06  8:23     ` John Garry
@ 2020-07-06  8:45       ` Hannes Reinecke
  2020-07-06  9:26         ` John Garry
  2020-07-06 19:19       ` Kashyap Desai
  1 sibling, 1 reply; 123+ messages in thread
From: Hannes Reinecke @ 2020-07-06  8:45 UTC (permalink / raw)
  To: John Garry, Kashyap Desai, axboe, jejb, martin.petersen,
	don.brace, Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

On 7/6/20 10:23 AM, John Garry wrote:
> On 02/07/2020 11:23, Kashyap Desai wrote:
[ .. ]
>> My understanding is
>> - This RFC + patch set from above link is required for it. I could not
>> see
>> above series is committed.
> 
> It is committed and part of 5.8-rc1
> 
> The latest rc should have some scheduler fixes also.
> 
> I also note that there has been much churn on blk-mq tag code lately,
> and something may be broken, so I plan to verify latest rc myself soon.
> 
>> Am I missing anything. ?
> 
> You could also add this from Hannes (and add megaraid sas support):
> 
> https://lore.kernel.org/linux-scsi/20200629072021.9864-1-hare@suse.de/T/#t
> 
> That is, if it is required. I am not sure if megaraid sas uses
> "internal" commands which needs to be guarded against cpu hotplug. Nor
> would any of these commands be used during a test. For hisi_sas testing,
> I did not bother adding support, and I guess that you don't need to either.
> 
Oh, it certainly uses internal commands, most notably to set up the
queue mapping :-)

The idea of the 'internal command' patchset is to let the block-layer
control _all_ tags, be it for internal or 'normal' I/O.
With that we can do away with all the tag mapping etc these drivers have
to do (cf the hpsa conversion for an example).
Only then can we safely use the blk layer busy iterators, knowing
that all tags are accounted for and that a 1:1 mapping between tags and
internal hardware resources exists.
Originally I thought it would help for CPU hotplug, too, but typically
the internal commands are not bound to any specific CPU, so they
typically will not be accounted for when looking at the CPU-related resources.
But that depends on the driver etc, so it's hard to give a guideline.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-06  8:45       ` Hannes Reinecke
@ 2020-07-06  9:26         ` John Garry
  2020-07-06  9:40           ` Hannes Reinecke
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-07-06  9:26 UTC (permalink / raw)
  To: Hannes Reinecke, Kashyap Desai, axboe, jejb, martin.petersen,
	don.brace, Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

On 06/07/2020 09:45, Hannes Reinecke wrote:
> Originally I thought it would help for CPU hotplug, too, but typically
> the internal commands are not bound to any specific CPU, 

When we allocate the request in scsi_get_internal_cmd() ->
blk_mq_alloc_request() -> __blk_mq_alloc_request(), the request will
have an associated hctx.

As such, I would expect the LLDD to honor this, in that it should use 
the hwq associated with the hctx to send/receive the command.

And from that, the hwq managed interrupt should not be shut down until 
the queue is drained, including internal commands.

Is there something wrong with this idea?
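
A minimal user-space sketch of the idea above, assuming the LLDD derives
its hardware queue from the hctx that the request was allocated on. It
models the encoding used by blk_mq_unique_tag(), which packs the hctx
index into the upper 16 bits and the per-queue tag into the lower 16
bits. All model_* names are hypothetical stand-ins, not driver code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same split as the kernel's BLK_MQ_UNIQUE_TAG_BITS: 16 bits for the tag. */
#define MODEL_UNIQUE_TAG_BITS 16
#define MODEL_UNIQUE_TAG_MASK ((1u << MODEL_UNIQUE_TAG_BITS) - 1)

/* Hypothetical stand-in for a request: only what the sketch needs. */
struct model_request {
	uint16_t hwq;	/* index of the hctx the request was allocated on */
	uint16_t tag;	/* tag within that hctx */
};

/* Pack hctx index and tag into one value, as blk_mq_unique_tag() does. */
uint32_t model_unique_tag(const struct model_request *rq)
{
	assert(rq != NULL);
	return ((uint32_t)rq->hwq << MODEL_UNIQUE_TAG_BITS) |
	       (rq->tag & MODEL_UNIQUE_TAG_MASK);
}

/* Recover the hctx index from the packed value. */
uint16_t model_unique_tag_to_hwq(uint32_t unique_tag)
{
	return (uint16_t)(unique_tag >> MODEL_UNIQUE_TAG_BITS);
}

/* Recover the per-queue tag from the packed value. */
uint16_t model_unique_tag_to_tag(uint32_t unique_tag)
{
	return (uint16_t)(unique_tag & MODEL_UNIQUE_TAG_MASK);
}

/* The LLDD would submit and complete on the queue matching the hctx. */
uint16_t model_pick_hw_queue(const struct model_request *rq)
{
	return model_unique_tag_to_hwq(model_unique_tag(rq));
}
```

If internal commands go through the same path, they would land on the
same hardware queue as normal I/O allocated on that hctx, which is what
lets the drain cover them.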

Thanks,
John

> so they
> typically will not accounted for when looking at the CPU-related resources.
> But that depends on the driver etc, so it's hard to give a guideline.


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-06  9:26         ` John Garry
@ 2020-07-06  9:40           ` Hannes Reinecke
  0 siblings, 0 replies; 123+ messages in thread
From: Hannes Reinecke @ 2020-07-06  9:40 UTC (permalink / raw)
  To: John Garry, Kashyap Desai, axboe, jejb, martin.petersen,
	don.brace, Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

On 7/6/20 11:26 AM, John Garry wrote:
> On 06/07/2020 09:45, Hannes Reinecke wrote:
>> Originally I thought it would help for CPU hotplug, too, but typically
>> the internal commands are not bound to any specific CPU, 
> 
> When we alloc the request in scsi_get_internal_cmd() - >
> blk_mq_alloc_request() -> __blk_mq_alloc_request(), the request will
> have an associated hctx.
> 
> As such, I would expect the LLDD to honor this, in that it should use
> the hwq associated with the hctx to send/receive the command.
> 
> And from that, the hwq managed interrupt should not be shut down until
> the queue is drained, including internal commands.
> 
> Is there something wrong with this idea?
> 
Oh, no, not at all.

What I'm referring to are driver internals; some drivers allow (or set)
the MSIx vector only for 'normal' I/O; internal commands don't have an
MSIx vector set.
As such it's pretty much driver-dependent _where_ they end up, be it on
the same CPU, or on any CPU that might be listening.
Be that as it may, when waiting for hctx to drain we are pretty safe, and
hence CPU hotplug will be improved here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-06  8:23     ` John Garry
  2020-07-06  8:45       ` Hannes Reinecke
@ 2020-07-06 19:19       ` Kashyap Desai
  2020-07-07  7:58         ` John Garry
  2020-07-08 11:31         ` John Garry
  1 sibling, 2 replies; 123+ messages in thread
From: Kashyap Desai @ 2020-07-06 19:19 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

>
> On 02/07/2020 11:23, Kashyap Desai wrote:
> >>
> >> From: Hannes Reinecke <hare@suse.com>
> >>
> >> Fusion adapters can steer completions to individual queues, and we
> >> now
> > have
> >> support for shared host-wide tags.
> >> So we can enable multiqueue support for fusion adapters and drop the
> > hand-
> >> crafted interrupt affinity settings.
> >
> > Shared host tag is primarily introduced for completeness of CPU
> > hotplug as discussed earlier - https://lwn.net/Articles/819419/
> >
> > How shall I test CPU hotplug on megaraid_sas driver ?
>
> I have scripts like this:
>
> ----8<-----
>
> # hotplug.sh
> # enable all cpus in the system
> ./enable_all.sh
>
> for((i = 0; i < 50 ; i++))
> do
> echo "Looping ... number $i"
> # run fio on all cpus with 40 second runtime
> ./create_fio_task_cpu.sh 4k read 2048 1&
> echo "short sleep, then disable"
> sleep 5
> # disable some set of cpus which means managed interrupts get shutdown
> # like cpu1-50 from 0-63
> ./disable_all.sh
> echo "long sleep $i"
> sleep 50
> echo "long sleep over number $i"
> ./enable_all.sh
> sleep 3
> done
>
> ----->8-----
>
> # enable_all.sh
> for((i=0; i<63; i++))
> do
> echo 1 > /sys/devices/system/cpu/cpu$i/online
> done
>
> --->8----
>
> I hope to add such a test to blktests when I get a chance.
>
> > My understanding is
> > - This RFC + patch set from above link is required for it. I could not
> > see above series is committed.
>
> It is committed and part of 5.8-rc1
>
> The latest rc should have some scheduler fixes also.
>
> I also note that there has been much churn on blk-mq tag code lately, and
> something may be broken, so I plan to verify latest rc myself soon.

Thanks. I will try merging 5.8-rc1 and RFC and see how CPU hot plug works.

>
> > Am I missing anything. ?
>
> You could also add this from Hannes (and add megaraid sas support):
>
> https://lore.kernel.org/linux-scsi/20200629072021.9864-1-hare@suse.de/T/#t
>
> That is, if it is required. I am not sure if megaraid sas uses "internal"
> commands which needs to be guarded against cpu hotplug. Nor would any of
> these commands be used during a test. For hisi_sas testing, I did not
> bother
> adding support, and I guess that you don't need to either.

The megaraid_sas driver uses internal commands, but they are excluded from
can_queue. All internal commands are mapped to msix index 0, which is
non-managed. So we are good w.r.t. internal commands.

>
> >
> > We do not want to completely move to shared host tag. It will be shared
> > host tag support by default, but user should have choice to go back to
> > legacy path.
> > We will completely move to shared host tag path once it is stable and no
> > more field issue observed over a period of time. -
> >
> > Updated <megaraid_sas> patch will looks like this -
> >
> > diff --git a/megaraid_sas_base.c b/megaraid_sas_base.c
> > index 0066833..3b503cb 100644
> > --- a/megaraid_sas_base.c
> > +++ b/megaraid_sas_base.c
> > @@ -37,6 +37,7 @@
> >   #include <linux/poll.h>
> >   #include <linux/vmalloc.h>
> >   #include <linux/irq_poll.h>
> > +#include <linux/blk-mq-pci.h>
> >
> >   #include <scsi/scsi.h>
> >   #include <scsi/scsi_cmnd.h>
> > @@ -113,6 +114,10 @@ unsigned int enable_sdev_max_qd;
> >   module_param(enable_sdev_max_qd, int, 0444);
> >   MODULE_PARM_DESC(enable_sdev_max_qd, "Enable sdev max qd as
> can_queue.
> > Default: 0");
> >
> > +int host_tagset_disabled = 0;
> > +module_param(host_tagset_disabled, int, 0444);
> > +MODULE_PARM_DESC(host_tagset_disabled, "Shared host tagset
> enable/disable
> > Default: enable(1)");
>
> The logic seems inverted here: for passing 1, I would expect Shared host
> tagset enabled, while it actually means to disable, right?

No. passing 1 means shared_hosttag support will be turned off.
>
> > +
> >   MODULE_LICENSE("GPL");
> >   MODULE_VERSION(MEGASAS_VERSION);
> >   MODULE_AUTHOR("megaraidlinux.pdl@broadcom.com");
> > @@ -3115,6 +3120,18 @@ megasas_bios_param(struct scsi_device *sdev,
> struct
> > block_device *bdev,
> >          return 0;
> >   }
> >
> > +static int megasas_map_queues(struct Scsi_Host *shost)
> > +{
> > +       struct megasas_instance *instance;
> > +       instance = (struct megasas_instance *)shost->hostdata;
> > +
> > +       if (instance->host->nr_hw_queues == 1)
> > +               return 0;
> > +
> > +       return
> > blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
> > +                       instance->pdev,
> > instance->low_latency_index_start);
> > +}
> > +
> >   static void megasas_aen_polling(struct work_struct *work);
> >
> >   /**
> > @@ -3423,8 +3440,10 @@ static struct scsi_host_template
> megasas_template =
> > {
> >          .eh_timed_out = megasas_reset_timer,
> >          .shost_attrs = megaraid_host_attrs,
> >          .bios_param = megasas_bios_param,
> > +       .map_queues = megasas_map_queues,
> >          .change_queue_depth = scsi_change_queue_depth,
> >          .max_segment_size = 0xffffffff,
> > +       .host_tagset = 1,
>
> Is your intention to always have this set for Scsi_Host, and just change
> nr_hw_queues?

Actually, I wanted to turn off this feature using host_tagset and not
through nr_hw_queues. I will address this.

Additional request -
In MR we have old controllers (called MFI_SERIES). We prefer not to change
the behavior for those controllers.
Having host_tagset in the template does not allow us to cherry-pick
different values for different types of controller.
If host_tagset is part of Scsi_Host, OR if we add a check in scsi_lib.c that
host_tagset = 1 only makes sense when nr_hw_queues > 1, we can cherry-pick
in the driver.


>
> >   };
> >
> >   /**
> > @@ -6793,7 +6812,21 @@ static int megasas_io_attach(struct
> > megasas_instance *instance)
> >          host->max_id = MEGASAS_MAX_DEV_PER_CHANNEL;
> >          host->max_lun = MEGASAS_MAX_LUN;
> >          host->max_cmd_len = 16;
> > +       host->nr_hw_queues = 1;
> >
> > +       /* Use shared host tagset only for fusion adaptors
> > +        * if there are more than one managed interrupts.
> > +        */
> > +       if ((instance->adapter_type != MFI_SERIES) &&
> > +               (instance->msix_vectors > 0) &&
> > +               !host_tagset_disabled &&
> > +               instance->smp_affinity_enable)
> > +               host->nr_hw_queues = instance->msix_vectors -
> > +                       instance->low_latency_index_start;
> > +
> > +       dev_info(&instance->pdev->dev, "Max firmware commands: %d"
> > +               " for nr_hw_queues = %d\n", instance->max_fw_cmds,
> > +               host->nr_hw_queues);
>
> note: it may be good for us to add a nr_hw_queues file to scsi host
> sysfs folder

I will accommodate this.

>
> >          /*
> >           * Notify the mid-layer about the new controller
> >           */
> > @@ -8842,6 +8875,7 @@ static int __init megasas_init(void)
> >                  msix_vectors = 1;
> >                  rdpq_enable = 0;
> >                  dual_qdepth_disable = 1;
> > +               host_tagset_disabled = 1;
> >          }
> >
>
> Thanks,
> John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-06 19:19       ` Kashyap Desai
@ 2020-07-07  7:58         ` John Garry
  2020-07-07 14:45           ` Kashyap Desai
  2020-07-08 11:31         ` John Garry
  1 sibling, 1 reply; 123+ messages in thread
From: John Garry @ 2020-07-07  7:58 UTC (permalink / raw)
  To: Kashyap Desai, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

>>>
>>>    #include <scsi/scsi.h>
>>>    #include <scsi/scsi_cmnd.h>
>>> @@ -113,6 +114,10 @@ unsigned int enable_sdev_max_qd;
>>>    module_param(enable_sdev_max_qd, int, 0444);
>>>    MODULE_PARM_DESC(enable_sdev_max_qd, "Enable sdev max qd as
>> can_queue.
>>> Default: 0");
>>>
>>> +int host_tagset_disabled = 0;
>>> +module_param(host_tagset_disabled, int, 0444);
>>> +MODULE_PARM_DESC(host_tagset_disabled, "Shared host tagset
>> enable/disable
>>> Default: enable(1)");
>> The logic seems inverted here: for passing 1, I would expect Shared host
>> tagset enabled, while it actually means to disable, right?
> No. passing 1 means shared_hosttag support will be turned off.

Just reading "Shared host tagset enable/disable Default: enable(1)" 
looked inconsistent to me.

>>> +
>>>    MODULE_LICENSE("GPL");
>>>    MODULE_VERSION(MEGASAS_VERSION);
>>>    MODULE_AUTHOR("megaraidlinux.pdl@broadcom.com");
>>> @@ -3115,6 +3120,18 @@ megasas_bios_param(struct scsi_device *sdev,
>> struct
>>> block_device *bdev,
>>>           return 0;
>>>    }
>>>
>>> +static int megasas_map_queues(struct Scsi_Host *shost)
>>> +{
>>> +       struct megasas_instance *instance;
>>> +       instance = (struct megasas_instance *)shost->hostdata;
>>> +
>>> +       if (instance->host->nr_hw_queues == 1)
>>> +               return 0;
>>> +
>>> +       return
>>> blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
>>> +                       instance->pdev,
>>> instance->low_latency_index_start);
>>> +}
>>> +
>>>    static void megasas_aen_polling(struct work_struct *work);
>>>
>>>    /**
>>> @@ -3423,8 +3440,10 @@ static struct scsi_host_template
>> megasas_template =
>>> {
>>>           .eh_timed_out = megasas_reset_timer,
>>>           .shost_attrs = megaraid_host_attrs,
>>>           .bios_param = megasas_bios_param,
>>> +       .map_queues = megasas_map_queues,
>>>           .change_queue_depth = scsi_change_queue_depth,
>>>           .max_segment_size = 0xffffffff,
>>> +       .host_tagset = 1,
>> Is your intention to always have this set for Scsi_Host, and just change
>> nr_hw_queues?
> Actually I wanted to turn off  this feature using host_tagset and not
> through nr_hw_queue. I will address this.
> 
> Additional request -
> In MR we have old controllers (called MFI_SERIES). We prefer not to change
> behavior for those controller.
> Having host_tagset in template does not allow to cherry pick different
> values for different type of controller.

Ok, so it seems sensible to add host_tagset to Scsi_Host structure also, 
to allow overwriting during probe time.

If you want to share an updated megaraid sas driver patch based on that, 
then that's fine. I can incorporate that change in the patch where we 
add host_tagset to the scsi host template.
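
To illustrate the idea, here is a small user-space sketch, assuming the
per-host flag is seeded from the template default at allocation time and
may then be overwritten by the driver at probe time. The structures and
the model_* functions are hypothetical simplifications, not the actual
Scsi_Host or scsi_host_template definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical controller families, mirroring the MFI vs fusion split. */
enum model_adapter_type { MODEL_MFI_SERIES, MODEL_FUSION };

/* Simplified template: one default shared by every host of the driver. */
struct model_host_template {
	bool host_tagset;
};

/* Simplified host: carries its own copy of the flag. */
struct model_host {
	bool host_tagset;
};

/* Allocation seeds the per-host flag from the template default. */
void model_host_alloc(struct model_host *shost,
		      const struct model_host_template *sht)
{
	assert(shost != NULL && sht != NULL);
	shost->host_tagset = sht->host_tagset;
}

/* Probe may overwrite the flag, e.g. to leave old controllers untouched. */
void model_probe(struct model_host *shost, enum model_adapter_type type)
{
	assert(shost != NULL);
	if (type == MODEL_MFI_SERIES)
		shost->host_tagset = false;
}
```

With this shape a single template can serve both controller families,
and only the hosts that opt in end up with shared tags.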

> If host_tagset is part of Scsi_Host OR we add check in scsi_lib.c that
> host_tagset = 1 only make sense if nr_hw_queues > 1, we can cherry pick in
> driver.
> 
> 




^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-07  7:58         ` John Garry
@ 2020-07-07 14:45           ` Kashyap Desai
  2020-07-07 16:17             ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-07-07 14:45 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

> >>>
> >>>    #include <scsi/scsi.h>
> >>>    #include <scsi/scsi_cmnd.h>
> >>> @@ -113,6 +114,10 @@ unsigned int enable_sdev_max_qd;
> >>>    module_param(enable_sdev_max_qd, int, 0444);
> >>>    MODULE_PARM_DESC(enable_sdev_max_qd, "Enable sdev max qd as
> >> can_queue.
> >>> Default: 0");
> >>>
> >>> +int host_tagset_disabled = 0;
> >>> +module_param(host_tagset_disabled, int, 0444);
> >>> +MODULE_PARM_DESC(host_tagset_disabled, "Shared host tagset
> >> enable/disable
> >>> Default: enable(1)");
> >> The logic seems inverted here: for passing 1, I would expect Shared
> >> host tagset enabled, while it actually means to disable, right?
> > No. passing 1 means shared_hosttag support will be turned off.
>
> Just reading "Shared host tagset enable/disable Default: enable(1)"
> looked inconsistent to me.

I will change it to "host_tagset_enable", which will be better for
readability. The default value of host_tagset_enable will be 1, and the
user can turn it off by passing 0.
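
A rough user-space sketch of how the queue count could be picked with the
positively named parameter, following the condition in the quoted patch.
The struct and function names are hypothetical, and the guard against
msix_vectors being no larger than low_latency_index_start is an added
assumption, not something taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the per-adapter state used by the decision. */
struct model_instance {
	bool is_mfi_series;			/* old, non-fusion controller */
	bool smp_affinity_enable;		/* managed interrupts in use */
	unsigned int msix_vectors;		/* total MSI-x vectors */
	unsigned int low_latency_index_start;	/* first managed vector */
};

/* Returns the number of hw queues to expose; 1 means the legacy path. */
unsigned int model_nr_hw_queues(const struct model_instance *inst,
				bool host_tagset_enable)
{
	assert(inst != NULL);

	if (inst->is_mfi_series || !host_tagset_enable ||
	    !inst->smp_affinity_enable)
		return 1;

	/* Assumed guard: avoid unsigned underflow in the subtraction below. */
	if (inst->msix_vectors <= inst->low_latency_index_start)
		return 1;

	return inst->msix_vectors - inst->low_latency_index_start;
}
```

Passing host_tagset_enable = 1 then reads the same way as the parameter
description, which avoids the inverted meaning discussed above.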

>
> >>> +
> >>>    MODULE_LICENSE("GPL");
> >>>    MODULE_VERSION(MEGASAS_VERSION);
> >>>    MODULE_AUTHOR("megaraidlinux.pdl@broadcom.com");
> >>> @@ -3115,6 +3120,18 @@ megasas_bios_param(struct scsi_device
> *sdev,
> >> struct
> >>> block_device *bdev,
> >>>           return 0;
> >>>    }
> >>>
> >>> +static int megasas_map_queues(struct Scsi_Host *shost) {
> >>> +       struct megasas_instance *instance;
> >>> +       instance = (struct megasas_instance *)shost->hostdata;
> >>> +
> >>> +       if (instance->host->nr_hw_queues == 1)
> >>> +               return 0;
> >>> +
> >>> +       return
> >>> blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
> >>> +                       instance->pdev,
> >>> instance->low_latency_index_start);
> >>> +}
> >>> +
> >>>    static void megasas_aen_polling(struct work_struct *work);
> >>>
> >>>    /**
> >>> @@ -3423,8 +3440,10 @@ static struct scsi_host_template
> >> megasas_template =
> >>> {
> >>>           .eh_timed_out = megasas_reset_timer,
> >>>           .shost_attrs = megaraid_host_attrs,
> >>>           .bios_param = megasas_bios_param,
> >>> +       .map_queues = megasas_map_queues,
> >>>           .change_queue_depth = scsi_change_queue_depth,
> >>>           .max_segment_size = 0xffffffff,
> >>> +       .host_tagset = 1,
> >> Is your intention to always have this set for Scsi_Host, and just
> >> change nr_hw_queues?
> > Actually I wanted to turn off  this feature using host_tagset and not
> > through nr_hw_queue. I will address this.
> >
> > Additional request -
> > In MR we have old controllers (called MFI_SERIES). We prefer not to
> > change behavior for those controller.
> > Having host_tagset in template does not allow to cherry pick different
> > values for different type of controller.
>
> Ok, so it seems sensible to add host_tagset to Scsi_Host structure also,
> to
> allow overwriting during probe time.
>
> If you want to share an updated megaraid sas driver patch based on that,
> then
> that's fine. I can incorporate that change in the patch where we add
> host_tagset to the scsi host template.

If you share the git repo link of the next submission, I can send you a
megaraid_sas driver patch which you can include in the series.

Kashyap
>
> > If host_tagset is part of Scsi_Host OR we add check in scsi_lib.c that
> > host_tagset = 1 only make sense if nr_hw_queues > 1, we can cherry
> > pick in driver.
> >
> >
>
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-07 14:45           ` Kashyap Desai
@ 2020-07-07 16:17             ` John Garry
  2020-07-09 19:01               ` Kashyap Desai
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-07-07 16:17 UTC (permalink / raw)
  To: Kashyap Desai, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

On 07/07/2020 15:45, Kashyap Desai wrote:
>>>>>            .eh_timed_out = megasas_reset_timer,
>>>>>            .shost_attrs = megaraid_host_attrs,
>>>>>            .bios_param = megasas_bios_param,
>>>>> +       .map_queues = megasas_map_queues,
>>>>>            .change_queue_depth = scsi_change_queue_depth,
>>>>>            .max_segment_size = 0xffffffff,
>>>>> +       .host_tagset = 1,
>>>> Is your intention to always have this set for Scsi_Host, and just
>>>> change nr_hw_queues?
>>> Actually I wanted to turn off  this feature using host_tagset and not
>>> through nr_hw_queue. I will address this.
>>>
>>> Additional request -
>>> In MR we have old controllers (called MFI_SERIES). We prefer not to
>>> change behavior for those controller.
>>> Having host_tagset in template does not allow to cherry pick different
>>> values for different type of controller.
>> Ok, so it seems sensible to add host_tagset to Scsi_Host structure also,
>> to
>> allow overwriting during probe time.
>>
>> If you want to share an updated megaraid sas driver patch based on that,
>> then
>> that's fine. I can incorporate that change in the patch where we add
>> host_tagset to the scsi host template.
> If you share git repo link of next submission, I can send you megaraid_sas
> driver patch which you can include in series.

So this is my work-in-progress branch:

https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v8

I just updated it to include the change to have Scsi_Host.host_tagset in
commit 4291f617a02b ("scsi: Add host and host template flag 'host_tagset'")

megaraid sas support is not on the branch yet, but I think everything 
else required is. And it is mutable, so I'd clone it now if I were you - 
or just replace the required patch onto your v7 branch.

Thanks,
John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-06 19:19       ` Kashyap Desai
  2020-07-07  7:58         ` John Garry
@ 2020-07-08 11:31         ` John Garry
  1 sibling, 0 replies; 123+ messages in thread
From: John Garry @ 2020-07-08 11:31 UTC (permalink / raw)
  To: Kashyap Desai, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

On 06/07/2020 20:19, Kashyap Desai wrote:
>>> My understanding is
>>> - This RFC + patch set from above link is required for it. I could not
>>> see above series is committed.
>> It is committed and part of 5.8-rc1
>>
>> The latest rc should have some scheduler fixes also.
>>
>> I also note that there has been much churn on blk-mq tag code lately, and
>> something may be broken, so I plan to verify latest rc myself soon.
> Thanks. I will try merging 5.8-rc1 and RFC and see how CPU hot plug works.
> 

JFYI, I tested 5.8-rc4 for scheduler=none and =mq-deadline (not fully
tested), and it looks ok

john


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-07 16:17             ` John Garry
@ 2020-07-09 19:01               ` Kashyap Desai
  2020-07-10  8:10                 ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-07-09 19:01 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

> >> If you want to share an updated megaraid sas driver patch based on
> >> that, then that's fine. I can incorporate that change in the patch
> >> where we add host_tagset to the scsi host template.
> > If you share git repo link of next submission, I can send you
> > megaraid_sas driver patch which you can include in series.
>
> So this is my work-en-progress branch:
>
> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v8

I tested this repo + the megaraid_sas shared host tag driver. This repo
(5.8-rc) has the CPU hotplug patch:
"bf0beec0607d blk-mq: drain I/O when all CPUs in a hctx are offline"

Looking at the description of the above patch and its changes, it looks
like the megaraid_sas driver can still work without shared host tag for
this feature.

I observe that CPU hotplug works irrespective of shared host tag in
megaraid_sas on 5.8-rc.

Without shared host tag, the megaraid driver will expose a single hctx and
all the CPUs will be mapped to hctx0.
Any CPU offline event will trigger the "blk_mq_hctx_notify_offline"
callback in the blk-mq module. If we do not have this callback/patch, we
will see IO timeouts.
The blk_mq_hctx_notify_offline callback will make sure all the outstanding
IO on hctx0 is cleared, and only after it is cleared will the CPU go
offline.

The megaraid_sas driver has an internal reply_queue mapping which helps to
get IO completion on the same CPU. The driver gets the msix index from that
table based on "raw_smp_processor_id".
If the table is mapped correctly at probe time, it is not possible to pick
the entry of an offline CPU.
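
For reference, a simplified user-space sketch of such a per-CPU reply
table, assuming it is filled from the affinity mask of each vector at
probe time. The names, the 8-CPU limit and the round-robin fallback are
hypothetical and only meant to show the lookup, not the driver's actual
code.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 8

/* Build a cpu -> msix index table from per-vector affinity bitmasks. */
void model_setup_reply_map(unsigned int reply_map[MODEL_NR_CPUS],
			   const uint32_t *vector_affinity,
			   unsigned int nr_vectors)
{
	unsigned int cpu, vec;

	assert(nr_vectors > 0);

	/* Assumed fallback: round-robin for CPUs not present in any mask. */
	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		reply_map[cpu] = cpu % nr_vectors;

	/* A CPU listed in a vector's mask gets that vector's index. */
	for (vec = 0; vec < nr_vectors; vec++)
		for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
			if (vector_affinity[vec] & (1u << cpu))
				reply_map[cpu] = vec;
}

/* Lookup at submission time, indexed by the submitting CPU. */
unsigned int model_get_msix_index(const unsigned int reply_map[MODEL_NR_CPUS],
				  unsigned int submitting_cpu)
{
	assert(submitting_cpu < MODEL_NR_CPUS);
	return reply_map[submitting_cpu];
}
```

The lookup itself only ever runs on an online CPU. What the sketch does
not cover is a command that is already outstanding on a vector when the
CPUs in that vector's mask go offline.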

Am I missing anything?

If you can help me understand why we need shared host tag for CPU
hotplug, I can try to frame some test case for possible reproduction.

>
> I just updated to include the change to have Scsi_Host.host_tagset in
> 4291f617a02b commit ("scsi: Add host and host template flag
> 'host_tagset'")
>
> megaraid sas support is not on the branch yet, but I think everything else
> required is. And it is mutable, so I'd clone it now if I were you - or
> just replace
> the required patch onto your v7 branch.

I am working on this.

>
> Thanks,
> John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-09 19:01               ` Kashyap Desai
@ 2020-07-10  8:10                 ` John Garry
  2020-07-13  7:55                   ` Kashyap Desai
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-07-10  8:10 UTC (permalink / raw)
  To: Kashyap Desai, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX


>>
>> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v8
> I tested this repo + megaraid_sas shared hosttag driver. This repo (5.8-rc)
> has CPU hotplug patch.
> " bf0beec0607d blk-mq: drain I/O when all CPUs in a hctx are offline"
> 
> Looking at description of above patch and changes, it looks like
> megaraid_sas driver can still work without shared host tag for this feature.
> 
> I observe CPU hotplug works irrespective of shared host tag

Can you be clear exactly what you mean by "irrespective of shared host tag"?

Do you mean that for your test Scsi_Host.nr_hw_queues is set to expose 
hw queues and scsi_host_template.map_queues = blk_mq_pci_map_queues(), 
but you just don't set the host_tagset flag?

> in megaraid_sas
> on 5.8-rc.
> 
> Without shared host tag, megaraid driver will expose single hctx and all the
> CPU will be mapped to hctx0.

right

> Any CPU offline event will have " blk_mq_hctx_notify_offline" callback in
> blk-mq module. If we do not have this callback/patch, we will see IO
> timeout.
> blk_mq_hctx_notify_offline callback will make sure all the outstanding on
> hctx0 is cleared and only after it is cleared, CPU will go offline.

But that is only for when the last CPU for the hctx is going offline. If 
nr_hw_queues == 1, then hctx0 would cover all CPUs, so that would never 
occur during normal operation. See initial check in 
blk_mq_hctx_notify_offline():

static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
{
	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
		return 0;
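
The effect of that check can be modelled in user space as below. This is
only a sketch, assuming at most 64 CPUs held in a bitmask and assuming
the CPU going offline is still set in the online mask when the callback
runs. The model_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when `cpu` is the only online CPU left in the hctx's mask. */
bool model_last_cpu_in_hctx(unsigned int cpu, uint64_t hctx_mask,
			    uint64_t online_mask)
{
	assert(cpu < 64);
	return (hctx_mask & online_mask) == (UINT64_C(1) << cpu);
}

/* Mirrors the early return: drain only for the last CPU of the hctx. */
bool model_offline_needs_drain(unsigned int cpu, uint64_t hctx_mask,
			       uint64_t online_mask)
{
	assert(cpu < 64);

	/* CPU is not mapped to this hctx at all: nothing to do. */
	if (!(hctx_mask & (UINT64_C(1) << cpu)))
		return false;

	return model_last_cpu_in_hctx(cpu, hctx_mask, online_mask);
}
```

With nr_hw_queues == 1 the hctx mask covers every CPU, so the drain is
skipped unless every other CPU is already offline.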

> 
> megaraid_sas driver has  internal reply_queue mapping which helps to get IO
> completion on same cpu.  Driver get msix index from that table based on "
> raw_smp_processor_id".
> If table is mapped correctly at probe time,  It is not possible to pick
> entry of offline CPU.
> 
> Am I missing anything ?

Not sure, I think I need to be clear exactly what you're doing.

> 
> If you can help me to understand why we need shared host tag for CPU
> hotplug, I can try to frame some test case for possible reproduction.

I think it's best explained in cover letter for "blk-mq: Facilitate a 
shared sbitmap per tagset".

See points "HBA HW queues are required to be mapped to that of the 
blk-mq hctx", "HBA LLDD would have to generate this tag internally", and 
"blk-mq assumes the host may accept (Scsi_host.can_queue * #hw queue) 
commands".
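
As a back-of-the-envelope sketch of the last point, assuming per-hctx tag
sets of depth can_queue when host_tagset is not set, and one tag space of
depth can_queue shared by all hctxs when it is. The function names and
the hba_max_cmds parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Total number of commands blk-mq may have in flight towards the host. */
unsigned int model_total_tags(unsigned int can_queue,
			      unsigned int nr_hw_queues, bool host_tagset)
{
	assert(nr_hw_queues > 0);
	return host_tagset ? can_queue : can_queue * nr_hw_queues;
}

/* True when blk-mq could issue more commands than the HBA can hold. */
bool model_may_oversubscribe(unsigned int can_queue,
			     unsigned int nr_hw_queues, bool host_tagset,
			     unsigned int hba_max_cmds)
{
	return model_total_tags(can_queue, nr_hw_queues, host_tagset) >
	       hba_max_cmds;
}
```

For example, with can_queue = 4000 and 16 hw queues, the non-shared case
allows 64000 commands in flight against an HBA that only has 4000 slots,
while the shared case stays at 4000.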
> 
>> I just updated to include the change to have Scsi_Host.host_tagset in
>> 4291f617a02b commit ("scsi: Add host and host template flag
>> 'host_tagset'")
>>
>> megaraid sas support is not on the branch yet, but I think everything else
>> required is. And it is mutable, so I'd clone it now if I were you - or
>> just replace
>> the required patch onto your v7 branch.
> I am working on this.
> 

Great, thanks


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-10  8:10                 ` John Garry
@ 2020-07-13  7:55                   ` Kashyap Desai
  2020-07-13  8:42                     ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-07-13  7:55 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

> > Looking at description of above patch and changes, it looks like
> > megaraid_sas driver can still work without shared host tag for this
> > feature.
> >
> > I observe CPU hotplug works irrespective of shared host tag
>
> Can you be clear exactly what you mean by "irrespective of shared host
> tag"?
>
> Do you mean that for your test Scsi_Host.nr_hw_queues is set to expose hw
> queues and scsi_host_template.map_queues = blk_mq_pci_map_queues(),
> but you just don't set the host_tagset flag?

Yes. I only disabled "host_tagset". <map_queue> is still hooked.

>
>   in megaraid_sas
> > on 5.8-rc.
> >
> > Without shared host tag, megaraid driver will expose single hctx and
> > all the CPU will be mapped to hctx0.
>
> right
>
> > Any CPU offline event will have " blk_mq_hctx_notify_offline" callback
> > in blk-mq module. If we do not have this callback/patch, we will see
> > IO timeout.
> > blk_mq_hctx_notify_offline callback will make sure all the outstanding
> > on
> > hctx0 is cleared and only after it is cleared, CPU will go offline.
>
> But that is only for when the last CPU for the hctx is going offline. If
> nr_hw_queues == 1, then hctx0 would cover all CPUs, so that would never
> occur during normal operation. See initial check in
> blk_mq_hctx_notify_offline():
>
> static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node
> *node) {
> 	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
> 	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
> 		return 0;
>

Thanks John for this pointer. I missed this part, and now I understand what
was happening in my testing.
More than one CPU was mapped to one msix index in my earlier testing,
and because of that I could see interrupt migration happen to an available CPU
from the affinity mask. So my earlier testing was incorrect.

Now I am consistently able to reproduce the issue. The best setup is a 1:1
mapping of CPUs to MSI-X vectors. I used 128 logical CPUs and 128 msix
vectors, and I noticed IO timeouts without this RFC (without host_tagset).
I did not notice any IO timeout with the RFC (with host_tagset). I will add this
data to the driver's commit message.

Just for my understanding -
What if we have the code below in blk_mq_hctx_notify_offline? CPU hotplug should
then work for the megaraid_sas driver without this RFC (without shared host tagset).
Right?
If the answer is yes, will there be any side effect of having the code below in
the block layer?

static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node
 *node) {
 	if (hctx->queue->nr_hw_queues > 1
	    && (!cpumask_test_cpu(cpu, hctx->cpumask) ||
 	    !blk_mq_last_cpu_in_hctx(cpu, hctx)))
 		return 0;

I also noticed nr_hw_queues is now exposed in sysfs -

/sys/devices/pci0000:85/0000:85:00.0/0000:86:00.0/0000:87:04.0/0000:8b:00.0/0000:8c:00.0/0000:8d:00.0/host14/scsi_host/host14/nr_hw_queues:128

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-13  7:55                   ` Kashyap Desai
@ 2020-07-13  8:42                     ` John Garry
  2020-07-19 19:07                       ` Kashyap Desai
  2020-07-20  7:23                       ` Kashyap Desai
  0 siblings, 2 replies; 123+ messages in thread
From: John Garry @ 2020-07-13  8:42 UTC (permalink / raw)
  To: Kashyap Desai, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

On 13/07/2020 08:55, Kashyap Desai wrote:
>> occur during normal operation. See initial check in
>> blk_mq_hctx_notify_offline():
>>
>> static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node
>> *node) {
>> 	if (!cpumask_test_cpu(cpu, hctx->cpumask) ||
>> 	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
>> 		return 0;
>>
> Thanks John for this pointer. I missed this part, and now I understand what
> was happening in my testing.

JFYI, I always have this as a sanity check for my testing:

void irq_shutdown(struct irq_desc *desc)
{
+	pr_err("%s irq%d\n", __func__, desc->irq_data.irq);
+
	if (irqd_is_started(&desc->irq_data)) {
		desc->depth = 1;
		if (desc->irq_data.chip->irq_shutdown) {

> More than one CPU was mapped to one msix index in my earlier testing,
> and because of that I could see interrupt migration happen to an available CPU
> from the affinity mask. So my earlier testing was incorrect.
> 
> Now I am consistently able to reproduce the issue. The best setup is a 1:1
> mapping of CPUs to MSI-X vectors. I used 128 logical CPUs and 128 msix
> vectors, and I noticed IO timeouts without this RFC (without host_tagset).
> I did not notice any IO timeout with the RFC (with host_tagset). I will add this
> data to the driver's commit message.

ok, great. That's what we want. I'm not sure exactly what your test 
consists of, though.

> 
> Just for my understanding -
> What if we have the code below in blk_mq_hctx_notify_offline? CPU hotplug should
> then work for the megaraid_sas driver without this RFC (without shared host tagset).
> Right?

> If the answer is yes, will there be any side effect of having the code below in
> the block layer?
> 

Sorry, I'm not sure what you're getting at with this change, below.

It seems that you're trying to drain hctx0 (which is your only hctx, as 
nr_hw_queues = 0 without this patchset) and set it inactive whenever any 
CPU is offlined. If so, that's not right.
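
To make the effect concrete, here is a small userspace model of the two
checks (the mainline one and the proposed one quoted further below). It is
only a sketch: CPUs are bits in a 64-bit mask, and every toy_* name is
made up for illustration and is not kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: CPUs are bits in a 64-bit mask (hypothetical, not kernel code). */
struct toy_hctx {
	uint64_t cpumask;		/* CPUs mapped to this hctx */
	unsigned int nr_hw_queues;	/* hw queues of the owning request queue */
};

static bool toy_cpu_in_mask(unsigned int cpu, uint64_t mask)
{
	return (mask >> cpu) & 1;
}

/* True when 'cpu' is the only online CPU left in the hctx mask. */
static bool toy_last_cpu_in_hctx(unsigned int cpu, const struct toy_hctx *h,
				 uint64_t online)
{
	return (h->cpumask & online) == (UINT64_C(1) << cpu);
}

/* Mainline-style check: drain only when the last mapped CPU goes away. */
static bool toy_drains_mainline(unsigned int cpu, const struct toy_hctx *h,
				uint64_t online)
{
	if (!toy_cpu_in_mask(cpu, h->cpumask) ||
	    !toy_last_cpu_in_hctx(cpu, h, online))
		return false;
	return true;
}

/* Proposed check: the early return is skipped when nr_hw_queues == 1. */
static bool toy_drains_proposed(unsigned int cpu, const struct toy_hctx *h,
				uint64_t online)
{
	if (h->nr_hw_queues > 1 &&
	    (!toy_cpu_in_mask(cpu, h->cpumask) ||
	     !toy_last_cpu_in_hctx(cpu, h, online)))
		return false;
	return true;
}
```

With a single hctx covering CPUs 0-3 and all four online, offlining CPU2
drains nothing in the mainline model but drains hctx0 in the proposed one,
which is the behaviour described above.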

> static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node
>   *node) {
>   	if (hctx->queue->nr_hw_queues > 1
> 	    && (!cpumask_test_cpu(cpu, hctx->cpumask) ||
>   	    !blk_mq_last_cpu_in_hctx(cpu, hctx)))
>   		return 0;
> 
> I also noticed nr_hw_queues are now exposed in sysfs -
> 
> /sys/devices/pci0000:85/0000:85:00.0/0000:86:00.0/0000:87:04.0/0000:8b:00.0/0000:8c:00.0/0000:8d:00.0/host14/scsi_host/host14/nr_hw_queues:128
> .

That's on my v8 wip branch, so I guess you're picking it up from there.

Thanks,
John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  2020-06-12  6:06       ` Hannes Reinecke
  2020-06-29 15:32         ` About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap) John Garry
@ 2020-07-13  9:41         ` John Garry
  2020-07-13 12:20           ` Hannes Reinecke
  1 sibling, 1 reply; 123+ messages in thread
From: John Garry @ 2020-07-13  9:41 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, jejb, martin.petersen, don.brace,
	kashyap.desai, sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 12/06/2020 07:06, Hannes Reinecke wrote:
>>>>
>>>> +out:
>>>> +    sbitmap_free(&shared_sb);
>>>> +    return res;
>>>> +}
>>>> +
>>>>   static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
>>>>   {
>>>>       struct blk_mq_hw_ctx *hctx = data;
>>>> @@ -823,6 +884,7 @@ static const struct blk_mq_debugfs_attr 
>>>> blk_mq_debugfs_hctx_shared_sbitmap_attrs
>>>>       {"busy", 0400, hctx_busy_show},
>>>>       {"ctx_map", 0400, hctx_ctx_map_show},
>>>>       {"tags", 0400, hctx_tags_show},
>>>> +    {"tags_bitmap", 0400, hctx_tags_shared_sbitmap_bitmap_show},
>>>>       {"sched_tags", 0400, hctx_sched_tags_show},
>>>>       {"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
>>>>       {"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
>>>>
>>> Ah, right. Here it is.
>>>
>>> But I don't get it; why do we have to allocate a temporary bitmap and 
>>> can't walk the existing shared sbitmap?
>>

Just coming back to this now...

>> For the bitmap dump - sbitmap_bitmap_show() - it is passed a struct 
>> sbitmap *. So we have to filter into a temp per-hctx struct sbitmap. 
>> We could change sbitmap_bitmap_show() to accept a filter iterator - 
>> which I think you're getting at - but I am not sure it's worth the 
>> change. Or else use the allocated sbitmap for the hctx, as above*, but 
>> I may remove that (see 4/12 response).
>>
> Yes, I do think I would prefer updating sbitmap_bitmap_show() to accept 
> a filter. Especially as Ming Lei has now updated the tag iterators to 
> accept a filter, too, so it should be an easy change.

Can you please explain how you would use an iterator here?

In fact, I am half thinking of dropping this patch entirely.

So I feel that current behavior is a little strange, as some may expect 
/sys/kernel/debug/block/sdX/hctxY/tags_bitmap would only show tags for 
hctxY for sdX, while it is for hctxY for all queues. Same for 
/sys/kernel/debug/block/sdX/hctxY/tags

And then, for what we have in this patch:

static void hctx_filter_sb(struct sbitmap *sb, struct blk_mq_hw_ctx *hctx)
{
	struct hctx_sb_data hctx_sb_data = { .sb = sb, .hctx = hctx };

	blk_mq_queue_tag_busy_iter(hctx->queue, hctx_filter_fn, &hctx_sb_data);
}

This will give tags only for this queue. So not the same. So I feel it's 
better to change current behavior (so consistent) or change neither. And 
changing current behavior would also mean we need to change what we show 
in /sys/kernel/debug/block/sdX/hctxY/tags, and that looks troublesome also.
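
The difference between the two views can be sketched with a toy userspace
model. This is an illustration only: the owner table and all toy_* names
are assumptions and do not exist in blk-mq.

```c
#include <stdint.h>

#define TOY_NR_TAGS 8

/* Hypothetical record of who holds each tag in a shared tag space. */
struct toy_tag_owner {
	int busy;	/* non-zero when the tag is allocated */
	int queue;	/* request queue (e.g. sdX) that issued the request */
	int hctx;	/* hw queue index the request was issued on */
};

/* Busy tags on hctx 'hctx', counting every request queue. */
static uint32_t toy_bitmap_hctx_all_queues(const struct toy_tag_owner *t,
					   int hctx)
{
	uint32_t map = 0;

	for (int i = 0; i < TOY_NR_TAGS; i++)
		if (t[i].busy && t[i].hctx == hctx)
			map |= UINT32_C(1) << i;
	return map;
}

/* Busy tags on hctx 'hctx' for one request queue only. */
static uint32_t toy_bitmap_hctx_one_queue(const struct toy_tag_owner *t,
					  int queue, int hctx)
{
	uint32_t map = 0;

	for (int i = 0; i < TOY_NR_TAGS; i++)
		if (t[i].busy && t[i].queue == queue && t[i].hctx == hctx)
			map |= UINT32_C(1) << i;
	return map;
}
```

The first helper corresponds to "hctxY for all queues", the second to the
per-queue result that filtering via hctx->queue would give.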

Opinion?

Thanks,
John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  2020-07-13  9:41         ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap John Garry
@ 2020-07-13 12:20           ` Hannes Reinecke
  0 siblings, 0 replies; 123+ messages in thread
From: Hannes Reinecke @ 2020-07-13 12:20 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	kashyap.desai, sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 7/13/20 11:41 AM, John Garry wrote:
> On 12/06/2020 07:06, Hannes Reinecke wrote:
>>>>>
>>>>> +out:
>>>>> +    sbitmap_free(&shared_sb);
>>>>> +    return res;
>>>>> +}
>>>>> +
>>>>>   static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
>>>>>   {
>>>>>       struct blk_mq_hw_ctx *hctx = data;
>>>>> @@ -823,6 +884,7 @@ static const struct blk_mq_debugfs_attr
>>>>> blk_mq_debugfs_hctx_shared_sbitmap_attrs
>>>>>       {"busy", 0400, hctx_busy_show},
>>>>>       {"ctx_map", 0400, hctx_ctx_map_show},
>>>>>       {"tags", 0400, hctx_tags_show},
>>>>> +    {"tags_bitmap", 0400, hctx_tags_shared_sbitmap_bitmap_show},
>>>>>       {"sched_tags", 0400, hctx_sched_tags_show},
>>>>>       {"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
>>>>>       {"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
>>>>>
>>>> Ah, right. Here it is.
>>>>
>>>> But I don't get it; why do we have to allocate a temporary bitmap
>>>> and can't walk the existing shared sbitmap?
>>>
> 
> Just coming back to this now...
> 
>>> For the bitmap dump - sbitmap_bitmap_show() - it is passed a struct
>>> sbitmap *. So we have to filter into a temp per-hctx struct sbitmap.
>>> We could change sbitmap_bitmap_show() to accept a filter iterator -
>>> which I think you're getting at - but I am not sure it's worth the
>>> change. Or else use the allocated sbitmap for the hctx, as above*,
>>> but I may remove that (see 4/12 response).
>>>
>> Yes, I do think I would prefer updating sbitmap_bitmap_show() to
>> accept a filter. Especially as Ming Lei has now updated the tag
>> iterators to accept a filter, too, so it should be an easy change.
> 
> Can you please explain how you would use an iterator here?
> 
> In fact, I am half thinking of dropping this patch entirely.
> 
> So I feel that current behavior is a little strange, as some may expect
> /sys/kernel/debug/block/sdX/hctxY/tags_bitmap would only show tags for
> hctxY for sdX, while it is for hctxY for all queues. Same for
> /sys/kernel/debug/block/sdX/hctxY/tags
> 
> And then, for what we have in this patch:
> 
> static void hctx_filter_sb(struct sbitmap *sb, struct blk_mq_hw_ctx *hctx)
> {
> 	struct hctx_sb_data hctx_sb_data = { .sb = sb, .hctx = hctx };
> 
> 	blk_mq_queue_tag_busy_iter(hctx->queue, hctx_filter_fn, &hctx_sb_data);
> }
> 
> This will give tags only for this queue. So not the same. So I feel it's
> better to change current behavior (so consistent) or change neither. And
> changing current behavior would also mean we need to change what we show
> in /sys/kernel/debug/block/sdX/hctxY/tags, and that looks troublesome also.
> 
> Opinion?
> 
The whole notion of having sysfs presenting tags per hctx doesn't really
apply anymore when running with shared tags.

We could be sticking with the per-hctx attribute, but then busy tags
won't be displayed correctly as tags might be busy, but not on this hctx.
The alternative idea of projecting everything over to hctx0 (or just
duplicating the output from hctx0) would be technically correct, but
would be missing the per-hctx information.

Ideally we would have some sort of tri-state information here: busy,
busy on other hctx, not busy.
Then the per-hctx attribute would start making sense again.
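
A rough userspace sketch of what such a tri-state view could look like.
Nothing like this exists in the debugfs code; the owner array, the toy_*
names and the output characters are all made up for illustration.

```c
#include <stddef.h>

enum toy_tag_state {
	TOY_TAG_FREE,
	TOY_TAG_BUSY_HERE,
	TOY_TAG_BUSY_ELSEWHERE,
};

/* owner[i] is the hctx index holding tag i, or -1 if the tag is free. */
static enum toy_tag_state toy_classify(const int *owner, size_t tag, int hctx)
{
	if (owner[tag] < 0)
		return TOY_TAG_FREE;
	return owner[tag] == hctx ? TOY_TAG_BUSY_HERE : TOY_TAG_BUSY_ELSEWHERE;
}

/* Render one character per tag into 'out' (needs nr_tags + 1 bytes). */
static void toy_render(const int *owner, size_t nr_tags, int hctx, char *out)
{
	/* '.' = not busy, 'B' = busy on this hctx, 'o' = busy on another */
	static const char sym[] = { '.', 'B', 'o' };

	for (size_t i = 0; i < nr_tags; i++)
		out[i] = sym[toy_classify(owner, i, hctx)];
	out[nr_tags] = '\0';
}
```

Each hctx would then print the whole shared tag space, with its own busy
tags distinguished from those held through other hctxs.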

Otherwise I would just leave it for now.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-06-10 17:29 ` [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ John Garry
@ 2020-07-14  7:37   ` John Garry
  2020-07-14  7:41     ` Hannes Reinecke
                       ` (2 more replies)
  0 siblings, 3 replies; 123+ messages in thread
From: John Garry @ 2020-07-14  7:37 UTC (permalink / raw)
  To: don.brace, Hannes Reinecke
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 10/06/2020 18:29, John Garry wrote:
> From: Hannes Reinecke <hare@suse.de>
> 
> The smart array HBAs can steer interrupt completion, so this
> patch switches the implementation to use multiqueue and enables
> 'host_tagset' as the HBA has a shared host-wide tagset.
> 

Hi Don,

I am preparing the next iteration of this series, and we're getting 
close to dropping the RFC tags. The series has grown a bit, and I am not 
sure what to do with hpsa support.

The latest versions of this series have not been tested for hpsa, AFAIK. 
Can you let me know if you can test and review this patch? Or someone 
else let me know it's tested (Hannes?)

Thanks

> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/scsi/hpsa.c | 44 +++++++-------------------------------------
>   drivers/scsi/hpsa.h |  1 -
>   2 files changed, 7 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 1e9302e99d05..f807f9bdae85 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -980,6 +980,7 @@ static struct scsi_host_template hpsa_driver_template = {
>   	.shost_attrs = hpsa_shost_attrs,
>   	.max_sectors = 2048,
>   	.no_write_same = 1,
> +	.host_tagset = 1,
>   };
>   
>   static inline u32 next_command(struct ctlr_info *h, u8 q)
> @@ -1144,12 +1145,14 @@ static void dial_up_lockup_detection_on_fw_flash_complete(struct ctlr_info *h,
>   static void __enqueue_cmd_and_start_io(struct ctlr_info *h,
>   	struct CommandList *c, int reply_queue)
>   {
> +	u32 blk_tag = blk_mq_unique_tag(c->scsi_cmd->request);
> +
>   	dial_down_lockup_detection_during_fw_flash(h, c);
>   	atomic_inc(&h->commands_outstanding);
>   	if (c->device)
>   		atomic_inc(&c->device->commands_outstanding);
>   
> -	reply_queue = h->reply_map[raw_smp_processor_id()];
> +	reply_queue = blk_mq_unique_tag_to_hwq(blk_tag);
>   	switch (c->cmd_type) {
>   	case CMD_IOACCEL1:
>   		set_ioaccel1_performant_mode(h, c, reply_queue);
> @@ -5653,8 +5656,6 @@ static int hpsa_scsi_queue_command(struct Scsi_Host *sh, struct scsi_cmnd *cmd)
>   	/* Get the ptr to our adapter structure out of cmd->host. */
>   	h = sdev_to_hba(cmd->device);
>   
> -	BUG_ON(cmd->request->tag < 0);
> -
>   	dev = cmd->device->hostdata;
>   	if (!dev) {
>   		cmd->result = DID_NO_CONNECT << 16;
> @@ -5830,7 +5831,7 @@ static int hpsa_scsi_host_alloc(struct ctlr_info *h)
>   	sh->hostdata[0] = (unsigned long) h;
>   	sh->irq = pci_irq_vector(h->pdev, 0);
>   	sh->unique_id = sh->irq;
> -
> +	sh->nr_hw_queues = h->msix_vectors > 0 ? h->msix_vectors : 1;
>   	h->scsi_host = sh;
>   	return 0;
>   }
> @@ -5856,7 +5857,8 @@ static int hpsa_scsi_add_host(struct ctlr_info *h)
>    */
>   static int hpsa_get_cmd_index(struct scsi_cmnd *scmd)
>   {
> -	int idx = scmd->request->tag;
> +	u32 blk_tag = blk_mq_unique_tag(scmd->request);
> +	int idx = blk_mq_unique_tag_to_tag(blk_tag);
>   
>   	if (idx < 0)
>   		return idx;
> @@ -7456,26 +7458,6 @@ static void hpsa_disable_interrupt_mode(struct ctlr_info *h)
>   	h->msix_vectors = 0;
>   }
>   
> -static void hpsa_setup_reply_map(struct ctlr_info *h)
> -{
> -	const struct cpumask *mask;
> -	unsigned int queue, cpu;
> -
> -	for (queue = 0; queue < h->msix_vectors; queue++) {
> -		mask = pci_irq_get_affinity(h->pdev, queue);
> -		if (!mask)
> -			goto fallback;
> -
> -		for_each_cpu(cpu, mask)
> -			h->reply_map[cpu] = queue;
> -	}
> -	return;
> -
> -fallback:
> -	for_each_possible_cpu(cpu)
> -		h->reply_map[cpu] = 0;
> -}
> -
>   /* If MSI/MSI-X is supported by the kernel we will try to enable it on
>    * controllers that are capable. If not, we use legacy INTx mode.
>    */
> @@ -7872,9 +7854,6 @@ static int hpsa_pci_init(struct ctlr_info *h)
>   	if (err)
>   		goto clean1;
>   
> -	/* setup mapping between CPU and reply queue */
> -	hpsa_setup_reply_map(h);
> -
>   	err = hpsa_pci_find_memory_BAR(h->pdev, &h->paddr);
>   	if (err)
>   		goto clean2;	/* intmode+region, pci */
> @@ -8613,7 +8592,6 @@ static struct workqueue_struct *hpsa_create_controller_wq(struct ctlr_info *h,
>   
>   static void hpda_free_ctlr_info(struct ctlr_info *h)
>   {
> -	kfree(h->reply_map);
>   	kfree(h);
>   }
>   
> @@ -8622,14 +8600,6 @@ static struct ctlr_info *hpda_alloc_ctlr_info(void)
>   	struct ctlr_info *h;
>   
>   	h = kzalloc(sizeof(*h), GFP_KERNEL);
> -	if (!h)
> -		return NULL;
> -
> -	h->reply_map = kcalloc(nr_cpu_ids, sizeof(*h->reply_map), GFP_KERNEL);
> -	if (!h->reply_map) {
> -		kfree(h);
> -		return NULL;
> -	}
>   	return h;
>   }
>   
> diff --git a/drivers/scsi/hpsa.h b/drivers/scsi/hpsa.h
> index f8c88fc7b80a..ea4a609e3eb7 100644
> --- a/drivers/scsi/hpsa.h
> +++ b/drivers/scsi/hpsa.h
> @@ -161,7 +161,6 @@ struct bmic_controller_parameters {
>   #pragma pack()
>   
>   struct ctlr_info {
> -	unsigned int *reply_map;
>   	int	ctlr;
>   	char	devname[8];
>   	char    *product_name;
> 
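
For reference, the diff above replaces the per-CPU reply_map lookup with
the hw queue index carried in the blk-mq unique tag. A userspace sketch of
that encoding follows; the 16-bit split mirrors BLK_MQ_UNIQUE_TAG_BITS as
I understand it from include/linux/blk-mq.h, so treat the constant as an
assumption, and note the toy_* helpers are illustrative names only.

```c
#include <stdint.h>

/* Assumed to mirror the kernel layout: hwq index in the upper 16 bits. */
#define TOY_UNIQUE_TAG_BITS 16
#define TOY_UNIQUE_TAG_MASK ((UINT32_C(1) << TOY_UNIQUE_TAG_BITS) - 1)

/* Combine a hw queue index and a per-queue tag into one 32-bit value. */
static uint32_t toy_unique_tag(uint16_t hwq, uint16_t tag)
{
	return ((uint32_t)hwq << TOY_UNIQUE_TAG_BITS) |
	       (tag & TOY_UNIQUE_TAG_MASK);
}

/* Recover the hw queue index (used above to pick the reply queue). */
static uint16_t toy_unique_tag_to_hwq(uint32_t unique_tag)
{
	return unique_tag >> TOY_UNIQUE_TAG_BITS;
}

/* Recover the tag (used above as the command index). */
static uint16_t toy_unique_tag_to_tag(uint32_t unique_tag)
{
	return unique_tag & TOY_UNIQUE_TAG_MASK;
}
```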


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-14  7:37   ` John Garry
@ 2020-07-14  7:41     ` Hannes Reinecke
  2020-07-14  7:52       ` John Garry
  2020-07-16 16:14     ` Don.Brace
  2020-07-16 19:45     ` Don.Brace
  2 siblings, 1 reply; 123+ messages in thread
From: Hannes Reinecke @ 2020-07-14  7:41 UTC (permalink / raw)
  To: John Garry, don.brace
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 7/14/20 9:37 AM, John Garry wrote:
> On 10/06/2020 18:29, John Garry wrote:
>> From: Hannes Reinecke <hare@suse.de>
>>
>> The smart array HBAs can steer interrupt completion, so this
>> patch switches the implementation to use multiqueue and enables
>> 'host_tagset' as the HBA has a shared host-wide tagset.
>>
> 
> Hi Don,
> 
> I am preparing the next iteration of this series, and we're getting
> close to dropping the RFC tags. The series has grown a bit, and I am not
> sure what to do with hpsa support.
> 
> The latest versions of this series have not been tested for hpsa, AFAIK.
> Can you let me know if you can test and review this patch? Or someone
> else let me know it's tested (Hannes?)
> 
I'll give it a go.

Which git repository should I base the tests on?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-14  7:41     ` Hannes Reinecke
@ 2020-07-14  7:52       ` John Garry
  2020-07-14  8:06         ` Ming Lei
  2020-08-03 20:39         ` Don.Brace
  0 siblings, 2 replies; 123+ messages in thread
From: John Garry @ 2020-07-14  7:52 UTC (permalink / raw)
  To: Hannes Reinecke, don.brace
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 14/07/2020 08:41, Hannes Reinecke wrote:
> On 7/14/20 9:37 AM, John Garry wrote:
>> On 10/06/2020 18:29, John Garry wrote:
>>> From: Hannes Reinecke <hare@suse.de>
>>>
>>> The smart array HBAs can steer interrupt completion, so this
>>> patch switches the implementation to use multiqueue and enables
>>> 'host_tagset' as the HBA has a shared host-wide tagset.
>>>
>>
>> Hi Don,
>>
>> I am preparing the next iteration of this series, and we're getting
>> close to dropping the RFC tags. The series has grown a bit, and I am not
>> sure what to do with hpsa support.
>>
>> The latest versions of this series have not been tested for hpsa, AFAIK.
>> Can you let me know if you can test and review this patch? Or someone
>> else let me know it's tested (Hannes?)
>>
> I'll give it a go.
> 
> Which git repository should I base the tests on?

v7 is here:

https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7

So that should be good to test with for now.

And I was going to ask this same question about smartpqi, so can you 
please let me know about this one?

Thanks,
John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-14  7:52       ` John Garry
@ 2020-07-14  8:06         ` Ming Lei
  2020-07-14  9:53           ` John Garry
  2020-08-03 20:39         ` Don.Brace
  1 sibling, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-07-14  8:06 UTC (permalink / raw)
  To: John Garry
  Cc: Hannes Reinecke, don.brace, axboe, jejb, martin.petersen,
	kashyap.desai, sumit.saxena, bvanassche, hare, hch,
	shivasharan.srikanteshwara, linux-block, linux-scsi,
	esc.storagedev, chenxiang66, megaraidlinux.pdl

On Tue, Jul 14, 2020 at 08:52:36AM +0100, John Garry wrote:
> On 14/07/2020 08:41, Hannes Reinecke wrote:
> > On 7/14/20 9:37 AM, John Garry wrote:
> > > On 10/06/2020 18:29, John Garry wrote:
> > > > From: Hannes Reinecke <hare@suse.de>
> > > > 
> > > > The smart array HBAs can steer interrupt completion, so this
> > > > patch switches the implementation to use multiqueue and enables
> > > > 'host_tagset' as the HBA has a shared host-wide tagset.
> > > > 
> > > 
> > > Hi Don,
> > > 
> > > I am preparing the next iteration of this series, and we're getting
> > > close to dropping the RFC tags. The series has grown a bit, and I am not
> > > sure what to do with hpsa support.
> > > 
> > > The latest versions of this series have not been tested for hpsa, AFAIK.
> > > Can you let me know if you can test and review this patch? Or someone
> > > else let me know it's tested (Hannes?)
> > > 
> > I'll give it a go.
> > 
> > Which git repository should I base the tests on?
> 
> v7 is here:
> 
> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7
> 
> So that should be good to test with for now.
> 
> And I was going to ask this same question about smartpqi, so can you please
> let me know about this one?

smartpqi is real MQ HBA, do you need any change wrt. shared tags?


Thanks,
Ming


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-14  8:06         ` Ming Lei
@ 2020-07-14  9:53           ` John Garry
  2020-07-14 10:14             ` Ming Lei
  2020-07-14 10:19             ` Hannes Reinecke
  0 siblings, 2 replies; 123+ messages in thread
From: John Garry @ 2020-07-14  9:53 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hannes Reinecke, don.brace, axboe, jejb, martin.petersen,
	kashyap.desai, sumit.saxena, bvanassche, hare, hch,
	shivasharan.srikanteshwara, linux-block, linux-scsi,
	esc.storagedev, chenxiang66, megaraidlinux.pdl

On 14/07/2020 09:06, Ming Lei wrote:
>> v7 is here:
>>
>> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7
>>
>> So that should be good to test with for now.
>>
>> And I was going to ask this same question about smartpqi, so can you please
>> let me know about this one?

Hi Ming,

> smartpqi is real MQ HBA, do you need any change wrt. shared tags?

Is it really?

As I see, today it maintains a single tagset per HBA. So Hannes' change 
in this series seems ok. However, I do worry that mainline code may be 
wrong, as block layer may send can_queue * nr_hw_queues requests, when 
it seems the HBA can only handle can_queue requests.

Thanks,
john

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-14  9:53           ` John Garry
@ 2020-07-14 10:14             ` Ming Lei
  2020-07-14 10:43               ` Hannes Reinecke
  2020-07-14 10:19             ` Hannes Reinecke
  1 sibling, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-07-14 10:14 UTC (permalink / raw)
  To: John Garry
  Cc: Hannes Reinecke, don.brace, axboe, jejb, martin.petersen,
	kashyap.desai, sumit.saxena, bvanassche, hare, hch,
	shivasharan.srikanteshwara, linux-block, linux-scsi,
	esc.storagedev, chenxiang66, megaraidlinux.pdl

On Tue, Jul 14, 2020 at 10:53:32AM +0100, John Garry wrote:
> On 14/07/2020 09:06, Ming Lei wrote:
> > > v7 is here:
> > > 
> > > https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7
> > > 
> > > So that should be good to test with for now.
> > > 
> > > And I was going to ask this same question about smartpqi, so can you please
> > > let me know about this one?
> 
> Hi Ming,
> 
> > smartpqi is real MQ HBA, do you need any change wrt. shared tags?
> 
> Is it really?

Yes, it is.

pqi_register_scsi():
        shost->nr_hw_queues = ctrl_info->num_queue_groups;

pqi_enable_msix_interrupts():
        num_vectors_enabled = pci_alloc_irq_vectors(ctrl_info->pci_dev,
                        PQI_MIN_MSIX_VECTORS, ctrl_info->num_queue_groups,
                        PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);

> 
> As I see, today it maintains a single tagset per HBA. So Hannes' change in

No, each hw queue has one independent tagset for smartpqi.

> this series seems ok. However, I do worry that mainline code may be wrong,
> as block layer may send can_queue * nr_hw_queues requests, when it seems the
> HBA can only handle can_queue requests.

I have one machine in which the whole system is installed on smartpqi disks,
and I work on that system almost every day; so far so good with this real
MQ style.

Can you explain a bit why you worry the mainline code may be wrong?


Thanks,
Ming


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-14  9:53           ` John Garry
  2020-07-14 10:14             ` Ming Lei
@ 2020-07-14 10:19             ` Hannes Reinecke
  2020-07-14 10:35               ` John Garry
  2020-07-14 10:44               ` Ming Lei
  1 sibling, 2 replies; 123+ messages in thread
From: Hannes Reinecke @ 2020-07-14 10:19 UTC (permalink / raw)
  To: John Garry, Ming Lei
  Cc: don.brace, axboe, jejb, martin.petersen, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 7/14/20 11:53 AM, John Garry wrote:
> On 14/07/2020 09:06, Ming Lei wrote:
>>> v7 is here:
>>>
>>> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7
>>>
>>>
>>> So that should be good to test with for now.
>>>
>>> And I was going to ask this same question about smartpqi, so can you
>>> please
>>> let me know about this one?
> 
> Hi Ming,
> 
>> smartpqi is real MQ HBA, do you need any change wrt. shared tags?
> 
> Is it really?
> 
> As I see, today it maintains a single tagset per HBA. So Hannes' change
> in this series seems ok. However, I do worry that mainline code may be
> wrong, as block layer may send can_queue * nr_hw_queues requests, when
> it seems the HBA can only handle can_queue requests.
> 
Correct. There is only one tagset per host, even if the host supports
several queues (guess why it's called smart PQI :-).
And mainline code isn't really wrong, it just allocates the next free
tag from the host tagset; it's not using the block-layer tags at all.
Precisely because the block layer currently cannot guarantee that tags
are unique per host.

And the point of this patchset is exactly that the block layer will only
send up to 'can_queue' requests, irrespective of how many hardware
queues are present.
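
As a back-of-the-envelope illustration of that bound (a sketch only; the
helper name is made up and the numbers in the note are not from any real
HBA):

```c
/* Upper bound on requests the block layer may have in flight for a host. */
static unsigned long toy_max_inflight(unsigned int can_queue,
				      unsigned int nr_hw_queues,
				      int host_tagset)
{
	if (host_tagset)
		return can_queue;	/* one tag space shared by all hctxs */

	/* one tag space of depth can_queue per hctx */
	return (unsigned long)can_queue * nr_hw_queues;
}
```

With can_queue = 1024 and 16 hardware queues this gives 16384 without
shared tags and 1024 with them.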

But anyway, I'll test it on smartpqi, too.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-14 10:19             ` Hannes Reinecke
@ 2020-07-14 10:35               ` John Garry
  2020-07-14 10:44               ` Ming Lei
  1 sibling, 0 replies; 123+ messages in thread
From: John Garry @ 2020-07-14 10:35 UTC (permalink / raw)
  To: Hannes Reinecke, Ming Lei
  Cc: don.brace, axboe, jejb, martin.petersen, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 14/07/2020 11:19, Hannes Reinecke wrote:
> Correct. There is only one tagset per host, even if the host supports
> several queues (guess why it's called smart PQI:-).
> And mainline code isn't really wrong, it just allocates the next free
> tag from the host tagset; it's not using the block-layer tags at all.
> Precisely because the block layer currently cannot guarantee that tags
> are unique per host.

ok, but it's not exemplary in what it does. And I suppose that's why it 
needs to spin indefinitely for a free tag in pqi_alloc_io_request(), 
instead of failing immediately when none are free.
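
To illustrate the distinction, a userspace sketch of a refcount-claimed
slot pool that gives up after one pass over the slots instead of spinning.
This is not the smartpqi code and not a proposed patch; the pool size and
all toy_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

#define TOY_MAX_SLOTS 4

struct toy_slot {
	atomic_int refcount;	/* 0 = free, 1 = owned */
};

struct toy_pool {
	struct toy_slot slots[TOY_MAX_SLOTS];
	unsigned int next;	/* racy start hint for the scan */
};

/* Claim a free slot, visiting each slot at most once; NULL if all are busy. */
static struct toy_slot *toy_alloc_slot(struct toy_pool *p)
{
	unsigned int i = p->next;

	for (unsigned int tries = 0; tries < TOY_MAX_SLOTS; tries++) {
		struct toy_slot *s = &p->slots[i];

		/* fetch_add returns the old value: 0 means we now own it */
		if (atomic_fetch_add(&s->refcount, 1) == 0) {
			p->next = (i + 1) % TOY_MAX_SLOTS;
			return s;
		}
		atomic_fetch_sub(&s->refcount, 1);
		i = (i + 1) % TOY_MAX_SLOTS;
	}
	return NULL;
}

static void toy_free_slot(struct toy_slot *s)
{
	atomic_fetch_sub(&s->refcount, 1);
}
```

If the block layer never sends more than can_queue requests per host, a
free slot should always exist, so the NULL path would only be a safety net.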

> 
> And the point of this patchset is exactly that the block layer will only
> send up to 'can_queue' requests, irrespective of how many hardware
> queues are present.
> 
> But anyway, I'll test it on smartpqi, too.
> 

Great, thanks.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-14 10:14             ` Ming Lei
@ 2020-07-14 10:43               ` Hannes Reinecke
  0 siblings, 0 replies; 123+ messages in thread
From: Hannes Reinecke @ 2020-07-14 10:43 UTC (permalink / raw)
  To: Ming Lei, John Garry
  Cc: don.brace, axboe, jejb, martin.petersen, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 7/14/20 12:14 PM, Ming Lei wrote:
> On Tue, Jul 14, 2020 at 10:53:32AM +0100, John Garry wrote:
>> On 14/07/2020 09:06, Ming Lei wrote:
>>>> v7 is here:
>>>>
>>>> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7
>>>>
>>>> So that should be good to test with for now.
>>>>
>>>> And I was going to ask this same question about smartpqi, so can you please
>>>> let me know about this one?
>>
>> Hi Ming,
>>
>>> smartpqi is real MQ HBA, do you need any change wrt. shared tags?
>>
>> Is it really?
> 
> Yes, it is.
> 
> pqi_register_scsi():
>         shost->nr_hw_queues = ctrl_info->num_queue_groups;
> 
> pqi_enable_msix_interrupts():
>         num_vectors_enabled = pci_alloc_irq_vectors(ctrl_info->pci_dev,
>                         PQI_MIN_MSIX_VECTORS, ctrl_info->num_queue_groups,
>                         PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
> 
>>
>> As I see, today it maintains a single tagset per HBA. So Hannes' change in
> 
> No, each hw queue has one independent tagset for smartpqi.
> 
Has it really?

The code has this:

static struct pqi_io_request *pqi_alloc_io_request(
	struct pqi_ctrl_info *ctrl_info)
{
	struct pqi_io_request *io_request;
	u16 i = ctrl_info->next_io_request_slot;	/* benignly racy */

	while (1) {
		io_request = &ctrl_info->io_request_pool[i];
		if (atomic_inc_return(&io_request->refcount) == 1)
			break;
		atomic_dec(&io_request->refcount);
		i = (i + 1) % ctrl_info->max_io_slots;
	}


which means that at least the driver assumes a host-wide tagset.
Looking at the code the HW _should_ support per-queue tagsets (it's PQI,
after all), but this doesn't seem to be carried over for the entire
driver; possibly due to legacy concerns.
Don should be able to shed some light here.

Meanwhile we have to assume that the _driver_ uses a per-host tagset; we
might be able to convert it to a per-queue tagset, but that looks like a
major update of the driver itself.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-14 10:19             ` Hannes Reinecke
  2020-07-14 10:35               ` John Garry
@ 2020-07-14 10:44               ` Ming Lei
  2020-07-14 10:52                 ` John Garry
  1 sibling, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-07-14 10:44 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: John Garry, don.brace, axboe, jejb, martin.petersen,
	kashyap.desai, sumit.saxena, bvanassche, hare, hch,
	shivasharan.srikanteshwara, linux-block, linux-scsi,
	esc.storagedev, chenxiang66, megaraidlinux.pdl

On Tue, Jul 14, 2020 at 12:19:07PM +0200, Hannes Reinecke wrote:
> On 7/14/20 11:53 AM, John Garry wrote:
> > On 14/07/2020 09:06, Ming Lei wrote:
> >>> v7 is here:
> >>>
> >>> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7
> >>>
> >>>
> >>> So that should be good to test with for now.
> >>>
> >>> And I was going to ask this same question about smartpqi, so can you
> >>> please
> >>> let me know about this one?
> > 
> > Hi Ming,
> > 
> >> smartpqi is real MQ HBA, do you need any change wrt. shared tags?
> > 
> > Is it really?
> > 
> > As I see, today it maintains a single tagset per HBA. So Hannes' change
> > in this series seems ok. However, I do worry that mainline code may be
> > wrong, as block layer may send can_queue * nr_hw_queues requests, when
> > it seems the HBA can only handle can_queue requests.
> > 
> Correct. There is only one tagset per host, even if the host supports
> several queues (guess why it's called smart PQI :-).
> And mainline code isn't really wrong, it just allocates the next free
> tag from the host tagset; it's not using the block-layer tags at all.
> Precisely because the block layer currently cannot guarantee that tags
> are unique per host.

OK, pqi_alloc_io_request() does the real tag allocation, which looks like
a very bad implementation, because the real tags can be used up easily.

On my machine there are 32 queues (32 CPU cores) and each queue has 1013
tags, so 32*1013 requests can arrive from the block layer, while smartpqi
can only handle 1013 requests. I guess it isn't hard to trigger a soft
lockup by running heavy/concurrent smartpqi IO.

> 
> And the point of this patchset is exactly that the block layer will only
> send up to 'can_queue' requests, irrespective on how many hardware
> queues are present.

That is only true for shared tags.
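
To put numbers on that, a small sketch of the bound in both cases. The
function name max_inflight() is made up for this example; it only models
the arithmetic discussed in this thread, not any block-layer API.

```c
#include <assert.h>

/*
 * Upper bound on requests the block layer may have in flight for one host.
 * Without a host-wide tagset each hardware queue has its own depth-sized
 * tag space; with one, all queues draw from a single space.
 */
unsigned int max_inflight(unsigned int nr_hw_queues, unsigned int depth,
			  int host_tagset)
{
	assert(nr_hw_queues > 0);
	return host_tagset ? depth : nr_hw_queues * depth;
}
```

With 32 queues of depth 1013 this gives 32416 without shared tags and 1013
with them.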

Thanks,
Ming


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-14 10:44               ` Ming Lei
@ 2020-07-14 10:52                 ` John Garry
  2020-07-14 12:04                   ` Ming Lei
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-07-14 10:52 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: don.brace, axboe, jejb, martin.petersen, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

> 
> On my machine there are 32 queues (32 CPU cores) and each queue has 1013
> tags, so 32*1013 requests can arrive from the block layer, while smartpqi
> can only handle 1013 requests. I guess it isn't hard to trigger a soft
> lockup by running heavy/concurrent smartpqi IO.

Since pqi_alloc_io_request() does not take a spinlock, disable preemption,
etc., I guess that a plain IO timeout is more likely.

But I see in pqi_get_physical_disk_info() that there is some
intelligence to set the queue depth, which may reduce the chance of a
timeout (by reducing the disk queue depth). Not sure.

> 
>>
>> And the point of this patchset is exactly that the block layer will only
>> send up to 'can_queue' requests, irrespective on how many hardware
>> queues are present.
> 
> That is only true for shared tags.
> 

Thanks,
John


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-14 10:52                 ` John Garry
@ 2020-07-14 12:04                   ` Ming Lei
  0 siblings, 0 replies; 123+ messages in thread
From: Ming Lei @ 2020-07-14 12:04 UTC (permalink / raw)
  To: John Garry
  Cc: Hannes Reinecke, don.brace, axboe, jejb, martin.petersen,
	kashyap.desai, sumit.saxena, bvanassche, hare, hch,
	shivasharan.srikanteshwara, linux-block, linux-scsi,
	esc.storagedev, chenxiang66, megaraidlinux.pdl

On Tue, Jul 14, 2020 at 11:52:52AM +0100, John Garry wrote:
> > 
> > On my machine there are 32 queues (32 CPU cores) and each queue has 1013
> > tags, so 32*1013 requests can arrive from the block layer, while smartpqi
> > can only handle 1013 requests. I guess it isn't hard to trigger a soft
> > lockup by running heavy/concurrent smartpqi IO.
> 
> Since pqi_alloc_io_request() does not use spinlock, disable preemption,

The RCU read lock is held when .queue_rq() is called, and preempt_disable()
is implied when CONFIG_PREEMPT_RCU is off.

A CPU looping in an RCU read-side critical section may cause related
issues, because RCU's CPU stall detector will warn about it.

> etc., so I guess that there is more of a chance of simply IO timeout.
> 
> But I see in pqi_get_physical_disk_info() that there is some intelligence to
> set the queue depth, which may reduce chance of timeout (by reducing disk
> queue depth). Not sure.

It may not work, see:

[root@hp-dl380g10-01 mingl]# cat /sys/block/sd[a-f]/device/queue_depth
1013
1013
1013
1013
1013
1013

All sd[a-f] are smartpqi LUNs.

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 11/12] smartpqi: enable host tagset
  2020-06-10 17:29 ` [PATCH RFC v7 11/12] smartpqi: enable host tagset John Garry
@ 2020-07-14 13:16   ` John Garry
  2020-07-14 13:31     ` John Garry
  2020-07-14 14:02     ` Hannes Reinecke
  0 siblings, 2 replies; 123+ messages in thread
From: John Garry @ 2020-07-14 13:16 UTC (permalink / raw)
  To: don.brace, hare
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, Hannes Reinecke

Hi Hannes,

>   static struct pqi_io_request *pqi_alloc_io_request(
> -	struct pqi_ctrl_info *ctrl_info)
> +	struct pqi_ctrl_info *ctrl_info, struct scsi_cmnd *scmd)
>   {
>   	struct pqi_io_request *io_request;
> +	unsigned int limit = PQI_RESERVED_IO_SLOTS;
>   	u16 i = ctrl_info->next_io_request_slot;	/* benignly racy */
>   
> -	while (1) {
> +	if (scmd) {
> +		u32 blk_tag = blk_mq_unique_tag(scmd->request);
> +
> +		i = blk_mq_unique_tag_to_tag(blk_tag) + limit;
>   		io_request = &ctrl_info->io_request_pool[i];

This looks ok

> -		if (atomic_inc_return(&io_request->refcount) == 1)
> -			break;
> -		atomic_dec(&io_request->refcount);
> -		i = (i + 1) % ctrl_info->max_io_slots;
> +		if (WARN_ON(atomic_inc_return(&io_request->refcount) > 1)) {
> +			atomic_dec(&io_request->refcount);
> +			return NULL;
> +		}
> +	} else {
> +		while (1) {
> +			io_request = &ctrl_info->io_request_pool[i];
> +			if (atomic_inc_return(&io_request->refcount) == 1)
> +				break;
> +			atomic_dec(&io_request->refcount);
> +			i = (i + 1) % limit;

To me, the range we use here looks incorrect. I would assume we should 
restrict range to [max_io_slots - PQI_RESERVED_IO_SLOTS, max_io_slots).

But then your reserved commands support would solve that.

> +		}
>   	}
>   

Thanks,
John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 11/12] smartpqi: enable host tagset
  2020-07-14 13:16   ` John Garry
@ 2020-07-14 13:31     ` John Garry
  2020-07-14 18:16       ` Don.Brace
  2020-07-14 14:02     ` Hannes Reinecke
  1 sibling, 1 reply; 123+ messages in thread
From: John Garry @ 2020-07-14 13:31 UTC (permalink / raw)
  To: don.brace, hare
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, Hannes Reinecke

On 14/07/2020 14:16, John Garry wrote:
> Hi Hannes,
> 
>>   static struct pqi_io_request *pqi_alloc_io_request(
>> -    struct pqi_ctrl_info *ctrl_info)
>> +    struct pqi_ctrl_info *ctrl_info, struct scsi_cmnd *scmd)
>>   {
>>       struct pqi_io_request *io_request;
>> +    unsigned int limit = PQI_RESERVED_IO_SLOTS;
>>       u16 i = ctrl_info->next_io_request_slot;    /* benignly racy */
>> -    while (1) {
>> +    if (scmd) {
>> +        u32 blk_tag = blk_mq_unique_tag(scmd->request);
>> +
>> +        i = blk_mq_unique_tag_to_tag(blk_tag) + limit;
>>           io_request = &ctrl_info->io_request_pool[i];
> 
> This looks ok
> 
>> -        if (atomic_inc_return(&io_request->refcount) == 1)
>> -            break;
>> -        atomic_dec(&io_request->refcount);
>> -        i = (i + 1) % ctrl_info->max_io_slots;
>> +        if (WARN_ON(atomic_inc_return(&io_request->refcount) > 1)) {
>> +            atomic_dec(&io_request->refcount);
>> +            return NULL;
>> +        }
>> +    } else {
>> +        while (1) {
>> +            io_request = &ctrl_info->io_request_pool[i];
>> +            if (atomic_inc_return(&io_request->refcount) == 1)
>> +                break;
>> +            atomic_dec(&io_request->refcount);
>> +            i = (i + 1) % limit;
> 
> To me, the range we use here looks incorrect. I would assume we should 
> restrict range to [max_io_slots - PQI_RESERVED_IO_SLOTS, max_io_slots).
> 

Sorry, I missed that you use i = blk_mq_unique_tag_to_tag(blk_tag) +
limit for regular commands.

But we still set next_io_request_slot for regular commands, and may then
read and use it for a reserved command.
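
A minimal sketch of that concern, assuming the hint can be left pointing
into the regular range. RESERVED_SLOTS and both helper names are
hypothetical and only model the index arithmetic:

```c
#include <assert.h>

#define RESERVED_SLOTS 4	/* hypothetical, stands in for PQI_RESERVED_IO_SLOTS */

/* A slot index is in the reserved range iff it is below the limit. */
int slot_is_reserved(unsigned int slot)
{
	return slot < RESERVED_SLOTS;
}

/*
 * Fold a possibly stale hint into the reserved range before the first
 * access, so the scan never starts on a regular-command slot.
 */
unsigned int reserved_scan_start(unsigned int hint)
{
	unsigned int start = hint % RESERVED_SLOTS;

	assert(slot_is_reserved(start));
	return start;
}
```

In the quoted loop, the wrap with "% limit" is applied only when advancing
to the next slot, so if the hint is used unmodified the first slot examined
would be whatever the hint says.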

Thanks

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 11/12] smartpqi: enable host tagset
  2020-07-14 13:16   ` John Garry
  2020-07-14 13:31     ` John Garry
@ 2020-07-14 14:02     ` Hannes Reinecke
  2020-08-18  8:33       ` John Garry
  1 sibling, 1 reply; 123+ messages in thread
From: Hannes Reinecke @ 2020-07-14 14:02 UTC (permalink / raw)
  To: John Garry, don.brace, hare
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 7/14/20 3:16 PM, John Garry wrote:
> Hi Hannes,
> 
>>   static struct pqi_io_request *pqi_alloc_io_request(
>> -    struct pqi_ctrl_info *ctrl_info)
>> +    struct pqi_ctrl_info *ctrl_info, struct scsi_cmnd *scmd)
>>   {
>>       struct pqi_io_request *io_request;
>> +    unsigned int limit = PQI_RESERVED_IO_SLOTS;
>>       u16 i = ctrl_info->next_io_request_slot;    /* benignly racy */
>>   -    while (1) {
>> +    if (scmd) {
>> +        u32 blk_tag = blk_mq_unique_tag(scmd->request);
>> +
>> +        i = blk_mq_unique_tag_to_tag(blk_tag) + limit;
>>           io_request = &ctrl_info->io_request_pool[i];
> 
> This looks ok
> 
>> -        if (atomic_inc_return(&io_request->refcount) == 1)
>> -            break;
>> -        atomic_dec(&io_request->refcount);
>> -        i = (i + 1) % ctrl_info->max_io_slots;
>> +        if (WARN_ON(atomic_inc_return(&io_request->refcount) > 1)) {
>> +            atomic_dec(&io_request->refcount);
>> +            return NULL;
>> +        }
>> +    } else {
>> +        while (1) {
>> +            io_request = &ctrl_info->io_request_pool[i];
>> +            if (atomic_inc_return(&io_request->refcount) == 1)
>> +                break;
>> +            atomic_dec(&io_request->refcount);
>> +            i = (i + 1) % limit;
> 
> To me, the range we use here looks incorrect. I would assume we should
> restrict range to [max_io_slots - PQI_RESERVED_IO_SLOTS, max_io_slots).
> 
> But then your reserved commands support would solve that.
> 
This branch of the 'if' condition will only be taken for internal
commands, for which we only allow up to PQI_RESERVED_IO_SLOTS.
And we set the 'normal' I/O commands above at an offset, so we're fine here.
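
As a userspace sketch of that layout: the unique tag packs the hardware
queue number into the upper 16 bits and the per-queue tag into the lower 16
bits, and regular commands land above the reserved slots. The helper names
and RESERVED_SLOTS are illustrative stand-ins, not the kernel functions
themselves.

```c
#include <assert.h>
#include <stdint.h>

#define UNIQUE_TAG_BITS	16	/* mirrors BLK_MQ_UNIQUE_TAG_BITS */
#define UNIQUE_TAG_MASK	((1U << UNIQUE_TAG_BITS) - 1)
#define RESERVED_SLOTS	4	/* hypothetical, stands in for PQI_RESERVED_IO_SLOTS */

/* Pack hardware queue number and tag the way blk_mq_unique_tag() does. */
uint32_t unique_tag(uint16_t hwq, uint16_t tag)
{
	return ((uint32_t)hwq << UNIQUE_TAG_BITS) | tag;
}

uint16_t unique_tag_to_hwq(uint32_t utag)
{
	return (uint16_t)(utag >> UNIQUE_TAG_BITS);
}

uint16_t unique_tag_to_tag(uint32_t utag)
{
	return (uint16_t)(utag & UNIQUE_TAG_MASK);
}

/* Regular commands: pool slot is the tag shifted past the reserved range. */
unsigned int regular_slot(uint32_t utag)
{
	unsigned int slot = unique_tag_to_tag(utag) + RESERVED_SLOTS;

	assert(slot >= RESERVED_SLOTS);
	return slot;
}
```

Slots [0, RESERVED_SLOTS) are then never produced by regular_slot(), which
is the separation described above.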

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 11/12] smartpqi: enable host tagset
  2020-07-14 13:31     ` John Garry
@ 2020-07-14 18:16       ` Don.Brace
  2020-07-15  7:28         ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Don.Brace @ 2020-07-14 18:16 UTC (permalink / raw)
  To: john.garry, don.brace, hare
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, hare

Subject: Re: [PATCH RFC v7 11/12] smartpqi: enable host tagset


Neither the driver author nor I want to change how commands are processed in the smartpqi driver.

Nak.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 11/12] smartpqi: enable host tagset
  2020-07-14 18:16       ` Don.Brace
@ 2020-07-15  7:28         ` John Garry
  0 siblings, 0 replies; 123+ messages in thread
From: John Garry @ 2020-07-15  7:28 UTC (permalink / raw)
  To: Don.Brace, don.brace, hare
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, hare

On 14/07/2020 19:16, Don.Brace@microchip.com wrote:
> Subject: Re: [PATCH RFC v7 11/12] smartpqi: enable host tagset
> 
> 
> Neither the driver author nor I want to change how commands are processed in the smartpqi driver.
> 
> Nak.
> 

ok, your call. But it still seems that the driver should set the 
host_tagset flag.

Anyway, can you also let us know your stance on the hpsa change in this 
series?

https://lore.kernel.org/linux-scsi/1591810159-240929-1-git-send-email-john.garry@huawei.com/T/#m1057b8a8c23a9bf558a048817a1cda4a576291b2

thanks



^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-14  7:37   ` John Garry
  2020-07-14  7:41     ` Hannes Reinecke
@ 2020-07-16 16:14     ` Don.Brace
  2020-07-16 19:45     ` Don.Brace
  2 siblings, 0 replies; 123+ messages in thread
From: Don.Brace @ 2020-07-16 16:14 UTC (permalink / raw)
  To: john.garry, don.brace, hare
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

Subject: Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ

On 10/06/2020 18:29, John Garry wrote:
> From: Hannes Reinecke <hare@suse.de>
>
> The smart array HBAs can steer interrupt completion, so this patch 
> switches the implementation to use multiqueue and enables 
> 'host_tagset' as the HBA has a shared host-wide tagset.
>

>> Hi Don,
>>
>> I am preparing the next iteration of this series, and we're getting close
>> to dropping the RFC tags. The series has grown a bit, and I am not sure
>> what to do with hpsa support.
>>
>> The latest versions of this series have not been tested for hpsa, AFAIK.
>> Can you let me know if you can test and review this patch? Or someone
>> else let me know it's tested (Hannes?)
>>
>> Thanks

Yes, I'll run my testing soon.
Don

> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/scsi/hpsa.c | 44 +++++++-------------------------------------
>   drivers/scsi/hpsa.h |  1 -
>   2 files changed, 7 insertions(+), 38 deletions(-)
>
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 1e9302e99d05..f807f9bdae85 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -980,6 +980,7 @@ static struct scsi_host_template hpsa_driver_template = {
>       .shost_attrs = hpsa_shost_attrs,
>       .max_sectors = 2048,
>       .no_write_same = 1,
> +     .host_tagset = 1,
>   };
>
>   static inline u32 next_command(struct ctlr_info *h, u8 q)
> @@ -1144,12 +1145,14 @@ static void dial_up_lockup_detection_on_fw_flash_complete(struct ctlr_info *h,
>   static void __enqueue_cmd_and_start_io(struct ctlr_info *h,
>       struct CommandList *c, int reply_queue)
>   {
> +     u32 blk_tag = blk_mq_unique_tag(c->scsi_cmd->request);
> +
>       dial_down_lockup_detection_during_fw_flash(h, c);
>       atomic_inc(&h->commands_outstanding);
>       if (c->device)
>               atomic_inc(&c->device->commands_outstanding);
>
> -     reply_queue = h->reply_map[raw_smp_processor_id()];
> +     reply_queue = blk_mq_unique_tag_to_hwq(blk_tag);
>       switch (c->cmd_type) {
>       case CMD_IOACCEL1:
>               set_ioaccel1_performant_mode(h, c, reply_queue);
> @@ -5653,8 +5656,6 @@ static int hpsa_scsi_queue_command(struct Scsi_Host *sh, struct scsi_cmnd *cmd)
>       /* Get the ptr to our adapter structure out of cmd->host. */
>       h = sdev_to_hba(cmd->device);
>
> -     BUG_ON(cmd->request->tag < 0);
> -
>       dev = cmd->device->hostdata;
>       if (!dev) {
>               cmd->result = DID_NO_CONNECT << 16;
> @@ -5830,7 +5831,7 @@ static int hpsa_scsi_host_alloc(struct ctlr_info *h)
>       sh->hostdata[0] = (unsigned long) h;
>       sh->irq = pci_irq_vector(h->pdev, 0);
>       sh->unique_id = sh->irq;
> -
> +     sh->nr_hw_queues = h->msix_vectors > 0 ? h->msix_vectors : 1;
>       h->scsi_host = sh;
>       return 0;
>   }
> @@ -5856,7 +5857,8 @@ static int hpsa_scsi_add_host(struct ctlr_info *h)
>    */
>   static int hpsa_get_cmd_index(struct scsi_cmnd *scmd)
>   {
> -     int idx = scmd->request->tag;
> +     u32 blk_tag = blk_mq_unique_tag(scmd->request);
> +     int idx = blk_mq_unique_tag_to_tag(blk_tag);
>
>       if (idx < 0)
>               return idx;
> @@ -7456,26 +7458,6 @@ static void hpsa_disable_interrupt_mode(struct ctlr_info *h)
>       h->msix_vectors = 0;
>   }
>
> -static void hpsa_setup_reply_map(struct ctlr_info *h)
> -{
> -     const struct cpumask *mask;
> -     unsigned int queue, cpu;
> -
> -     for (queue = 0; queue < h->msix_vectors; queue++) {
> -             mask = pci_irq_get_affinity(h->pdev, queue);
> -             if (!mask)
> -                     goto fallback;
> -
> -             for_each_cpu(cpu, mask)
> -                     h->reply_map[cpu] = queue;
> -     }
> -     return;
> -
> -fallback:
> -     for_each_possible_cpu(cpu)
> -             h->reply_map[cpu] = 0;
> -}
> -
>   /* If MSI/MSI-X is supported by the kernel we will try to enable it on
>    * controllers that are capable. If not, we use legacy INTx mode.
>    */
> @@ -7872,9 +7854,6 @@ static int hpsa_pci_init(struct ctlr_info *h)
>       if (err)
>               goto clean1;
>
> -     /* setup mapping between CPU and reply queue */
> -     hpsa_setup_reply_map(h);
> -
>       err = hpsa_pci_find_memory_BAR(h->pdev, &h->paddr);
>       if (err)
>               goto clean2;    /* intmode+region, pci */
> @@ -8613,7 +8592,6 @@ static struct workqueue_struct *hpsa_create_controller_wq(struct ctlr_info *h,
>
>   static void hpda_free_ctlr_info(struct ctlr_info *h)
>   {
> -     kfree(h->reply_map);
>       kfree(h);
>   }
>
> @@ -8622,14 +8600,6 @@ static struct ctlr_info *hpda_alloc_ctlr_info(void)
>       struct ctlr_info *h;
>
>       h = kzalloc(sizeof(*h), GFP_KERNEL);
> -     if (!h)
> -             return NULL;
> -
> -     h->reply_map = kcalloc(nr_cpu_ids, sizeof(*h->reply_map), GFP_KERNEL);
> -     if (!h->reply_map) {
> -             kfree(h);
> -             return NULL;
> -     }
>       return h;
>   }
>
> diff --git a/drivers/scsi/hpsa.h b/drivers/scsi/hpsa.h
> index f8c88fc7b80a..ea4a609e3eb7 100644
> --- a/drivers/scsi/hpsa.h
> +++ b/drivers/scsi/hpsa.h
> @@ -161,7 +161,6 @@ struct bmic_controller_parameters {
>   #pragma pack()
>
>   struct ctlr_info {
> -     unsigned int *reply_map;
>       int     ctlr;
>       char    devname[8];
>       char    *product_name;
>


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-14  7:37   ` John Garry
  2020-07-14  7:41     ` Hannes Reinecke
  2020-07-16 16:14     ` Don.Brace
@ 2020-07-16 19:45     ` Don.Brace
  2020-07-17 10:11       ` John Garry
  2 siblings, 1 reply; 123+ messages in thread
From: Don.Brace @ 2020-07-16 19:45 UTC (permalink / raw)
  To: john.garry, don.brace, hare
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 10/06/2020 18:29, John Garry wrote:
> From: Hannes Reinecke <hare@suse.de>
>
> The smart array HBAs can steer interrupt completion, so this patch 
> switches the implementation to use multiqueue and enables 
> 'host_tagset' as the HBA has a shared host-wide tagset.
>

>> Hi Don,
>>
>> I am preparing the next iteration of this series, and we're getting close
>> to dropping the RFC tags. The series has grown a bit, and I am not sure
>> what to do with hpsa support.
>>
>> The latest versions of this series have not been tested for hpsa, AFAIK.
>> Or someone else let me know it's tested (Hannes?)
>>
>> Thanks

John,

I cloned:
https://github.com/hisilicon/kernel-dev
switched to branch: origin/private-topic-blk-mq-shared-tags-rfc-v8

And built the kernel. The hpsa driver oopsed on load. It was attempting to
issue driver-initiated commands, so some reserved tags would need to be set
aside for communicating with the controller.

Was I supposed to add this patch on top of Hannes's hpsa patches?

The patches in the indicated series seem to be included in the branch.
----
[   13.340977] HP HPSA Driver (v 3.4.20-170)
[   13.374719] usbcore: registered new interface driver usb-storage
[   13.379626] hpsa 0000:0d:00.0: can't disable ASPM; OS doesn't have ASPM control
[   13.473790] scsi host0: scsi_eh_0: sleeping
[   13.475191] scsi host0: uas
[   13.487978] hpsa 0000:0d:00.0: Logical aborts not supported
[   13.498111] tg3 0000:02:00.0 eth0: Tigon3 [partno(N/A) rev 5719001] (PCI Express) MAC address d4:c9:ef:ce:0a:c4
[   13.498113] tg3 0000:02:00.0 eth0: attached PHY is 5719C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[   13.498116] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[   13.498117] tg3 0000:02:00.0 eth0: dma_rwctrl[00000001] dma_mask[64-bit]
[   13.499661] scsi host1: scsi_eh_1: sleeping
[   13.522611] tg3 0000:02:00.1 eth1: Tigon3 [partno(N/A) rev 5719001] (PCI Express) MAC address d4:c9:ef:ce:0a:c5
[   13.541519] usbcore: registered new interface driver uas
[   13.549508] scsi 0:0:0:0: Direct-Access     ASMT     2105             0    PQ: 0 ANSI: 6
[   13.576119] tg3 0000:02:00.1 eth1: attached PHY is 5719C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[   14.046521] tg3 0000:02:00.1 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[   14.046524] tg3 0000:02:00.1 eth1: dma_rwctrl[00000001] dma_mask[64-bit]
[   14.047114] BUG: kernel NULL pointer dereference, address: 0000000000000010
[   14.193228] #PF: supervisor read access in kernel mode
[   14.193228] #PF: error_code(0x0000) - not-present page
[   14.193229] PGD 0 P4D 0 
[   14.193232] Oops: 0000 [#1] SMP PTI
[   14.193235] CPU: 0 PID: 495 Comm: kworker/0:8 Not tainted 5.8.0-rc1-host-tagset+ #3
[   14.193236] Hardware name: HP ProLiant ML350 Gen9/ProLiant ML350 Gen9, BIOS P92 10/17/2018
[   14.193243] Workqueue: events work_for_cpu_fn
[   14.470674] RIP: 0010:blk_mq_unique_tag+0x5/0x20
[   14.470676] Code: cd 0f 1f 40 00 0f 1f 44 00 00 8b 87 cc 00 00 00 83 f8 02 75 03 83 06 01 b8 01 00 00 00 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 <48> 8b 47 10 0f b7 57 20 8b 80 94 01 00 00 c1 e0 10 09 d0 c3 0f 1f
[   14.470677] RSP: 0000:ffff989f42893d08 EFLAGS: 00010246
[   14.470680] RAX: ffffffffc0493f80 RBX: ffff8ab761c00000 RCX: 0000000000000000
[   14.717021] RDX: ffff8ab9b7600000 RSI: ffff8ab761c00000 RDI: 0000000000000000
[   14.717021] RBP: ffff8ab9a5b98000 R08: ffffffffffffffff R09: 0000000000000000
[   14.717022] R10: ffff8ab8b5280070 R11: 0000000000000000 R12: 000000000000000a
[   14.717022] R13: 0000000000000002 R14: ffff8ab761c00000 R15: ffffffffc0493b60
[   14.717023] FS:  0000000000000000(0000) GS:ffff8ab9b7600000(0000) knlGS:0000000000000000
[   14.717024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.717025] CR2: 0000000000000010 CR3: 000000024f33e006 CR4: 00000000001606f0
[   14.717025] Call Trace:
[   14.717034]  __enqueue_cmd_and_start_io.isra.60+0x20/0x170 [hpsa]
[   14.717039]  hpsa_scsi_do_simple_cmd.isra.62+0x6b/0xd0 [hpsa]
[   14.717042]  hpsa_scsi_do_simple_cmd_with_retry+0x63/0x160 [hpsa]
[   14.717045]  hpsa_scsi_do_inquiry+0x62/0xc0 [hpsa]
[   14.717048]  hpsa_init_one+0x1167/0x1400 [hpsa]
[   14.717052]  local_pci_probe+0x42/0x80
[   14.717054]  work_for_cpu_fn+0x16/0x20
[   14.717057]  process_one_work+0x1a7/0x370
[   14.717059]  ? process_one_work+0x370/0x370
[   14.717061]  worker_thread+0x1c9/0x370
[   14.717062]  ? process_one_work+0x370/0x370
[   14.717064]  kthread+0x114/0x130
[   14.717065]  ? kthread_park+0x80/0x80
[   14.717068]  ret_from_fork+0x22/0x30
[   14.717070] Modules linked in: crc32c_intel libahci(+) uas tg3(+) libata usb_storage i2c_algo_bit hpsa(+) scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod
[   14.717077] CR2: 0000000000000010
[   14.717099] ---[ end trace 3845f459e9223caa ]---
[   14.724750] ERST: [Firmware Warn]: Firmware does not respond in time.
[   14.724753] RIP: 0010:blk_mq_unique_tag+0x5/0x20
[   14.724754] Code: cd 0f 1f 40 00 0f 1f 44 00 00 8b 87 cc 00 00 00 83 f8 02 75 03 83 06 01 b8 01 00 00 00 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 <48> 8b 47 10 0f b7 57 20 8b 80 94 01 00 00 c1 e0 10 09 d0 c3 0f 1f
[   14.724755] RSP: 0000:ffff989f42893d08 EFLAGS: 00010246
[   14.724756] RAX: ffffffffc0493f80 RBX: ffff8ab761c00000 RCX: 0000000000000000
[   14.724757] RDX: ffff8ab9b7600000 RSI: ffff8ab761c00000 RDI: 0000000000000000
[   14.724757] RBP: ffff8ab9a5b98000 R08: ffffffffffffffff R09: 0000000000000000
[   14.724758] R10: ffff8ab8b5280070 R11: 0000000000000000 R12: 000000000000000a
[   14.724758] R13: 0000000000000002 R14: ffff8ab761c00000 R15: ffffffffc0493b60
[   14.724759] FS:  0000000000000000(0000) GS:ffff8ab9b7600000(0000) knlGS:0000000000000000
[   14.724760] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.724760] CR2: 0000000000000010 CR3: 000000024f33e006 CR4: 00000000001606f0
[   14.724761] Kernel panic - not syncing: Fatal exception
[   14.724833] Kernel Offset: 0x38400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   16.487017] ---[ end Kernel panic - not syncing: Fatal exception ]---
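
Reading the trace above: RDI is 0 on entry to blk_mq_unique_tag() and the
faulting address is 0x10, which is consistent with a NULL request pointer
being passed for a driver-initiated command. A minimal userspace sketch of
that distinction follows. The struct names, pick_reply_queue() and the
fallback to reply queue 0 are all assumptions made for this example; the
fallback in particular is not what the series does and is not a substitute
for reserved tags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct fake_request {		/* hypothetical stand-in for struct request */
	uint16_t hwq;		/* hardware queue the request was issued on */
	uint16_t tag;
};

struct fake_cmd {		/* hypothetical stand-in for a driver command */
	struct fake_request *rq;	/* NULL for driver-initiated commands */
};

#define FALLBACK_REPLY_QUEUE 0	/* assumption for this sketch only */

/*
 * Only commands that came from the block layer carry a request, so only
 * those can derive a reply queue from it.
 */
unsigned int pick_reply_queue(const struct fake_cmd *c)
{
	assert(c != NULL);
	if (!c->rq)		/* no block-layer request, nothing to decode */
		return FALLBACK_REPLY_QUEUE;
	return c->rq->hwq;
}
```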



> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>   drivers/scsi/hpsa.c | 44 +++++++-------------------------------------
>   drivers/scsi/hpsa.h |  1 -
>   2 files changed, 7 insertions(+), 38 deletions(-)
>
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 1e9302e99d05..f807f9bdae85 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -980,6 +980,7 @@ static struct scsi_host_template hpsa_driver_template = {
>       .shost_attrs = hpsa_shost_attrs,
>       .max_sectors = 2048,
>       .no_write_same = 1,
> +     .host_tagset = 1,
>   };
>
>   static inline u32 next_command(struct ctlr_info *h, u8 q)
> @@ -1144,12 +1145,14 @@ static void dial_up_lockup_detection_on_fw_flash_complete(struct ctlr_info *h,
>   static void __enqueue_cmd_and_start_io(struct ctlr_info *h,
>       struct CommandList *c, int reply_queue)
>   {
> +     u32 blk_tag = blk_mq_unique_tag(c->scsi_cmd->request);
> +
>       dial_down_lockup_detection_during_fw_flash(h, c);
>       atomic_inc(&h->commands_outstanding);
>       if (c->device)
>               atomic_inc(&c->device->commands_outstanding);
>
> -     reply_queue = h->reply_map[raw_smp_processor_id()];
> +     reply_queue = blk_mq_unique_tag_to_hwq(blk_tag);
>       switch (c->cmd_type) {
>       case CMD_IOACCEL1:
>               set_ioaccel1_performant_mode(h, c, reply_queue);
> @@ -5653,8 +5656,6 @@ static int hpsa_scsi_queue_command(struct Scsi_Host *sh, struct scsi_cmnd *cmd)
>       /* Get the ptr to our adapter structure out of cmd->host. */
>       h = sdev_to_hba(cmd->device);
>
> -     BUG_ON(cmd->request->tag < 0);
> -
>       dev = cmd->device->hostdata;
>       if (!dev) {
>               cmd->result = DID_NO_CONNECT << 16;
> @@ -5830,7 +5831,7 @@ static int hpsa_scsi_host_alloc(struct ctlr_info *h)
>       sh->hostdata[0] = (unsigned long) h;
>       sh->irq = pci_irq_vector(h->pdev, 0);
>       sh->unique_id = sh->irq;
> -
> +     sh->nr_hw_queues = h->msix_vectors > 0 ? h->msix_vectors : 1;
>       h->scsi_host = sh;
>       return 0;
>   }
> @@ -5856,7 +5857,8 @@ static int hpsa_scsi_add_host(struct ctlr_info *h)
>    */
>   static int hpsa_get_cmd_index(struct scsi_cmnd *scmd)
>   {
> -     int idx = scmd->request->tag;
> +     u32 blk_tag = blk_mq_unique_tag(scmd->request);
> +     int idx = blk_mq_unique_tag_to_tag(blk_tag);
>
>       if (idx < 0)
>               return idx;
> @@ -7456,26 +7458,6 @@ static void hpsa_disable_interrupt_mode(struct ctlr_info *h)
>       h->msix_vectors = 0;
>   }
>
> -static void hpsa_setup_reply_map(struct ctlr_info *h)
> -{
> -     const struct cpumask *mask;
> -     unsigned int queue, cpu;
> -
> -     for (queue = 0; queue < h->msix_vectors; queue++) {
> -             mask = pci_irq_get_affinity(h->pdev, queue);
> -             if (!mask)
> -                     goto fallback;
> -
> -             for_each_cpu(cpu, mask)
> -                     h->reply_map[cpu] = queue;
> -     }
> -     return;
> -
> -fallback:
> -     for_each_possible_cpu(cpu)
> -             h->reply_map[cpu] = 0;
> -}
> -
>   /* If MSI/MSI-X is supported by the kernel we will try to enable it on
>    * controllers that are capable. If not, we use legacy INTx mode.
>    */
> @@ -7872,9 +7854,6 @@ static int hpsa_pci_init(struct ctlr_info *h)
>       if (err)
>               goto clean1;
>
> -     /* setup mapping between CPU and reply queue */
> -     hpsa_setup_reply_map(h);
> -
>       err = hpsa_pci_find_memory_BAR(h->pdev, &h->paddr);
>       if (err)
>               goto clean2;    /* intmode+region, pci */
> @@ -8613,7 +8592,6 @@ static struct workqueue_struct *hpsa_create_controller_wq(struct ctlr_info *h,
>
>   static void hpda_free_ctlr_info(struct ctlr_info *h)
>   {
> -     kfree(h->reply_map);
>       kfree(h);
>   }
>
> @@ -8622,14 +8600,6 @@ static struct ctlr_info *hpda_alloc_ctlr_info(void)
>       struct ctlr_info *h;
>
>       h = kzalloc(sizeof(*h), GFP_KERNEL);
> -     if (!h)
> -             return NULL;
> -
> -     h->reply_map = kcalloc(nr_cpu_ids, sizeof(*h->reply_map), GFP_KERNEL);
> -     if (!h->reply_map) {
> -             kfree(h);
> -             return NULL;
> -     }
>       return h;
>   }
>
> diff --git a/drivers/scsi/hpsa.h b/drivers/scsi/hpsa.h
> index f8c88fc7b80a..ea4a609e3eb7 100644
> --- a/drivers/scsi/hpsa.h
> +++ b/drivers/scsi/hpsa.h
> @@ -161,7 +161,6 @@ struct bmic_controller_parameters {
>   #pragma pack()
>
>   struct ctlr_info {
> -     unsigned int *reply_map;
>       int     ctlr;
>       char    devname[8];
>       char    *product_name;
>


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-16 19:45     ` Don.Brace
@ 2020-07-17 10:11       ` John Garry
  0 siblings, 0 replies; 123+ messages in thread
From: John Garry @ 2020-07-17 10:11 UTC (permalink / raw)
  To: Don.Brace, don.brace, hare
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

Hi Don,

Thanks for checking this.

> I cloned:
> https://github.com/hisilicon/kernel-dev
> switched to branch: origin/private-topic-blk-mq-shared-tags-rfc-v8

I would have suggested using v7 for now, but that does not look relevant here.

> 
> And built the kernel. The hpsa driver oopsed on load. It was attempting to do driver initiated commands, so there would need to be some reserved tags set aside to communicate with the controller.
> 
> Was I supposed to add this patch on top of Hannes's hpsa patches?

I didn't think so, but I now realize that it may be necessary here - 
please see below. And since Hannes's reserved commands work is not 
merged, I do not include it.

> [   14.717025] Call Trace:
> [   14.717034]  __enqueue_cmd_and_start_io.isra.60+0x20/0x170 [hpsa]
> [   14.717039]  hpsa_scsi_do_simple_cmd.isra.62+0x6b/0xd0 [hpsa]
> [   14.717042]  hpsa_scsi_do_simple_cmd_with_retry+0x63/0x160 [hpsa]
> [   14.717045]  hpsa_scsi_do_inquiry+0x62/0xc0 [hpsa]
> [   14.717048]  hpsa_init_one+0x1167/0x1400 [hpsa]
> [   14.717052]  local_pci_probe+0x42/0x80
> [   14.717054]  work_for_cpu_fn+0x16/0x20
> [   14.717057]  process_one_work+0x1a7/0x370
> [   14.717059]  ? process_one_work+0x370/0x370
> [   14.717061]  worker_thread+0x1c9/0x370
> [   14.717062]  ? process_one_work+0x370/0x370
> [   14.717064]  kthread+0x114/0x130
> [   14.717065]  ? kthread_park+0x80/0x80
> [   14.717068]  ret_from_fork+0x22/0x30
> [   14.717070] Modules linked in: crc32c_intel libahci(+) uas tg3(+) libata usb_storage i2c_algo_bit hpsa(+) scsi_transport_sas wmi dm_mirror dm_region_hash dm_log dm_mod
> [   14.717077] CR2: 0000000000000010
> [   14.717099] ---[ end trace 3845f459e9223caa ]---
> [   14.724750] ERST: [Firmware Warn]: Firmware does not respond in time.
> [   14.724753] RIP: 0010:blk_mq_unique_tag+0x5/0x20
> [   14.724754] Code: cd 0f 1f 40 00 0f 1f 44 00 00 8b 87 cc 00 00 00 83 f8 02 75 03 83 06 01 b8 01 00 00 00 c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 <48> 8b 47 10 0f b7 57 20 8b 80 94 01 00 00 c1 e0 10 09 d0 c3 0f 1f
> [   14.724755] RSP: 0000:ffff989f42893d08 EFLAGS: 00010246
> [   14.724756] RAX: ffffffffc0493f80 RBX: ffff8ab761c00000 RCX: 0000000000000000
> [   14.724757] RDX: ffff8ab9b7600000 RSI: ffff8ab761c00000 RDI: 0000000000000000
> [   14.724757] RBP: ffff8ab9a5b98000 R08: ffffffffffffffff R09: 0000000000000000
> [   14.724758] R10: ffff8ab8b5280070 R11: 0000000000000000 R12: 000000000000000a
> [   14.724758] R13: 0000000000000002 R14: ffff8ab761c00000 R15: ffffffffc0493b60
> [   14.724759] FS:  0000000000000000(0000) GS:ffff8ab9b7600000(0000) knlGS:0000000000000000
> [   14.724760] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   14.724760] CR2: 0000000000000010 CR3: 000000024f33e006 CR4: 00000000001606f0
> [   14.724761] Kernel panic - not syncing: Fatal exception
> [   14.724833] Kernel Offset: 0x38400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [   16.487017] ---[ end Kernel panic - not syncing: Fatal exception ]---
> 
> 
> 
>> Signed-off-by: Hannes Reinecke<hare@suse.de>
>> ---
>>    drivers/scsi/hpsa.c | 44 +++++++-------------------------------------
>>    drivers/scsi/hpsa.h |  1 -
>>    2 files changed, 7 insertions(+), 38 deletions(-)
>>
>> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
>> index 1e9302e99d05..f807f9bdae85 100644
>> --- a/drivers/scsi/hpsa.c
>> +++ b/drivers/scsi/hpsa.c
>> @@ -980,6 +980,7 @@ static struct scsi_host_template hpsa_driver_template = {
>>        .shost_attrs = hpsa_shost_attrs,
>>        .max_sectors = 2048,
>>        .no_write_same = 1,
>> +     .host_tagset = 1,
>>    };
>>
>>    static inline u32 next_command(struct ctlr_info *h, u8 q)
>> @@ -1144,12 +1145,14 @@ static void dial_up_lockup_detection_on_fw_flash_complete(struct ctlr_info *h,
>>    static void __enqueue_cmd_and_start_io(struct ctlr_info *h,
>>        struct CommandList *c, int reply_queue)
>>    {
>> +     u32 blk_tag = blk_mq_unique_tag(c->scsi_cmd->request);

For the hpsa_scsi_do_inquiry() -> fill_cmd(HPSA_INQUIRY) call, 
c->scsi_cmd = SCSI_CMD_BUSY, which just seems to be a pointer flag.

And so I guess that c->scsi_cmd->request == NULL, and we dereference this 
in blk_mq_unique_tag() -> oops. I figure that the code should look like 
this for now:

static void __enqueue_cmd_and_start_io(struct ctlr_info *h,
	struct CommandList *c, int reply_queue)
{
	if (c->scsi_cmd->request) {
		u32 blk_tag = blk_mq_unique_tag(c->scsi_cmd->request);

		reply_queue = blk_mq_unique_tag_to_hwq(blk_tag);
	}

	dial_down_lockup_detection_during_fw_flash(h, c);
	atomic_inc(&h->commands_outstanding);
	if (c->device)
		atomic_inc(&c->device->commands_outstanding);

	switch (c->cmd_type) {

But then reply_queue may be = DEFAULT_REPLY_QUEUE (=1), so I am not sure 
if that is a problem. However this issue should go away with Hannes's 
reserved command work, as we allocate a "real" SCSI cmd there.

@Hannes, any suggestion what to do here?

>> +
>>        dial_down_lockup_detection_during_fw_flash(h, c);
>>        atomic_inc(&h->commands_outstanding);
>>        if (c->device)
>>                atomic_inc(&c->device->commands_outstanding);
>>
>> -     reply_queue = h->reply_map[raw_smp_processor_id()];
>> +     reply_queue = blk_mq_unique_tag_to_hwq(blk_tag);
>>        switch (c->cmd_type) {
>>        case CMD_IOACCEL1:
>>                set_ioaccel1_performant_mode(h, c, reply_queue);
>> @@ -5653,8 +5656,6 @@ static int hpsa_scsi_queue_command(struct Scsi_Host *sh, struct scsi_cmnd *cmd)
>>        /* Get the ptr to our adapter structure out of cmd->host. */
>>        h = sdev_to_hba(cmd->device);
>>
>> -     BUG_ON(cmd->request->tag < 0);
>> -
>>        dev = cmd->device->hostdata;
>>        if (!dev) {
>>                cmd->result = DID_NO_CONNECT << 16;
>> @@ -5830,7 +5831,7 @@ static int hpsa_scsi_host_alloc(struct ctlr_info *h)
>>        sh->hostdata[0] = (unsigned long) h;
>>        sh->irq = pci_irq_vector(h->pdev, 0);
>>        sh->unique_id = sh->irq;
>> -
>> +     sh->nr_hw_queues = h->msix_vectors > 0 ? h->msix_vectors : 1;
>>        h->scsi_host = sh;
>>        return 0;
>>    }
>> @@ -5856,7 +5857,8 @@ static int hpsa_scsi_add_host(struct ctlr_info *h)
>>     */
>>    static int hpsa_get_cmd_index(struct scsi_cmnd *scmd)
>>    {
>> -     int idx = scmd->request->tag;
>> +     u32 blk_tag = blk_mq_unique_tag(scmd->request);
>> +     int idx = blk_mq_unique_tag_to_tag(blk_tag);

@Hannes, This looks like a pointless change - we make a 32b unique tag, 
including the request->tag, and then convert back to the request->tag.
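
To make the round trip concrete, here is a minimal userspace sketch, not
kernel code. The helper names (unique_tag() and friends) are made up for
illustration; only the 16-bit split is assumed to mirror
BLK_MQ_UNIQUE_TAG_BITS from include/linux/blk-mq.h:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror BLK_MQ_UNIQUE_TAG_BITS/_MASK in include/linux/blk-mq.h. */
#define UNIQUE_TAG_BITS 16
#define UNIQUE_TAG_MASK ((1u << UNIQUE_TAG_BITS) - 1)

/*
 * Hypothetical stand-in for blk_mq_unique_tag(): the hw queue index goes
 * in the upper 16 bits, the per-queue tag in the lower 16 bits.
 */
static uint32_t unique_tag(uint32_t hwq, int tag)
{
	assert(hwq <= UNIQUE_TAG_MASK);
	return (hwq << UNIQUE_TAG_BITS) | ((uint32_t)tag & UNIQUE_TAG_MASK);
}

/* Hypothetical stand-in for blk_mq_unique_tag_to_hwq(). */
static uint32_t unique_tag_to_hwq(uint32_t ut)
{
	return ut >> UNIQUE_TAG_BITS;
}

/* Hypothetical stand-in for blk_mq_unique_tag_to_tag(). */
static uint32_t unique_tag_to_tag(uint32_t ut)
{
	return ut & UNIQUE_TAG_MASK;
}
```

For any tag in 0..65535, unique_tag_to_tag(unique_tag(hwq, tag)) just gives
the tag back, so the conversion adds nothing for the index lookup. In this
sketch a tag of -1 would come back as 0xffff, so a later "idx < 0" test
would not trigger; whether that case can occur in the driver is not
something the sketch tells us.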

>>
>>        if (idx < 0)
>>                return idx;
>> @@ -7456,26 +7458,6 @@ static void hpsa_disable_interrupt_mode(struct ctlr_info *h)
>>        h->msix_vectors = 0;
>>    }

Thanks,
John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-13  8:42                     ` John Garry
@ 2020-07-19 19:07                       ` Kashyap Desai
  2020-07-20  7:23                       ` Kashyap Desai
  1 sibling, 0 replies; 123+ messages in thread
From: Kashyap Desai @ 2020-07-19 19:07 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

> > I also noticed nr_hw_queues are now exposed in sysfs -
> >
> > /sys/devices/pci0000:85/0000:85:00.0/0000:86:00.0/0000:87:04.0/0000:8b
> >
> :00.0/0000:8c:00.0/0000:8d:00.0/host14/scsi_host/host14/nr_hw_queues:1
> > 28
> > .
>
> That's on my v8 wip branch, so I guess you're picking it up from there.

John - I did more testing on v8 wip branch.  CPU hotplug is working as
expected, but I still see some performance issue on Logical Volumes.

I created an 8-drive RAID-0 VD on the MR controller; the performance impact
of this RFC is below. The contention looks to be on a single <sdev>.

I used this command:
"numactl -N 1 fio 1vd.fio --iodepth=128 --bs=4k --rw=randread
--cpus_allowed_policy=split --ioscheduler=none --group_reporting
--runtime=200 --numjobs=1"
IOPS without RFC = 300K; IOPS with RFC = 230K.

Perf top (shared host tag. IOPS = 230K)

    13.98%  [kernel]        [k] sbitmap_any_bit_set
     6.43%  [kernel]        [k] blk_mq_run_hw_queue
     6.00%  [kernel]        [k] __audit_syscall_exit
     3.47%  [kernel]        [k] read_tsc
     3.19%  [megaraid_sas]  [k] complete_cmd_fusion
     3.19%  [kernel]        [k] irq_entries_start
     2.75%  [kernel]        [k] blk_mq_run_hw_queues
     2.45%  fio             [.] fio_gettime
     1.76%  [kernel]        [k] entry_SYSCALL_64
     1.66%  [kernel]        [k] add_interrupt_randomness
     1.48%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
     1.42%  [kernel]        [k] copy_user_generic_string
     1.36%  [kernel]        [k] scsi_queue_rq
     1.03%  [kernel]        [k] kmem_cache_alloc
     1.03%  [kernel]        [k] internal_get_user_pages_fast
     1.01%  [kernel]        [k] swapgs_restore_regs_and_return_to_usermode
     0.96%  [kernel]        [k] kmem_cache_free
     0.88%  [kernel]        [k] blkdev_direct_IO
     0.84%  fio             [.] td_io_queue
     0.83%  [kernel]        [k] __get_user_4

Perf top (without shared host tag. IOPS = 300K)

     6.36%  [kernel]        [k] unroll_tree_refs
     5.77%  [kernel]        [k] __do_softirq
     4.56%  [kernel]        [k] irq_entries_start
     4.38%  [kernel]        [k] read_tsc
     3.95%  [megaraid_sas]  [k] complete_cmd_fusion
     3.21%  fio             [.] fio_gettime
     2.98%  [kernel]        [k] add_interrupt_randomness
     1.79%  [kernel]        [k] entry_SYSCALL_64
     1.61%  [kernel]        [k] copy_user_generic_string
     1.61%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
     1.34%  [kernel]        [k] scsi_queue_rq
     1.11%  [kernel]        [k] kmem_cache_free
     1.05%  [kernel]        [k] blkdev_direct_IO
     1.05%  [kernel]        [k] internal_get_user_pages_fast
     1.00%  [kernel]        [k] __memset
     1.00%  fio             [.] td_io_queue
     0.98%  [kernel]        [k] kmem_cache_alloc
     0.94%  [kernel]        [k] __get_user_4
     0.93%  [kernel]        [k] lookup_ioctx
     0.88%  [kernel]        [k] sbitmap_any_bit_set

Kashyap

>
> Thanks,
> John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-13  8:42                     ` John Garry
  2020-07-19 19:07                       ` Kashyap Desai
@ 2020-07-20  7:23                       ` Kashyap Desai
  2020-07-20  9:18                         ` John Garry
  2020-07-21  1:13                         ` Ming Lei
  1 sibling, 2 replies; 123+ messages in thread
From: Kashyap Desai @ 2020-07-20  7:23 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

> > > I also noticed nr_hw_queues are now exposed in sysfs -
> > >
> > >
> /sys/devices/pci0000:85/0000:85:00.0/0000:86:00.0/0000:87:04.0/0000:8b
> > >
> >
> :00.0/0000:8c:00.0/0000:8d:00.0/host14/scsi_host/host14/nr_hw_queues:1
> > > 28
> > > .
> >
> > That's on my v8 wip branch, so I guess you're picking it up from there.
>
> John - I did more testing on v8 wip branch.  CPU hotplug is working as
> expected, but I still see some performance issue on Logical Volumes.
>
> I created 8 Drives Raid-0 VD on MR controller and below is performance
> impact of this RFC. Looks like contention is on single <sdev>.
>
> I used command - "numactl -N 1  fio 1vd.fio --iodepth=128 --bs=4k --
> rw=randread --cpus_allowed_policy=split --ioscheduler=none --
> group_reporting --runtime=200 --numjobs=1"
> IOPS without RFC = 300K IOPS with RFC = 230K.
>
> Perf top (shared host tag. IOPS = 230K)
>
> 13.98%  [kernel]        [k] sbitmap_any_bit_set
>      6.43%  [kernel]        [k] blk_mq_run_hw_queue

blk_mq_run_hw_queue(), which is called from scsi_end_request(), takes more
CPU. It looks like blk_mq_hctx_has_pending() matters only for the elevator
(scheduler) case, so if the queue has ioscheduler=none we can skip it. In the
scheduler=none case, IO is pushed to the hardware queue and bypasses the
software queue. Based on this understanding, I added the patch below, and I
can see performance scale back to expectation.

Ming mentioned that we cannot remove blk_mq_run_hw_queues() from the IO
completion path, otherwise we may see an IO hang. So I have only modified the
completion path, assuming the queue run is required only for the IO scheduler
case.
https://www.spinics.net/lists/linux-block/msg55049.html

Please review and let me know if this is good or whether we have to address
this with a proper fix.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1be7ac5a4040..b6a5b41b7fc2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1559,6 +1559,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
        struct blk_mq_hw_ctx *hctx;
        int i;

+       if (!q->elevator)
+               return;
+
        queue_for_each_hw_ctx(q, hctx, i) {
                if (blk_mq_hctx_stopped(hctx))
                        continue;

Kashyap

>      6.00%  [kernel]        [k] __audit_syscall_exit
>      3.47%  [kernel]        [k] read_tsc
>      3.19%  [megaraid_sas]  [k] complete_cmd_fusion
>      3.19%  [kernel]        [k] irq_entries_start
>      2.75%  [kernel]        [k] blk_mq_run_hw_queues
>      2.45%  fio             [.] fio_gettime
>      1.76%  [kernel]        [k] entry_SYSCALL_64
>      1.66%  [kernel]        [k] add_interrupt_randomness
>      1.48%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
>      1.42%  [kernel]        [k] copy_user_generic_string
>      1.36%  [kernel]        [k] scsi_queue_rq
>      1.03%  [kernel]        [k] kmem_cache_alloc
>      1.03%  [kernel]        [k] internal_get_user_pages_fast
>      1.01%  [kernel]        [k] swapgs_restore_regs_and_return_to_usermode
>      0.96%  [kernel]        [k] kmem_cache_free
>      0.88%  [kernel]        [k] blkdev_direct_IO
>      0.84%  fio             [.] td_io_queue
>      0.83%  [kernel]        [k] __get_user_4
>
> Perf top (shared host tag. IOPS = 300K)
>
>     6.36%  [kernel]        [k] unroll_tree_refs
>      5.77%  [kernel]        [k] __do_softirq
>      4.56%  [kernel]        [k] irq_entries_start
>      4.38%  [kernel]        [k] read_tsc
>      3.95%  [megaraid_sas]  [k] complete_cmd_fusion
>      3.21%  fio             [.] fio_gettime
>      2.98%  [kernel]        [k] add_interrupt_randomness
>      1.79%  [kernel]        [k] entry_SYSCALL_64
>      1.61%  [kernel]        [k] copy_user_generic_string
>      1.61%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
>      1.34%  [kernel]        [k] scsi_queue_rq
>      1.11%  [kernel]        [k] kmem_cache_free
>      1.05%  [kernel]        [k] blkdev_direct_IO
>      1.05%  [kernel]        [k] internal_get_user_pages_fast
>      1.00%  [kernel]        [k] __memset
>      1.00%  fio             [.] td_io_queue
>      0.98%  [kernel]        [k] kmem_cache_alloc
>      0.94%  [kernel]        [k] __get_user_4
>      0.93%  [kernel]        [k] lookup_ioctx
>      0.88%  [kernel]        [k] sbitmap_any_bit_set
>
> Kashyap
>
> >
> > Thanks,
> > John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-20  7:23                       ` Kashyap Desai
@ 2020-07-20  9:18                         ` John Garry
  2020-07-21  1:13                         ` Ming Lei
  1 sibling, 0 replies; 123+ messages in thread
From: John Garry @ 2020-07-20  9:18 UTC (permalink / raw)
  To: Kashyap Desai, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang (M),
	PDL,MEGARAIDLINUX

On 20/07/2020 08:23, Kashyap Desai wrote:

Hi Kashyap,

>> John - I did more testing on v8 wip branch.  CPU hotplug is working as
>> expected, 

Good to hear.

>> but I still see some performance issue on Logical Volumes.
>>
>> I created 8 Drives Raid-0 VD on MR controller and below is performance
>> impact of this RFC. Looks like contention is on single <sdev>.
>>
>> I used command - "numactl -N 1  fio 1vd.fio --iodepth=128 --bs=4k --
>> rw=randread --cpus_allowed_policy=split --ioscheduler=none --
>> group_reporting --runtime=200 --numjobs=1"
>> IOPS without RFC = 300K IOPS with RFC = 230K.
>>
>> Perf top (shared host tag. IOPS = 230K)
>>
>> 13.98%  [kernel]        [k] sbitmap_any_bit_set

I guess that this comes from the blk_mq_hctx_has_pending() -> 
sbitmap_any_bit_set(&hctx->ctx_map) call. The 
list_empty_careful(&hctx->dispatch) and blk_mq_sched_has_work(hctx) 
[when scheduler=none] calls look pretty lightweight.

>>       6.43%  [kernel]        [k] blk_mq_run_hw_queue
> blk_mq_run_hw_queue function take more CPU which is called from "
> scsi_end_request"
> It looks like " blk_mq_hctx_has_pending" handles only elevator (scheduler)
> case. If  queue has ioscheduler=none, we can skip. I case of scheduler=none,
> IO will be pushed to hardware queue and it by pass software queue.
> Based on above understanding, I added below patch and I can see performance
> scale back to expectation.
> 
> Ming mentioned that - we cannot remove blk_mq_run_hw_queues() from IO
> completion path otherwise we may see IO hang. So I have just modified
> completion path assuming it is only required for IO scheduler case.
> https://www.spinics.net/lists/linux-block/msg55049.html
> 
> Please review and let me know if this is good or we have to address with
> proper fix.
> 

So what you're doing looks reasonable, but I would be concerned about 
missing the blk_mq_run_hw_queue() -> blk_mq_hctx_has_pending() -> 
list_empty_careful(&hctx->dispatch) check - if it's not really needed 
there, then why not remove?

Hi Ming, any opinion on this?

> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 1be7ac5a4040..b6a5b41b7fc2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1559,6 +1559,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
>          struct blk_mq_hw_ctx *hctx;
>          int i;
> 
> +       if (!q->elevator)
> +               return;
> +
>          queue_for_each_hw_ctx(q, hctx, i) {
>                  if (blk_mq_hctx_stopped(hctx))
>                          continue;
> 

Thanks,
John


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-20  7:23                       ` Kashyap Desai
  2020-07-20  9:18                         ` John Garry
@ 2020-07-21  1:13                         ` Ming Lei
  2020-07-21  6:53                           ` Kashyap Desai
  1 sibling, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-07-21  1:13 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Mon, Jul 20, 2020 at 12:53:55PM +0530, Kashyap Desai wrote:
> > > > I also noticed nr_hw_queues are now exposed in sysfs -
> > > >
> > > >
> > /sys/devices/pci0000:85/0000:85:00.0/0000:86:00.0/0000:87:04.0/0000:8b
> > > >
> > >
> > :00.0/0000:8c:00.0/0000:8d:00.0/host14/scsi_host/host14/nr_hw_queues:1
> > > > 28
> > > > .
> > >
> > > That's on my v8 wip branch, so I guess you're picking it up from there.
> >
> > John - I did more testing on v8 wip branch.  CPU hotplug is working as
> > expected, but I still see some performance issue on Logical Volumes.
> >
> > I created 8 Drives Raid-0 VD on MR controller and below is performance
> > impact of this RFC. Looks like contention is on single <sdev>.
> >
> > I used command - "numactl -N 1  fio 1vd.fio --iodepth=128 --bs=4k --
> > rw=randread --cpus_allowed_policy=split --ioscheduler=none --
> > group_reporting --runtime=200 --numjobs=1"
> > IOPS without RFC = 300K IOPS with RFC = 230K.
> >
> > Perf top (shared host tag. IOPS = 230K)
> >
> > 13.98%  [kernel]        [k] sbitmap_any_bit_set
> >      6.43%  [kernel]        [k] blk_mq_run_hw_queue
> 
> blk_mq_run_hw_queue function take more CPU which is called from "
> scsi_end_request"

The problem could be that nr_hw_queues has increased a lot, so that
samples in blk_mq_run_hw_queue() can be observed now.
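
A rough userspace model of that scaling, purely illustrative: the struct,
the function names and WORDS_PER_MAP are invented here, and the only thing
taken from the discussion is that each completion walks every hctx and
checks its ctx_map for pending work:

```c
#include <assert.h>
#include <stddef.h>

#define WORDS_PER_MAP 4	/* arbitrary bitmap size, chosen for the sketch */

/* Hypothetical stand-in for a hardware queue context and its ctx_map. */
struct hctx_model {
	unsigned long ctx_map[WORDS_PER_MAP];
};

/*
 * Scan a bitmap word by word, like an "any bit set" check.
 * *scanned counts how many words had to be read.
 */
static int any_bit_set(const unsigned long *map, size_t nwords, size_t *scanned)
{
	for (size_t i = 0; i < nwords; i++) {
		(*scanned)++;
		if (map[i])
			return 1;
	}
	return 0;
}

/*
 * Model one completion that re-runs all hw queues: returns the number of
 * bitmap words read across all hctxs.
 */
static size_t words_read_per_completion(const struct hctx_model *hctxs,
					size_t nr_hw_queues)
{
	size_t scanned = 0;

	assert(hctxs != NULL);
	for (size_t i = 0; i < nr_hw_queues; i++)
		any_bit_set(hctxs[i].ctx_map, WORDS_PER_MAP, &scanned);
	return scanned;
}
```

With idle bitmaps, the work per completion in this model grows linearly
with nr_hw_queues (4 words for one queue, 512 for 128 queues), which would
fit sbitmap_any_bit_set showing up once the queue count goes up.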

> It looks like " blk_mq_hctx_has_pending" handles only elevator (scheduler)
> case. If  queue has ioscheduler=none, we can skip. I case of scheduler=none,
> IO will be pushed to hardware queue and it by pass software queue.
> Based on above understanding, I added below patch and I can see performance
> scale back to expectation.
> 
> Ming mentioned that - we cannot remove blk_mq_run_hw_queues() from IO
> completion path otherwise we may see IO hang. So I have just modified
> completion path assuming it is only required for IO scheduler case.
> https://www.spinics.net/lists/linux-block/msg55049.html
> 
> Please review and let me know if this is good or we have to address with
> proper fix.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 1be7ac5a4040..b6a5b41b7fc2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -1559,6 +1559,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
>         struct blk_mq_hw_ctx *hctx;
>         int i;
> 
> +       if (!q->elevator)
> +               return;
> +

This doesn't look correct: blk_mq_run_hw_queues() is still needed for
none, because a request may not be dispatched successfully by direct issue.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-21  1:13                         ` Ming Lei
@ 2020-07-21  6:53                           ` Kashyap Desai
  2020-07-22  4:12                             ` Ming Lei
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-07-21  6:53 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> > >
> > > Perf top (shared host tag. IOPS = 230K)
> > >
> > > 13.98%  [kernel]        [k] sbitmap_any_bit_set
> > >      6.43%  [kernel]        [k] blk_mq_run_hw_queue
> >
> > blk_mq_run_hw_queue function take more CPU which is called from "
> > scsi_end_request"
>
> The problem could be that nr_hw_queues is increased a lot so that sample on
> blk_mq_run_hw_queue() can be observed now.

Yes. That is correct.

>
> > It looks like " blk_mq_hctx_has_pending" handles only elevator
> > (scheduler) case. If  queue has ioscheduler=none, we can skip. I case
> > of scheduler=none, IO will be pushed to hardware queue and it by pass
> software queue.
> > Based on above understanding, I added below patch and I can see
> > performance scale back to expectation.
> >
> > Ming mentioned that - we cannot remove blk_mq_run_hw_queues() from IO
> > completion path otherwise we may see IO hang. So I have just modified
> > completion path assuming it is only required for IO scheduler case.
> > https://www.spinics.net/lists/linux-block/msg55049.html
> >
> > Please review and let me know if this is good or we have to address
> > with proper fix.
> >
> > diff --git a/block/blk-mq.c b/block/blk-mq.c index
> > 1be7ac5a4040..b6a5b41b7fc2 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1559,6 +1559,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
> >         struct blk_mq_hw_ctx *hctx;
> >         int i;
> >
> > +       if (!q->elevator)
> > +               return;
> > +
>
> This way shouldn't be correct, blk_mq_run_hw_queues() is still needed for
> none because request may not be dispatched successfully by direct issue.

When the block layer attempts to post a request to the h/w queue directly
(for ioscheduler=none) and that fails, it calls
blk_mq_request_bypass_insert().
blk_mq_request_bypass_insert() will then start the h/w queue from the
submission context. Do we still have an issue if we skip running the hw
queue from completion?

Kashyap

>
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-21  6:53                           ` Kashyap Desai
@ 2020-07-22  4:12                             ` Ming Lei
  2020-07-22  5:30                               ` Kashyap Desai
  0 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-07-22  4:12 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Tue, Jul 21, 2020 at 12:23:39PM +0530, Kashyap Desai wrote:
> > > >
> > > > Perf top (shared host tag. IOPS = 230K)
> > > >
> > > > 13.98%  [kernel]        [k] sbitmap_any_bit_set
> > > >      6.43%  [kernel]        [k] blk_mq_run_hw_queue
> > >
> > > blk_mq_run_hw_queue function take more CPU which is called from "
> > > scsi_end_request"
> >
> > The problem could be that nr_hw_queues is increased a lot so that sample on
> > blk_mq_run_hw_queue() can be observed now.
> 
> Yes. That is correct.
> 
> >
> > > It looks like " blk_mq_hctx_has_pending" handles only elevator
> > > (scheduler) case. If  queue has ioscheduler=none, we can skip. I case
> > > of scheduler=none, IO will be pushed to hardware queue and it by pass
> > software queue.
> > > Based on above understanding, I added below patch and I can see
> > > performance scale back to expectation.
> > >
> > > Ming mentioned that - we cannot remove blk_mq_run_hw_queues() from IO
> > > completion path otherwise we may see IO hang. So I have just modified
> > > completion path assuming it is only required for IO scheduler case.
> > > https://www.spinics.net/lists/linux-block/msg55049.html
> > >
> > > Please review and let me know if this is good or we have to address
> > > with proper fix.
> > >
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c index
> > > 1be7ac5a4040..b6a5b41b7fc2 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -1559,6 +1559,9 @@ void blk_mq_run_hw_queues(struct request_queue *q, bool async)
> > >         struct blk_mq_hw_ctx *hctx;
> > >         int i;
> > >
> > > +       if (!q->elevator)
> > > +               return;
> > > +
> >
> > This way shouldn't be correct, blk_mq_run_hw_queues() is still needed for
> > none because request may not be dispatched successfully by direct issue.
> 
> When block layer attempt posting request to h/w queue directly (for
> ioscheduler=none) and if it fails, it is calling
> blk_mq_request_bypass_insert().
> blk_mq_request_bypass_insert function will start the h/w queue from
> submission context. Do we still have an issue if we skip running hw queue
> from completion ?

The thing is that we can't guarantee that a request is always issued
directly or added to hctx->dispatch for MQ/none. For example, a request
can still be added to the sw queue from blk_mq_flush_plug_list() when an
mq plug is applied.
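
A toy model of that case, as a sketch only: the struct and function names
are invented, and "budget" is simplified to a counter of free device slots.
It shows a plugged request parked in the sw queue that only gets dispatched
because the completion path re-runs the queue:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified model of one request queue. */
struct queue_model {
	bool has_elevator;	/* false models scheduler=none */
	int sw_queued;		/* requests parked in the software queue */
	int budget;		/* free device slots */
	int inflight;		/* requests currently at the device */
};

/* Dispatch from the sw queue for as long as the device has free slots. */
static void run_hw_queue(struct queue_model *q)
{
	while (q->sw_queued > 0 && q->budget > 0) {
		q->sw_queued--;
		q->budget--;
		q->inflight++;
	}
}

/* Plug flush: requests land in the sw queue, then the queue is run once. */
static void flush_plug(struct queue_model *q, int nr)
{
	q->sw_queued += nr;
	run_hw_queue(q);
}

/*
 * Completion: a slot is freed. If skip_when_no_elevator is set, mimic the
 * proposed early return and do not re-run the queue for scheduler=none.
 */
static void complete_one(struct queue_model *q, bool skip_when_no_elevator)
{
	assert(q->inflight > 0);
	q->inflight--;
	q->budget++;
	if (skip_when_no_elevator && !q->has_elevator)
		return;
	run_hw_queue(q);
}
```

Starting with budget = 1 and flushing two plugged requests, one request
stays in the sw queue. With the early return it is still there after the
completion; without it, the completion dispatches it. Whether the real
submission paths always cover this is not something the model decides.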

Also, I am not sure it is a good idea to add a request to hctx->dispatch
via blk_mq_request_bypass_insert() in __blk_mq_try_issue_directly() when
we run out of budget, because this may hurt sequential IO performance.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-22  4:12                             ` Ming Lei
@ 2020-07-22  5:30                               ` Kashyap Desai
  2020-07-22  8:04                                 ` Ming Lei
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-07-22  5:30 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> On Tue, Jul 21, 2020 at 12:23:39PM +0530, Kashyap Desai wrote:
> > > > >
> > > > > Perf top (shared host tag. IOPS = 230K)
> > > > >
> > > > > 13.98%  [kernel]        [k] sbitmap_any_bit_set
> > > > >      6.43%  [kernel]        [k] blk_mq_run_hw_queue
> > > >
> > > > blk_mq_run_hw_queue function take more CPU which is called from "
> > > > scsi_end_request"
> > >
> > > The problem could be that nr_hw_queues is increased a lot so that
> > > sample
> > on
> > > blk_mq_run_hw_queue() can be observed now.
> >
> > Yes. That is correct.
> >
> > >
> > > > It looks like " blk_mq_hctx_has_pending" handles only elevator
> > > > (scheduler) case. If  queue has ioscheduler=none, we can skip. I
> > > > case of scheduler=none, IO will be pushed to hardware queue and it
> > > > by pass
> > > software queue.
> > > > Based on above understanding, I added below patch and I can see
> > > > performance scale back to expectation.
> > > >
> > > > Ming mentioned that - we cannot remove blk_mq_run_hw_queues()
> from
> > > > IO completion path otherwise we may see IO hang. So I have just
> > > > modified completion path assuming it is only required for IO
scheduler
> case.
> > > > https://www.spinics.net/lists/linux-block/msg55049.html
> > > >
> > > > Please review and let me know if this is good or we have to
> > > > address with proper fix.
> > > >
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c index
> > > > 1be7ac5a4040..b6a5b41b7fc2 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -1559,6 +1559,9 @@ void blk_mq_run_hw_queues(struct
> > > request_queue
> > > > *q, bool async)
> > > >         struct blk_mq_hw_ctx *hctx;
> > > >         int i;
> > > >
> > > > +       if (!q->elevator)
> > > > +               return;
> > > > +
> > >
> > > This way shouldn't be correct, blk_mq_run_hw_queues() is still
> > > needed
> > for
> > > none because request may not be dispatched successfully by direct
issue.
> >
> > When block layer attempt posting request to h/w queue directly (for
> > ioscheduler=none) and if it fails, it is calling
> > blk_mq_request_bypass_insert().
> > blk_mq_request_bypass_insert function will start the h/w queue from
> > submission context. Do we still have an issue if we skip running hw
> > queue from completion ?
>
> The thing is that we can't guarantee that direct issue or adding request
> into hctx->dispatch is always done for MQ/none, for example, request still
> can be added to sw queue from blk_mq_flush_plug_list() when mq plug is
> applied.

I see that even blk_mq_sched_insert_requests(), called from
blk_mq_flush_plug_list(), makes sure it runs the h/w queue. If every
submission path that deals with the s/w queue makes sure to run the h/w
queue, can't we remove blk_mq_run_hw_queues() from scsi_end_request()?

>
> Also, I am not sure it is a good idea to add request into hctx->dispatch
> via blk_mq_request_bypass_insert() in __blk_mq_try_issue_directly() in
> case of running out of budget, because this way may hurt sequential IO
> performance.
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-22  5:30                               ` Kashyap Desai
@ 2020-07-22  8:04                                 ` Ming Lei
  2020-07-22  9:32                                   ` John Garry
  2020-07-28  8:01                                   ` Kashyap Desai
  0 siblings, 2 replies; 123+ messages in thread
From: Ming Lei @ 2020-07-22  8:04 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Wed, Jul 22, 2020 at 11:00:45AM +0530, Kashyap Desai wrote:
> > On Tue, Jul 21, 2020 at 12:23:39PM +0530, Kashyap Desai wrote:
> > > > > >
> > > > > > Perf top (shared host tag. IOPS = 230K)
> > > > > >
> > > > > > 13.98%  [kernel]        [k] sbitmap_any_bit_set
> > > > > >      6.43%  [kernel]        [k] blk_mq_run_hw_queue
> > > > >
> > > > > blk_mq_run_hw_queue function take more CPU which is called from "
> > > > > scsi_end_request"
> > > >
> > > > The problem could be that nr_hw_queues is increased a lot so that
> > > > sample
> > > on
> > > > blk_mq_run_hw_queue() can be observed now.
> > >
> > > Yes. That is correct.
> > >
> > > >
> > > > > It looks like " blk_mq_hctx_has_pending" handles only elevator
> > > > > (scheduler) case. If  queue has ioscheduler=none, we can skip. I
> > > > > case of scheduler=none, IO will be pushed to hardware queue and it
> > > > > by pass
> > > > software queue.
> > > > > Based on above understanding, I added below patch and I can see
> > > > > performance scale back to expectation.
> > > > >
> > > > > Ming mentioned that - we cannot remove blk_mq_run_hw_queues()
> > from
> > > > > IO completion path otherwise we may see IO hang. So I have just
> > > > > modified completion path assuming it is only required for IO
> scheduler
> > case.
> > > > > https://www.spinics.net/lists/linux-block/msg55049.html
> > > > >
> > > > > Please review and let me know if this is good or we have to
> > > > > address with proper fix.
> > > > >
> > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c index
> > > > > 1be7ac5a4040..b6a5b41b7fc2 100644
> > > > > --- a/block/blk-mq.c
> > > > > +++ b/block/blk-mq.c
> > > > > @@ -1559,6 +1559,9 @@ void blk_mq_run_hw_queues(struct
> > > > request_queue
> > > > > *q, bool async)
> > > > >         struct blk_mq_hw_ctx *hctx;
> > > > >         int i;
> > > > >
> > > > > +       if (!q->elevator)
> > > > > +               return;
> > > > > +
> > > >
> > > > This way shouldn't be correct, blk_mq_run_hw_queues() is still
> > > > needed
> > > for
> > > > none because request may not be dispatched successfully by direct
> issue.
> > >
> > > When block layer attempt posting request to h/w queue directly (for
> > > ioscheduler=none) and if it fails, it is calling
> > > blk_mq_request_bypass_insert().
> > > blk_mq_request_bypass_insert function will start the h/w queue from
> > > submission context. Do we still have an issue if we skip running hw
> > > queue from completion ?
> >
> > The thing is that we can't guarantee that direct issue or adding request
> into
> > hctx->dispatch is always done for MQ/none, for example, request still
> > can be added to sw queue from blk_mq_flush_plug_list() when mq plug is
> > applied.
> 
> I see even blk_mq_sched_insert_requests() from blk_mq_flush_plug_list make
> sure it run the h/w queue. If all the submission path which deals with s/w
> queue make sure they run h/w queue, can't we remove blk_mq_run_hw_queues()
> from scsi_end_request ?

No, one purpose of blk_mq_run_hw_queues() is to rerun the queue in case the
dispatch budget runs out in the submission path, and sdev->device_busy is
shared by all hw queues on this scsi device.
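
To make the point about the shared budget concrete, here is a minimal
userspace sketch. It is only a model, not kernel code: the struct and the
model_*() helpers are hypothetical names, and the only thing borrowed from
the discussion is that one device_busy counter is shared by every hw queue.
It shows that a request parked on one hw queue because another hw queue
consumed the budget is only dispatched if the completion re-runs the queues.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_HW_QUEUES 4

/* Hypothetical userspace model of one SCSI device; not kernel code. */
struct model_sdev {
	int device_busy;           /* in-flight commands, shared by all hw queues */
	int queue_depth;           /* per-device budget limit */
	int pending[NR_HW_QUEUES]; /* requests parked per hw queue */
	int dispatched;            /* requests that reached the device */
};

/* Take one unit of the shared budget, if any is left. */
static bool model_get_budget(struct model_sdev *sdev)
{
	if (sdev->device_busy >= sdev->queue_depth)
		return false;
	sdev->device_busy++;
	return true;
}

/* Submit on one hw queue; park the request if the shared budget is gone. */
static void model_submit(struct model_sdev *sdev, int hctx)
{
	assert(hctx >= 0 && hctx < NR_HW_QUEUES);
	if (model_get_budget(sdev))
		sdev->dispatched++;
	else
		sdev->pending[hctx]++;
}

/* Completion: release budget, then optionally re-run every hw queue. */
static void model_complete(struct model_sdev *sdev, bool rerun_all)
{
	sdev->device_busy--;
	if (!rerun_all)
		return;
	for (int i = 0; i < NR_HW_QUEUES; i++) {
		while (sdev->pending[i] > 0 && model_get_budget(sdev)) {
			sdev->pending[i]--;
			sdev->dispatched++;
		}
	}
}

/* Count requests still parked across all hw queues. */
static int model_total_pending(const struct model_sdev *sdev)
{
	int total = 0;

	for (int i = 0; i < NR_HW_QUEUES; i++)
		total += sdev->pending[i];
	return total;
}
```

With queue_depth = 2, two submissions on hw queue 0 use up the budget and a
third on hw queue 3 is parked. Calling model_complete() with rerun_all set to
false leaves it parked; with rerun_all set to true it is dispatched.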

I posted one patch for avoiding it in scsi_end_request() before, but it
looks like it never landed upstream:

https://lore.kernel.org/linux-block/20191118100640.3673-1-ming.lei@redhat.com/

Thanks,
Ming


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-22  8:04                                 ` Ming Lei
@ 2020-07-22  9:32                                   ` John Garry
  2020-07-23 14:07                                     ` Ming Lei
  2020-07-28  8:01                                   ` Kashyap Desai
  1 sibling, 1 reply; 123+ messages in thread
From: John Garry @ 2020-07-22  9:32 UTC (permalink / raw)
  To: Ming Lei, Kashyap Desai
  Cc: axboe, jejb, martin.petersen, don.brace, Sumit Saxena,
	bvanassche, hare, hch, Shivasharan Srikanteshwara, linux-block,
	linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

>>>>>>
>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c index
>>>>>> 1be7ac5a4040..b6a5b41b7fc2 100644
>>>>>> --- a/block/blk-mq.c
>>>>>> +++ b/block/blk-mq.c
>>>>>> @@ -1559,6 +1559,9 @@ void blk_mq_run_hw_queues(struct
>>>>> request_queue
>>>>>> *q, bool async)
>>>>>>          struct blk_mq_hw_ctx *hctx;
>>>>>>          int i;
>>>>>>
>>>>>> +       if (!q->elevator)
>>>>>> +               return;
>>>>>> +
>>>>> This way shouldn't be correct, blk_mq_run_hw_queues() is still
>>>>> needed

Could the logic of blk_mq_run_hw_queue() -> blk_mq_hctx_has_pending() -> 
sbitmap_any_bit_set(&hctx->ctx_map) be optimised for megaraid scenario?

As I see it, since megaraid will have a 1:1 mapping of CPU to hw queue,
will there only ever be a single bit set in ctx_map? If so, it seems a
waste to always check every sbitmap map. But adding logic for this may
negate any possible gains.

>>>> for
>>>>> none because request may not be dispatched successfully by direct
>> issue.
>>>> When block layer attempt posting request to h/w queue directly (for
>>>> ioscheduler=none) and if it fails, it is calling
>>>> blk_mq_request_bypass_insert().
>>>> blk_mq_request_bypass_insert function will start the h/w queue from
>>>> submission context. Do we still have an issue if we skip running hw
>>>> queue from completion ?
>>> The thing is that we can't guarantee that direct issue or adding request
>> into
>>> hctx->dispatch is always done for MQ/none, for example, request still
>>> can be added to sw queue from blk_mq_flush_plug_list() when mq plug is
>>> applied.
>> I see even blk_mq_sched_insert_requests() from blk_mq_flush_plug_list make
>> sure it run the h/w queue. If all the submission path which deals with s/w
>> queue make sure they run h/w queue, can't we remove blk_mq_run_hw_queues()
>> from scsi_end_request ?
> No, one purpose of blk_mq_run_hw_queues() is for rerun queue in case that
> dispatch budget is running out of in submission path, and sdev->device_busy is
> shared by all hw queues on this scsi device.
> 
> I posted one patch for avoiding it in scsi_end_request() before, looks it
> never lands upstream:
> 

I saw that you actually posted the v3:
https://lore.kernel.org/linux-scsi/BL0PR2101MB11230C5F70151037B23C0C35CE2D0@BL0PR2101MB1123.namprd21.prod.outlook.com/
And it no longer applies, due to the changes in scsi_mq_get_budget(), I 
think, which look non-trivial. Any chance to repost?

Thanks,
John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-22  9:32                                   ` John Garry
@ 2020-07-23 14:07                                     ` Ming Lei
  2020-07-23 17:29                                       ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-07-23 14:07 UTC (permalink / raw)
  To: John Garry
  Cc: Kashyap Desai, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Wed, Jul 22, 2020 at 10:32:41AM +0100, John Garry wrote:
> > > > > > > 
> > > > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c index
> > > > > > > 1be7ac5a4040..b6a5b41b7fc2 100644
> > > > > > > --- a/block/blk-mq.c
> > > > > > > +++ b/block/blk-mq.c
> > > > > > > @@ -1559,6 +1559,9 @@ void blk_mq_run_hw_queues(struct
> > > > > > request_queue
> > > > > > > *q, bool async)
> > > > > > >          struct blk_mq_hw_ctx *hctx;
> > > > > > >          int i;
> > > > > > > 
> > > > > > > +       if (!q->elevator)
> > > > > > > +               return;
> > > > > > > +
> > > > > > This way shouldn't be correct, blk_mq_run_hw_queues() is still
> > > > > > needed
> 
> Could the logic of blk_mq_run_hw_queue() -> blk_mq_hctx_has_pending() ->
> sbitmap_any_bit_set(&hctx->ctx_map) be optimised for megaraid scenario?
> 
> As I see, since megaraid will have 1:1 mapping of CPU to hw queue, will
> there only ever possibly a single bit set in ctx_map? If so, it seems a
> waste to always check every sbitmap map. But adding logic for this may
> negate any possible gains.

It really depends on the min and max cpu id in the map; the sbitmap depth
can then be reduced to (max - min + 1). I'd suggest double-checking that
the cost of sbitmap_any_bit_set() really matters.

> 
> > > > > for
> > > > > > none because request may not be dispatched successfully by direct
> > > issue.
> > > > > When block layer attempt posting request to h/w queue directly (for
> > > > > ioscheduler=none) and if it fails, it is calling
> > > > > blk_mq_request_bypass_insert().
> > > > > blk_mq_request_bypass_insert function will start the h/w queue from
> > > > > submission context. Do we still have an issue if we skip running hw
> > > > > queue from completion ?
> > > > The thing is that we can't guarantee that direct issue or adding request
> > > into
> > > > hctx->dispatch is always done for MQ/none, for example, request still
> > > > can be added to sw queue from blk_mq_flush_plug_list() when mq plug is
> > > > applied.
> > > I see even blk_mq_sched_insert_requests() from blk_mq_flush_plug_list make
> > > sure it run the h/w queue. If all the submission path which deals with s/w
> > > queue make sure they run h/w queue, can't we remove blk_mq_run_hw_queues()
> > > from scsi_end_request ?
> > No, one purpose of blk_mq_run_hw_queues() is for rerun queue in case that
> > dispatch budget is running out of in submission path, and sdev->device_busy is
> > shared by all hw queues on this scsi device.
> > 
> > I posted one patch for avoiding it in scsi_end_request() before, looks it
> > never lands upstream:
> > 
> 
> I saw that you actually posted the v3:
> https://lore.kernel.org/linux-scsi/BL0PR2101MB11230C5F70151037B23C0C35CE2D0@BL0PR2101MB1123.namprd21.prod.outlook.com/
> And it no longer applies, due to the changes in scsi_mq_get_budget(), I
> think, which look non-trivial. Any chance to repost?

OK, will post V4.


thanks,
Ming


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-23 14:07                                     ` Ming Lei
@ 2020-07-23 17:29                                       ` John Garry
  2020-07-24  2:47                                         ` Ming Lei
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-07-23 17:29 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kashyap Desai, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

>> As I see, since megaraid will have 1:1 mapping of CPU to hw queue, will
>> there only ever possibly a single bit set in ctx_map? If so, it seems a
>> waste to always check every sbitmap map. But adding logic for this may
>> negate any possible gains.
> 
> It really depends on min and max cpu id in the map, then sbitmap
> depth can be reduced to (max - min + 1). I'd suggest to double check that
> cost of sbitmap_any_bit_set() really matters.

Hi Ming,

I'm not sure that reducing the search range would help much, as we still 
need to load some indexes of map[], and at best this may be reduced from 
2/3 -> 1 elements, depending on nr_cpus.

> 
>>
>>>>>> for
>>>>>>> none because request may not be dispatched successfully by direct
>>>> issue.
>>>>>> When block layer attempt posting request to h/w queue directly (for
>>>>>> ioscheduler=none) and if it fails, it is calling
>>>>>> blk_mq_request_bypass_insert().
>>>>>> blk_mq_request_bypass_insert function will start the h/w queue from
>>>>>> submission context. Do we still have an issue if we skip running hw
>>>>>> queue from completion ?
>>>>> The thing is that we can't guarantee that direct issue or adding request
>>>> into
>>>>> hctx->dispatch is always done for MQ/none, for example, request still
>>>>> can be added to sw queue from blk_mq_flush_plug_list() when mq plug is
>>>>> applied.
>>>> I see even blk_mq_sched_insert_requests() from blk_mq_flush_plug_list make
>>>> sure it run the h/w queue. If all the submission path which deals with s/w
>>>> queue make sure they run h/w queue, can't we remove blk_mq_run_hw_queues()
>>>> from scsi_end_request ?
>>> No, one purpose of blk_mq_run_hw_queues() is for rerun queue in case that
>>> dispatch budget is running out of in submission path, and sdev->device_busy is
>>> shared by all hw queues on this scsi device.
>>>
>>> I posted one patch for avoiding it in scsi_end_request() before, looks it
>>> never lands upstream:
>>>
>>
>> I saw that you actually posted the v3:
>> https://lore.kernel.org/linux-scsi/BL0PR2101MB11230C5F70151037B23C0C35CE2D0@BL0PR2101MB1123.namprd21.prod.outlook.com/
>> And it no longer applies, due to the changes in scsi_mq_get_budget(), I
>> think, which look non-trivial. Any chance to repost?
> 
> OK, will post V4.

Thanks!

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-23 17:29                                       ` John Garry
@ 2020-07-24  2:47                                         ` Ming Lei
  2020-07-28  7:54                                           ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-07-24  2:47 UTC (permalink / raw)
  To: John Garry
  Cc: Kashyap Desai, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Thu, Jul 23, 2020 at 06:29:01PM +0100, John Garry wrote:
> > > As I see, since megaraid will have 1:1 mapping of CPU to hw queue, will
> > > there only ever possibly a single bit set in ctx_map? If so, it seems a
> > > waste to always check every sbitmap map. But adding logic for this may
> > > negate any possible gains.
> > 
> > It really depends on min and max cpu id in the map, then sbitmap
> > depth can be reduced to (max - min + 1). I'd suggest to double check that
> > cost of sbitmap_any_bit_set() really matters.
> 
> Hi Ming,
> 
> I'm not sure that reducing the search range would help much, as we still
> need to load some indexes of map[], and at best this may be reduced from 2/3
> -> 1 elements, depending on nr_cpus.

I believe you misunderstood my idea, and you have to think about it from the
implementation viewpoint.

The only workable way is to store the min cpu id as 'offset' and set the sbitmap
depth to (max - min + 1), isn't it? Then the actual cpu id can be figured out via
'offset' + nr_bit. And the whole indexes are just spread over the actual depth. BTW,
max & min are the max / min cpu ids in hctx->cpu_map. So we can improve not only the
1:1 case; I guess most MQ cases can benefit from the change, since it shouldn't be
usual for one ctx_map to cover both 0 & nr_cpu_id - 1.

Meantime, we need to allocate the sbitmap dynamically.
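
The offset idea above can be sketched roughly as below. This is only an
illustration under stated assumptions: struct ctx_window and the
ctx_window_*() helpers are hypothetical names and not the kernel sbitmap
API, and the word count assumes one unsigned long per map word, which is a
simplification of how sbitmap actually sizes its words.

```c
#include <assert.h>
#include <limits.h>

/* Simplifying assumption: one map word holds one unsigned long of bits. */
#define MODEL_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical descriptor for the cpu id range covered by one hw queue. */
struct ctx_window {
	unsigned int offset; /* min cpu id mapped to this hw queue */
	unsigned int depth;  /* max - min + 1 */
};

/* Build the window from the min and max cpu id of the hw queue. */
static struct ctx_window ctx_window_init(unsigned int min_cpu,
					 unsigned int max_cpu)
{
	assert(min_cpu <= max_cpu);
	struct ctx_window w = {
		.offset = min_cpu,
		.depth = max_cpu - min_cpu + 1,
	};
	return w;
}

/* cpu id -> bit number inside the reduced bitmap. */
static unsigned int ctx_window_bit(const struct ctx_window *w,
				   unsigned int cpu)
{
	assert(cpu >= w->offset && cpu - w->offset < w->depth);
	return cpu - w->offset;
}

/* bit number -> cpu id, i.e. 'offset' + nr_bit. */
static unsigned int ctx_window_cpu(const struct ctx_window *w,
				   unsigned int bit)
{
	assert(bit < w->depth);
	return w->offset + bit;
}

/* Number of words an any-bit-set scan would have to load. */
static unsigned int ctx_window_words(const struct ctx_window *w)
{
	return (w->depth + MODEL_BITS_PER_WORD - 1) / MODEL_BITS_PER_WORD;
}
```

For a hw queue covering cpus 96..99, the window has depth 4, cpu 97 maps to
bit 1, and a scan touches a single word regardless of the total cpu count.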


Thanks,
Ming


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-24  2:47                                         ` Ming Lei
@ 2020-07-28  7:54                                           ` John Garry
  2020-07-28  8:45                                             ` Ming Lei
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-07-28  7:54 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kashyap Desai, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On 24/07/2020 03:47, Ming Lei wrote:
> On Thu, Jul 23, 2020 at 06:29:01PM +0100, John Garry wrote:
>>>> As I see, since megaraid will have 1:1 mapping of CPU to hw queue, will
>>>> there only ever possibly a single bit set in ctx_map? If so, it seems a
>>>> waste to always check every sbitmap map. But adding logic for this may
>>>> negate any possible gains.
>>>
>>> It really depends on min and max cpu id in the map, then sbitmap
>>> depth can be reduced to (max - min + 1). I'd suggest to double check that
>>> cost of sbitmap_any_bit_set() really matters.
>>
>> Hi Ming,
>>
>> I'm not sure that reducing the search range would help much, as we still
>> need to load some indexes of map[], and at best this may be reduced from 2/3
>> -> 1 elements, depending on nr_cpus.
> 
> I believe you misunderstood my idea, and you have to think it from implementation
> viewpoint.
> 
> The only workable way is to store the min cpu id as 'offset' and set the sbitmap
> depth as (max - min + 1), isn't it? Then the actual cpu id can be figured out via
> 'offset' + nr_bit. And the whole indexes are just spread on the actual depth. BTW,
> max & min is the max / min cpu id in hctx->cpu_map. So we can improve not only on 1:1,
> and I guess most of MQ cases can benefit from the change, since it shouldn't be usual
> for one ctx_map to cover both 0 & nr_cpu_id - 1.
> 
> Meantime, we need to allocate the sbitmap dynamically.

OK, so dynamically allocating the sbitmap could be good. I was thinking 
previously that we still allocate for nr_cpus size, and search a limited 
range - but this would have heavier runtime overhead.

So if you really think that this may have some value, then let me know, 
so we can look to take it forward.

Thanks,
John


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-22  8:04                                 ` Ming Lei
  2020-07-22  9:32                                   ` John Garry
@ 2020-07-28  8:01                                   ` Kashyap Desai
  1 sibling, 0 replies; 123+ messages in thread
From: Kashyap Desai @ 2020-07-28  8:01 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> On Wed, Jul 22, 2020 at 11:00:45AM +0530, Kashyap Desai wrote:
> > > On Tue, Jul 21, 2020 at 12:23:39PM +0530, Kashyap Desai wrote:
> > > > > > >
> > > > > > > Perf top (shared host tag. IOPS = 230K)
> > > > > > >
> > > > > > > 13.98%  [kernel]        [k] sbitmap_any_bit_set
> > > > > > >      6.43%  [kernel]        [k] blk_mq_run_hw_queue
> > > > > >
> > > > > > blk_mq_run_hw_queue function take more CPU which is called
from
> "
> > > > > > scsi_end_request"
> > > > >
> > > > > The problem could be that nr_hw_queues is increased a lot so
> > > > > that sample
> > > > on
> > > > > blk_mq_run_hw_queue() can be observed now.
> > > >
> > > > Yes. That is correct.
> > > >
> > > > >
> > > > > > It looks like " blk_mq_hctx_has_pending" handles only elevator
> > > > > > (scheduler) case. If  queue has ioscheduler=none, we can skip.
> > > > > > I case of scheduler=none, IO will be pushed to hardware queue
> > > > > > and it by pass
> > > > > software queue.
> > > > > > Based on above understanding, I added below patch and I can
> > > > > > see performance scale back to expectation.
> > > > > >
> > > > > > Ming mentioned that - we cannot remove blk_mq_run_hw_queues()
> > > from
> > > > > > IO completion path otherwise we may see IO hang. So I have
> > > > > > just modified completion path assuming it is only required for
> > > > > > IO
> > scheduler
> > > case.
> > > > > > https://www.spinics.net/lists/linux-block/msg55049.html
> > > > > >
> > > > > > Please review and let me know if this is good or we have to
> > > > > > address with proper fix.
> > > > > >
> > > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c index
> > > > > > 1be7ac5a4040..b6a5b41b7fc2 100644
> > > > > > --- a/block/blk-mq.c
> > > > > > +++ b/block/blk-mq.c
> > > > > > @@ -1559,6 +1559,9 @@ void blk_mq_run_hw_queues(struct
> > > > > request_queue
> > > > > > *q, bool async)
> > > > > >         struct blk_mq_hw_ctx *hctx;
> > > > > >         int i;
> > > > > >
> > > > > > +       if (!q->elevator)
> > > > > > +               return;
> > > > > > +
> > > > >
> > > > > This way shouldn't be correct, blk_mq_run_hw_queues() is still
> > > > > needed
> > > > for
> > > > > none because request may not be dispatched successfully by
> > > > > direct
> > issue.
> > > >
> > > > When block layer attempt posting request to h/w queue directly
> > > > (for
> > > > ioscheduler=none) and if it fails, it is calling
> > > > blk_mq_request_bypass_insert().
> > > > blk_mq_request_bypass_insert function will start the h/w queue
> > > > from submission context. Do we still have an issue if we skip
> > > > running hw queue from completion ?
> > >
> > > > The thing is that we can't guarantee that direct issue or adding
> > > > request into hctx->dispatch is always done for MQ/none, for example,
> > > > request still can be added to sw queue from blk_mq_flush_plug_list()
> > > > when mq plug is applied.
> >
> > I see even blk_mq_sched_insert_requests() from blk_mq_flush_plug_list
> > make sure it run the h/w queue. If all the submission path which deals
> > with s/w queue make sure they run h/w queue, can't we remove
> > blk_mq_run_hw_queues() from scsi_end_request ?
>
> No, one purpose of blk_mq_run_hw_queues() is for rerun queue in case that
> dispatch budget is running out of in submission path, and sdev->device_busy
> is shared by all hw queues on this scsi device.
>
> I posted one patch for avoiding it in scsi_end_request() before, looks it
> never lands upstream:
>
> https://lore.kernel.org/linux-block/20191118100640.3673-1-ming.lei@redhat.com/

Ming - I think the above patch will fix the performance issue on VD.
I fixed some hunk failures and ported it to the 5.8 kernel.
I am testing this patch on my setup. If you post V4, I will use that.

So far it looks good. I have reduced the device queue depth so that I hit
the budget-busy code path frequently.

Kashyap


>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-28  7:54                                           ` John Garry
@ 2020-07-28  8:45                                             ` Ming Lei
  2020-07-29  5:25                                               ` Kashyap Desai
  2020-08-04 17:00                                               ` John Garry
  0 siblings, 2 replies; 123+ messages in thread
From: Ming Lei @ 2020-07-28  8:45 UTC (permalink / raw)
  To: John Garry
  Cc: Kashyap Desai, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Tue, Jul 28, 2020 at 08:54:27AM +0100, John Garry wrote:
> On 24/07/2020 03:47, Ming Lei wrote:
> > On Thu, Jul 23, 2020 at 06:29:01PM +0100, John Garry wrote:
> > > > > As I see, since megaraid will have 1:1 mapping of CPU to hw queue, will
> > > > > there only ever possibly a single bit set in ctx_map? If so, it seems a
> > > > > waste to always check every sbitmap map. But adding logic for this may
> > > > > negate any possible gains.
> > > > 
> > > > It really depends on min and max cpu id in the map, then sbitmap
> > > > depth can be reduced to (max - min + 1). I'd suggest to double check that
> > > > cost of sbitmap_any_bit_set() really matters.
> > > 
> > > Hi Ming,
> > > 
> > > I'm not sure that reducing the search range would help much, as we still
> > > need to load some indexes of map[], and at best this may be reduced from 2/3
> > > -> 1 elements, depending on nr_cpus.
> > 
> > I believe you misunderstood my idea, and you have to think it from implementation
> > viewpoint.
> > 
> > The only workable way is to store the min cpu id as 'offset' and set the sbitmap
> > depth as (max - min + 1), isn't it? Then the actual cpu id can be figured out via
> > 'offset' + nr_bit. And the whole indexes are just spread on the actual depth. BTW,
> > max & min is the max / min cpu id in hctx->cpu_map. So we can improve not only on 1:1,
> > and I guess most of MQ cases can benefit from the change, since it shouldn't be usual
> > for one ctx_map to cover both 0 & nr_cpu_id - 1.
> > 
> > Meantime, we need to allocate the sbitmap dynamically.
> 
> OK, so dynamically allocating the sbitmap could be good. I was thinking
> previously that we still allocate for nr_cpus size, and search a limited
> range - but this would have heavier runtime overhead.
> 
> So if you really think that this may have some value, then let me know, so
> we can look to take it forward.

Forgot to mention, the in-tree code has been in this shape for a long
time; please see sbitmap_resize() called from blk_mq_map_swqueue().

Another update is that V4 of 'scsi: core: only re-run queue in scsi_end_request()
if device queue is busy' is quite hard to implement since commit b4fd63f42647110c9
("Revert "scsi: core: run queue if SCSI device queue isn't ready and queue is idle"").


Thanks,
Ming


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-28  8:45                                             ` Ming Lei
@ 2020-07-29  5:25                                               ` Kashyap Desai
  2020-07-29 15:36                                                 ` Ming Lei
  2020-08-04 17:00                                               ` John Garry
  1 sibling, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-07-29  5:25 UTC (permalink / raw)
  To: Ming Lei, John Garry
  Cc: axboe, jejb, martin.petersen, don.brace, Sumit Saxena,
	bvanassche, hare, hch, Shivasharan Srikanteshwara, linux-block,
	linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

> On Tue, Jul 28, 2020 at 08:54:27AM +0100, John Garry wrote:
> > On 24/07/2020 03:47, Ming Lei wrote:
> > > On Thu, Jul 23, 2020 at 06:29:01PM +0100, John Garry wrote:
> > > > > > As I see, since megaraid will have 1:1 mapping of CPU to hw
> > > > > > queue, will there only ever possibly a single bit set in
> > > > > > ctx_map? If so, it seems a waste to always check every sbitmap
> > > > > > map. But adding logic for this may negate any possible gains.
> > > > >
> > > > > It really depends on min and max cpu id in the map, then sbitmap
> > > > > depth can be reduced to (max - min + 1). I'd suggest to double
> > > > > check that cost of sbitmap_any_bit_set() really matters.
> > > >
> > > > Hi Ming,
> > > >
> > > > I'm not sure that reducing the search range would help much, as we
> > > > still need to load some indexes of map[], and at best this may be
> > > > reduced from 2/3
> > > > -> 1 elements, depending on nr_cpus.
> > >
> > > I believe you misunderstood my idea, and you have to think it from
> > > implementation viewpoint.
> > >
> > > The only workable way is to store the min cpu id as 'offset' and set
> > > the sbitmap depth as (max - min + 1), isn't it? Then the actual cpu
> > > id can be figured out via 'offset' + nr_bit. And the whole indexes
> > > are just spread on the actual depth. BTW, max & min is the max / min
> > > cpu id in hctx->cpu_map. So we can improve not only on 1:1, and I
> > > guess most of MQ cases can benefit from the change, since it
shouldn't be
> usual for one ctx_map to cover both 0 & nr_cpu_id - 1.
> > >
> > > Meantime, we need to allocate the sbitmap dynamically.
> >
> > OK, so dynamically allocating the sbitmap could be good. I was
> > thinking previously that we still allocate for nr_cpus size, and
> > search a limited range - but this would have heavier runtime overhead.
> >
> > So if you really think that this may have some value, then let me
> > know, so we can look to take it forward.
>
> Forget to mention, the in-tree code has been this shape for long time,
> please see sbitmap_resize() called from blk_mq_map_swqueue().
>
> Another update is that V4 of 'scsi: core: only re-run queue in
> scsi_end_request() if device queue is busy' is quite hard to implement
> since commit b4fd63f42647110c9 ("Revert "scsi: core: run queue if SCSI
> device queue isn't ready and queue is idle").

Ming -

Update from my testing: I found only one case of IO stall. I can discuss
this specific topic separately if you would like to send a separate patch.
There is too much interleaved discussion in this thread.

I noted you mentioned that V4 of 'scsi: core: only re-run queue in
scsi_end_request() if device queue is busy' needs the underlying support of
the "scsi: core: run queue if SCSI device queue isn't ready and queue is
idle" patch, which is already reverted in mainline.
The overall idea of running h/w queues conditionally in your patch "scsi:
core: only re-run queue in scsi_end_request" is still worthwhile. There can
be some races if we use this patch, and that is your concern. Am I
correct?

One of the races I found in my testing is fixed by the patch below -

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 54f9015..bcfd33a 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -173,8 +173,10 @@ static int blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
                if (!sbitmap_any_bit_set(&hctx->ctx_map))
                        break;

-               if (!blk_mq_get_dispatch_budget(hctx))
+               if (!blk_mq_get_dispatch_budget(hctx)) {
+                       blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
                        break;
+               }

                rq = blk_mq_dequeue_from_ctx(hctx, ctx);
                if (!rq) {


In my test setup, I have your V3 'scsi: core: only re-run queue in
scsi_end_request() if device queue is busy' rebased on 5.8, which does not
have "scsi: core: run queue if SCSI device queue isn't ready and queue is
idle" since it is already reverted in mainline.

I have 24 SAS SSDs and I reduced QD to 16 so that I hit budget contention
frequently. I am running with ioscheduler=none.
Suppose hctx0 has 16 IOs in flight (all of them posted to the h/w queue
directly). The next IO (the 17th) will see budget contention and will be
queued into the s/w queue.
The queue is expected to be kicked from scsi_end_request(). It is possible
that one of the IOs has completed and reduced sdev->device_busy, but has
not yet reached the code that kicks the h/w queue. Releasing the budget and
restarting the h/w queue are not atomic. At the same time, another IO (the
18th) from the submission path gets the budget and is posted via the path
below. This IO resets sdev->restart, which prevents the h/w queue from
being restarted from the completion path. This leaves one IO pending in the
s/w queue forever if no more new IOs arrive.
blk_mq_sched_insert_requests
	->blk_mq_try_issue_list_directly().
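
To make the interleaving above concrete, here is a minimal single-threaded
userspace model in C. It is only a sketch under stated assumptions:
struct toy_sdev and every toy_* function are hypothetical names and not
kernel APIs; the restart flag is assumed to be cleared whenever a
submission obtains budget (as described for the 18th IO); and one further
completion is assumed to have happened by the time the delayed run fires.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one SCSI device; field names loosely follow the thread. */
struct toy_sdev {
	int queue_depth;	/* budget limit (QD) */
	int device_busy;	/* budgets currently held */
	bool restart;		/* completion path should re-run the queue */
	int sw_queued;		/* requests parked in the s/w queue */
	bool delayed_run;	/* a delayed h/w queue run is pending */
};

static bool toy_get_budget(struct toy_sdev *s)
{
	if (s->device_busy >= s->queue_depth)
		return false;
	s->device_busy++;
	return true;
}

/* Submission: issue directly if budget is available, else park the request. */
static void toy_submit(struct toy_sdev *s, bool self_restart)
{
	if (toy_get_budget(s)) {
		s->restart = false;	/* the reset described for the 18th IO */
		return;
	}
	s->sw_queued++;
	s->restart = true;
	if (self_restart)
		s->delayed_run = true;	/* stand-in for a delayed queue run */
}

/* Run the h/w queue: drain parked requests while budget lasts. */
static void toy_run_queue(struct toy_sdev *s)
{
	while (s->sw_queued > 0 && toy_get_budget(s))
		s->sw_queued--;
}

/* Completion is split in two halves because the steps are not atomic. */
static void toy_complete_put_budget(struct toy_sdev *s)
{
	assert(s->device_busy > 0);
	s->device_busy--;
}

static void toy_complete_maybe_restart(struct toy_sdev *s)
{
	if (s->restart) {
		s->restart = false;
		toy_run_queue(s);
	}
}

/* Replays the interleaving; returns requests left stranded in the s/w queue. */
int toy_replay(bool self_restart)
{
	struct toy_sdev s = { .queue_depth = 16 };
	int i;

	for (i = 0; i < 16; i++)
		toy_submit(&s, self_restart);	/* IOs 1..16 issued directly */
	toy_submit(&s, self_restart);		/* 17th IO: no budget, parked */
	toy_complete_put_budget(&s);		/* one IO completes, budget freed */
	toy_submit(&s, self_restart);		/* 18th IO takes budget, clears restart */
	toy_complete_maybe_restart(&s);		/* completion sees restart == false */

	if (s.delayed_run) {			/* the delayed run fires later */
		s.delayed_run = false;
		toy_complete_put_budget(&s);	/* assumed: another IO has completed */
		toy_run_queue(&s);
	}
	return s.sw_queued;
}
```

In this model, toy_replay(false) leaves the 17th request stranded in the
s/w queue, while toy_replay(true), which schedules a delayed run when the
budget cannot be obtained, drains it.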

I found that the overall direction in recent block layer code is to
restart the h/w queue from the submission path in a delayed context,
making it less dependent on queue restarts from the completion path. Even
when the completion path restarts the queue, there are certain races in the
submission path that must be fixed by self-restarting the h/w queue.

If we catch all such instances in the submission path (as in my patch
posted above), we can still consider your patch
'scsi: core: only re-run queue in scsi_end_request() if device queue is
busy'.

Thanks, Kashyap

>
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-29  5:25                                               ` Kashyap Desai
@ 2020-07-29 15:36                                                 ` Ming Lei
  2020-07-29 18:31                                                   ` Kashyap Desai
  0 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-07-29 15:36 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Wed, Jul 29, 2020 at 10:55:22AM +0530, Kashyap Desai wrote:
> > On Tue, Jul 28, 2020 at 08:54:27AM +0100, John Garry wrote:
> > > On 24/07/2020 03:47, Ming Lei wrote:
> > > > On Thu, Jul 23, 2020 at 06:29:01PM +0100, John Garry wrote:
> > > > > > > As I see, since megaraid will have 1:1 mapping of CPU to hw
> > > > > > > queue, will there only ever possibly a single bit set in
> > > > > > > ctx_map? If so, it seems a waste to always check every sbitmap
> > > > > > > map. But adding logic for this may negate any possible gains.
> > > > > >
> > > > > > It really depends on min and max cpu id in the map, then sbitmap
> > > > > > depth can be reduced to (max - min + 1). I'd suggest to double
> > > > > > check that cost of sbitmap_any_bit_set() really matters.
> > > > >
> > > > > Hi Ming,
> > > > >
> > > > > I'm not sure that reducing the search range would help much, as we
> > > > > still need to load some indexes of map[], and at best this may be
> > > > > reduced from 2/3
> > > > > -> 1 elements, depending on nr_cpus.
> > > >
> > > > I believe you misunderstood my idea, and you have to think it from
> > > > implementation viewpoint.
> > > >
> > > > The only workable way is to store the min cpu id as 'offset' and set
> > > > the sbitmap depth as (max - min + 1), isn't it? Then the actual cpu
> > > > id can be figured out via 'offset' + nr_bit. And the whole indexes
> > > > are just spread on the actual depth. BTW, max & min is the max / min
> > > > cpu id in hctx->cpu_map. So we can improve not only on 1:1, and I
> > > > > guess most of MQ cases can benefit from the change, since it
> > > > > shouldn't be usual for one ctx_map to cover both 0 & nr_cpu_id - 1.
> > > >
> > > > Meantime, we need to allocate the sbitmap dynamically.
> > >
> > > OK, so dynamically allocating the sbitmap could be good. I was
> > > thinking previously that we still allocate for nr_cpus size, and
> > > search a limited range - but this would have heavier runtime overhead.
> > >
> > > So if you really think that this may have some value, then let me
> > > know, so we can look to take it forward.
> >
> > Forget to mention, the in-tree code has been this shape for long time,
> > please see sbitmap_resize() called from blk_mq_map_swqueue().
> >
> > Another update is that V4 of 'scsi: core: only re-run queue in
> > scsi_end_request() if device queue is busy' is quite hard to implement
> > since commit b4fd63f42647110c9 ("Revert "scsi: core: run queue if SCSI
> > device queue isn't ready and queue is idle").
> 
> Ming -
> 
> Update from my testing. I found only one case of IO stall. I can discuss
> this specific topic if you like to send separate patch. It is too much
> interleaved discussion in this thread.
> 
> I noted you mentioned that V4 of 'scsi: core: only re-run queue in
> scsi_end_request() if device queue is busy' need underlying support of
> "scsi: core: run queue if SCSI device queue isn't ready and queue is idle"
> patch which is already reverted in mainline.

Right.

> Overall idea of running h/w queues conditionally in your patch " scsi:
> core: only re-run queue in scsi_end_request" is still worth. There can be

I agree.

> some race if we use this patch and for that you have concern. Am I
> correct. ?

If the patch of "scsi: core: run queue if SCSI device queue isn't ready and
queue is idle" is re-added, the approach should work. However, it looks a bit
complicated, and I was wondering whether a simpler approach can be figured out.

> 
> One of the race I found in my testing is fixed by below patch -
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 54f9015..bcfd33a 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -173,8 +173,10 @@ static int blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
>                 if (!sbitmap_any_bit_set(&hctx->ctx_map))
>                         break;
> 
> -               if (!blk_mq_get_dispatch_budget(hctx))
> +               if (!blk_mq_get_dispatch_budget(hctx)) {
> +                       blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
>                         break;
> +               }

Actually all hw queues need to be run, not just this hctx, because
the budget is request-queue wide.

> 
>                 rq = blk_mq_dequeue_from_ctx(hctx, ctx);
>                 if (!rq) {
> 
> 
> In my test setup, I have your V3 'scsi: core: only re-run queue in
> scsi_end_request() if device queue is busy' rebased on 5.8 which does not
> have
> "scsi: core: run queue if SCSI device queue isn't ready and queue is idle"
> since it is already reverted in mainline.

If you added the above patch, I believe you can remove the run queue in
scsi_end_request() unconditionally. However, the delayed queue run may
degrade IO performance.

Actually, the queue re-run in scsi_end_request() is only for dequeuing
requests from the sw/scheduler queue. There is no such issue if the request
stays in the hctx->dispatch list.

> 
> I have 24 SAS SSD and I reduced QD = 16 so that I hit budget contention
> frequently.  I am running ioscheduler=none.
> If hctx0 has 16 IO inflight (All those IO will be posted to h/q queue
> directly). Next IO (17th IO) will see budget contention and it will be
> queued into s/w queue.
> It is expected that queue will be kicked from scsi_end_request. It is
> possible that one of the  IO completed and it reduce sdev->device_busy,
> but it has not yet reach the code which will kicked the h/w queue.
> Releasing budget and restarting h/w queue is not atomic. At the same time,
> another IO (18th IO) from submission path get the budget and it will be
> posted from below path. This IO will reset sdev->restart and it will not
> allow h/w queue to be restarted from completion path. This will lead one

Maybe a queue re-run is needed before resetting sdev->restart if
sdev->restart is 1.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-29 15:36                                                 ` Ming Lei
@ 2020-07-29 18:31                                                   ` Kashyap Desai
  2020-08-04  8:36                                                     ` Ming Lei
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-07-29 18:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> > >
> > > Another update is that V4 of 'scsi: core: only re-run queue in
> > > scsi_end_request() if device queue is busy' is quite hard to
> > > implement since
> > > commit b4fd63f42647110c9 ("Revert "scsi: core: run queue if SCSI
> > > device queue isn't ready and queue is idle").
> >
> > Ming -
> >
> > Update from my testing. I found only one case of IO stall. I can
> > discuss this specific topic if you like to send separate patch. It is
> > too much interleaved discussion in this thread.
> >
> > I noted you mentioned that V4 of 'scsi: core: only re-run queue in
> > scsi_end_request() if device queue is busy' need underlying support of
> > "scsi: core: run queue if SCSI device queue isn't ready and queue is
> > idle" patch which is already reverted in mainline.
>
> Right.
>
> > Overall idea of running h/w queues conditionally in your patch " scsi:
> > core: only re-run queue in scsi_end_request" is still worth. There can
> > be
>
> I agree.
>
> > some race if we use this patch and for that you have concern. Am I
> > correct. ?
>
> If the patch of "scsi: core: run queue if SCSI device queue isn't ready
> and queue is idle" is re-added, the approach should work.
I could not find an issue in "scsi: core: only re-run queue in
scsi_end_request" even though the above-mentioned patch is reverted.
There may be some corner cases/race conditions in the submission path, which
can be fixed by self-restarting the h/w queue.

> However, it looks a bit
> complicated, and I was thinking if one simpler approach can be figured
> out.

I thought your original approach was simple, but if you have another
simple approach in mind, I can test it as part of this series.
BTW, I still do not understand why you think your original approach is not
a good design.

>
> >
> > One of the race I found in my testing is fixed by below patch -
> >
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 54f9015..bcfd33a 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -173,8 +173,10 @@ static int blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
> >                 if (!sbitmap_any_bit_set(&hctx->ctx_map))
> >                         break;
> >
> > -               if (!blk_mq_get_dispatch_budget(hctx))
> > +               if (!blk_mq_get_dispatch_budget(hctx)) {
> > +                       blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
> >                         break;
> > +               }
>
> Actually all hw queues need to be run, instead of this hctx, cause the
> budget stuff is request queue wide.


OK. But I thought each hctx would see the issue independently if it is
active, and would restart its own hctx queue.
BTW, do you think the above handling in the block layer code makes sense
irrespective of the current h/w queue restart logic, or is it only relevant
in relation to that logic?

>
> >
> >                 rq = blk_mq_dequeue_from_ctx(hctx, ctx);
> >                 if (!rq) {
> >
> >
> > In my test setup, I have your V3 'scsi: core: only re-run queue in
> > scsi_end_request() if device queue is busy' rebased on 5.8 which does
> > not have
> > "scsi: core: run queue if SCSI device queue isn't ready and queue is
> > idle" since it is already reverted in mainline.
>
> If you added the above patch, I believe you can remove the run queue in
> scsi_end_request() unconditionally. However, the delay run queue may
> degrade io performance.

Understood. But that performance issue is due to budget contention and
may impact some old HBAs (with lower queue depth) or emulated HBAs.
That is why I thought your patch for a conditional h/w queue run from the
completion path would improve performance.

>
> Actually the re-run queue in scsi_end_request() is only for dequeuing
> request from sw/scheduler queue. And no such issue if request stays in
> hctx->dispatch list.

I was not aware of this. Thanks for the info. I will review the code for my
own understanding.
>
> >
> > I have 24 SAS SSD and I reduced QD = 16 so that I hit budget
> > contention frequently.  I am running ioscheduler=none.
> > If hctx0 has 16 IO inflight (All those IO will be posted to h/q queue
> > directly). Next IO (17th IO) will see budget contention and it will be
> > queued into s/w queue.
> > It is expected that queue will be kicked from scsi_end_request. It is
> > possible that one of the  IO completed and it reduce
> > sdev->device_busy, but it has not yet reach the code which will kicked
> > the h/w queue.
> > Releasing budget and restarting h/w queue is not atomic. At the same
> > time, another IO (18th IO) from submission path get the budget and it
> > will be posted from below path. This IO will reset sdev->restart and
> > it will not allow h/w queue to be restarted from completion path. This
> > will lead one
>
> Maybe re-run queue is needed before resetting sdev->restart if
> sdev->restart is 1.

Agree.

>
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-07-14  7:52       ` John Garry
  2020-07-14  8:06         ` Ming Lei
@ 2020-08-03 20:39         ` Don.Brace
  2020-08-04  9:27           ` John Garry
  1 sibling, 1 reply; 123+ messages in thread
From: Don.Brace @ 2020-08-03 20:39 UTC (permalink / raw)
  To: john.garry, hare, don.brace
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

Subject: Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ

>>
>> Hi Don,
>>
>> I am preparing the next iteration of this series, and we're getting 
>> close to dropping the RFC tags. The series has grown a bit, and I am 
>> not sure what to do with hpsa support.
>>
>> The latest versions of this series have not been tested for hpsa, AFAIK.

>>v7 is here:

>>https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7

>>So that should be good to test with for now.

cloned https://github.com/hisilicon/kernel-dev
	branch origin/private-topic-blk-mq-shared-tags-rfc-v7

The driver did not load, so I cherry-picked from 

git://git.kernel.org/pub/scm/linux/kernel/git/hare/scsi-devel.git 
	branch origin/reserved-tags.v6

the following patches:
6a9d1a96ea41 hpsa: move hpsa_hba_inquiry after scsi_add_host()
eeb5cd1fca58 hpsa: use reserved commands
7df7d8382902 hpsa: use scsi_host_busy_iter() to traverse outstanding commands
485881d6d8dc hpsa: drop refcount field from CommandList
c4980ad5e5cb scsi: implement reserved command handling
34d03fa945c0 scsi: add scsi_{get,put}_internal_cmd() helper
4556e50450c8 block: add flag for internal commands
138125f74b25 scsi: hpsa: Lift {BIG_,}IOCTL_Command_struct copy{in,out} into hpsa_ioctl()
cb17c1b69b17 scsi: hpsa: Don't bother with vmalloc for BIG_IOCTL_Command_struct
10100ffd5f65 scsi: hpsa: Get rid of compat_alloc_user_space()
06b43f968db5 scsi: hpsa: hpsa_ioctl(): Tidy up a bit

The driver loads and I ran some mke2fs, mount/umount tests,
but I am getting extra devices in the list which do not
seem to be coming from the hpsa driver.

I have not yet had time to diagnose this issue.

lsscsi
[1:0:0:0]    disk    ASMT     2105             0     /dev/sdi 
[14:0:-1:0]  type??? nullnull nullnullnullnull null  -        
[14:0:0:0]   storage HP       H240             7.19  -        
[14:0:1:0]   disk    ATA      MB002000GWFGH    HPG0  /dev/sda 
[14:0:2:0]   disk    ATA      MB002000GWFGH    HPG0  /dev/sdb 
[14:0:3:0]   disk    HP       EF0450FARMV      HPD5  /dev/sdc 
[14:0:4:0]   disk    ATA      MB002000GWFGH    HPG0  /dev/sdd 
[14:0:5:0]   disk    ATA      MB002000GWFGH    HPG0  /dev/sde 
[14:0:6:0]   disk    HP       EF0450FARMV      HPD5  /dev/sdf 
[14:0:7:0]   disk    ATA      VB0250EAVER      HPG7  /dev/sdg 
[14:0:8:0]   disk    ATA      MB0500GCEHF      HPGC  /dev/sdh 
[14:0:9:0]   enclosu HP       H240             7.19  -        
[15:0:-1:0]  type??? nullnull nullnullnullnull null  -        
[15:0:0:0]   storage HP       P440             7.19  -        
[15:1:0:0]   disk    HP       LOGICAL VOLUME   7.19  /dev/sdj 
[15:1:0:1]   disk    HP       LOGICAL VOLUME   7.19  /dev/sdk 
[15:1:0:2]   disk    HP       LOGICAL VOLUME   7.19  /dev/sdl 
[15:1:0:3]   disk    HP       LOGICAL VOLUME   7.19  /dev/sdm 
[16:0:-1:0]  type??? nullnull nullnullnullnull null  -        
[16:0:0:0]   storage HP       P441             7.19  -        




^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-29 18:31                                                   ` Kashyap Desai
@ 2020-08-04  8:36                                                     ` Ming Lei
  2020-08-04  9:27                                                       ` Kashyap Desai
  0 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-08-04  8:36 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Thu, Jul 30, 2020 at 12:01:22AM +0530, Kashyap Desai wrote:
> > > >
> > > > Another update is that V4 of 'scsi: core: only re-run queue in
> > > > scsi_end_request() if device queue is busy' is quite hard to
> > > > implement since
> > > > commit b4fd63f42647110c9 ("Revert "scsi: core: run queue if SCSI
> > > > device queue isn't ready and queue is idle").
> > >
> > > Ming -
> > >
> > > Update from my testing. I found only one case of IO stall. I can
> > > discuss this specific topic if you like to send separate patch. It is
> > > too much interleaved discussion in this thread.
> > >
> > > I noted you mentioned that V4 of 'scsi: core: only re-run queue in
> > > scsi_end_request() if device queue is busy' need underlying support of
> > > "scsi: core: run queue if SCSI device queue isn't ready and queue is
> > > idle" patch which is already reverted in mainline.
> >
> > Right.
> >
> > > Overall idea of running h/w queues conditionally in your patch " scsi:
> > > core: only re-run queue in scsi_end_request" is still worth. There can
> > > be
> >
> > I agree.
> >
> > > some race if we use this patch and for that you have concern. Am I
> > > correct. ?
> >
> > If the patch of "scsi: core: run queue if SCSI device queue isn't ready
> > and queue is idle" is re-added, the approach should work.
> I could not find issue in " scsi: core: only re-run queue in
> scsi_end_request" even though above mentioned patch is reverted.
> There may be some corner cases/race condition in submission path which can
> be fixed doing self-restart of h/w queue.

It is because two corner cases are handled in other ways:

    Now that we have the patches ("blk-mq: In blk_mq_dispatch_rq_list()
    "no budget" is a reason to kick") and ("blk-mq: Rerun dispatching in
    the case of budget contention") we should no longer need the fix in
    the SCSI code.  Revert it, resolving conflicts with other patches that
    have touched this code.

> 
> > However, it looks a bit
> > complicated, and I was thinking if one simpler approach can be figured
> > out.
> 
> I was thinking your original approach is simple, but if you think some
> other simple approach I can test as part of these series.
> BTW, I am still not getting why you think your original approach is not
> good design.

It is still not straightforward enough or simple enough for proving its
correctness, even though the implementation isn't complicated.

> 
> >
> > >
> > > One of the race I found in my testing is fixed by below patch -
> > >
> > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > index 54f9015..bcfd33a 100644
> > > --- a/block/blk-mq-sched.c
> > > +++ b/block/blk-mq-sched.c
> > > @@ -173,8 +173,10 @@ static int blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
> > >                 if (!sbitmap_any_bit_set(&hctx->ctx_map))
> > >                         break;
> > >
> > > -               if (!blk_mq_get_dispatch_budget(hctx))
> > > +               if (!blk_mq_get_dispatch_budget(hctx)) {
> > > +                       blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
> > >                         break;
> > > +               }
> >
> > Actually all hw queues need to be run, instead of this hctx, cause the
> > budget stuff is request queue wide.
> 
> 
> OK. But I thought all the hctx will see issue independently, if they are
> active and they will restart its own hctx queue.
> BTW, do you think above handling in block layer code make sense
> irrespective of current h/w queue restart logic OR it is just relative
> stuffs ?

You are right, it is correct to just run this hctx.

> 
> >
> > >
> > >                 rq = blk_mq_dequeue_from_ctx(hctx, ctx);
> > >                 if (!rq) {
> > >
> > >
> > > In my test setup, I have your V3 'scsi: core: only re-run queue in
> > > scsi_end_request() if device queue is busy' rebased on 5.8 which does
> > > not have
> > > "scsi: core: run queue if SCSI device queue isn't ready and queue is
> > > idle" since it is already reverted in mainline.
> >
> > If you added the above patch, I believe you can remove the run queue in
> > scsi_end_request() unconditionally. However, the delay run queue may
> > degrade io performance.
> 
> I understood.  But that performance issue is due to budget contention and
> may impact some old HBA(less queue depth) or emulation HBA.

Your patch for delayed running of the hw queue introduces a delay after a
request completes, whereas the queue should be run immediately after a
request finishes.

> That is why I thought your patch of conditional h/w run from completion
> would improve performance.

Yeah, we all think that is the correct thing to do; the problem now is
how to run the hw queue only in case of budget contention.


thanks, 
Ming


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-08-03 20:39         ` Don.Brace
@ 2020-08-04  9:27           ` John Garry
  2020-08-04 15:18             ` Don.Brace
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-08-04  9:27 UTC (permalink / raw)
  To: Don.Brace, hare, don.brace
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 03/08/2020 21:39, Don.Brace@microchip.com wrote:

Hi Don,

>>> So that should be good to test with for now.
> cloned https://github.com/hisilicon/kernel-dev
> 	branch origin/private-topic-blk-mq-shared-tags-rfc-v7
> 
> The driver did not load, so I cherry-picked from
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/hare/scsi-devel.git
> 	branch origin/reserved-tags.v6
> 
> the following patches:
> 6a9d1a96ea41 hpsa: move hpsa_hba_inquiry after scsi_add_host()
> eeb5cd1fca58 hpsa: use reserved commands
> 7df7d8382902 hpsa: use scsi_host_busy_iter() to traverse outstanding commands
> 485881d6d8dc hpsa: drop refcount field from CommandList
> c4980ad5e5cb scsi: implement reserved command handling
> 34d03fa945c0 scsi: add scsi_{get,put}_internal_cmd() helper
> 4556e50450c8 block: add flag for internal commands
> 138125f74b25 scsi: hpsa: Lift {BIG_,}IOCTL_Command_struct copy{in,out} into hpsa_ioctl()
> cb17c1b69b17 scsi: hpsa: Don't bother with vmalloc for BIG_IOCTL_Command_struct
> 10100ffd5f65 scsi: hpsa: Get rid of compat_alloc_user_space()
> 06b43f968db5 scsi: hpsa: hpsa_ioctl(): Tidy up a bit
> 
> The driver loads and I ran some mke2fs, mount/umount tests,

ok, great

> but I am getting an extra devices in the list which does not
> seem to be coming from hpsa driver.
> 
> I have not yet had time to diagnose this issue.
> 
> lsscsi
> [1:0:0:0]    disk    ASMT     2105             0     /dev/sdi
> [14:0:-1:0]  type??? nullnull nullnullnullnull null  -
> [14:0:0:0]   storage HP       H240             7.19  -
> [14:0:1:0]   disk    ATA      MB002000GWFGH    HPG0  /dev/sda
> [14:0:2:0]   disk    ATA      MB002000GWFGH    HPG0  /dev/sdb
> [14:0:3:0]   disk    HP       EF0450FARMV      HPD5  /dev/sdc
> [14:0:4:0]   disk    ATA      MB002000GWFGH    HPG0  /dev/sdd
> [14:0:5:0]   disk    ATA      MB002000GWFGH    HPG0  /dev/sde
> [14:0:6:0]   disk    HP       EF0450FARMV      HPD5  /dev/sdf
> [14:0:7:0]   disk    ATA      VB0250EAVER      HPG7  /dev/sdg
> [14:0:8:0]   disk    ATA      MB0500GCEHF      HPGC  /dev/sdh
> [14:0:9:0]   enclosu HP       H240             7.19  -
> [15:0:-1:0]  type??? nullnull nullnullnullnull null  -
> [15:0:0:0]   storage HP       P440             7.19  -
> [15:1:0:0]   disk    HP       LOGICAL VOLUME   7.19  /dev/sdj
> [15:1:0:1]   disk    HP       LOGICAL VOLUME   7.19  /dev/sdk
> [15:1:0:2]   disk    HP       LOGICAL VOLUME   7.19  /dev/sdl
> [15:1:0:3]   disk    HP       LOGICAL VOLUME   7.19  /dev/sdm
> [16:0:-1:0]  type??? nullnull nullnullnullnull null  -
> [16:0:0:0]   storage HP       P441             7.19  -
> 
> 

I assume that you are missing some other patches from that branch, like 
these:

77dcb92c31ae scsi: revamp host device handling
6e9884aefe66 scsi: Use dummy inquiry data for the host device
a381637f8a6e scsi: use real inquiry data when initialising devices

@Hannes, Any plans to get this series going again?

Thanks,
John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-08-04  8:36                                                     ` Ming Lei
@ 2020-08-04  9:27                                                       ` Kashyap Desai
  2020-08-05  8:40                                                         ` Ming Lei
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-08-04  9:27 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> >
> > > However, it looks a bit
> > > complicated, and I was thinking if one simpler approach can be
> > > figured
> > out.
> >
> > I was thinking your original approach is simple, but if you think some
> > other simple approach I can test as part of these series.
> > BTW, I am still not getting why you think your original approach is
> > not good design.
>
> It is still not straightforward enough or simple enough for proving its
> correctness, even though the implementation isn't complicated.

Ming -

I noted your comments.

I have completed testing, and this latest performance issue on Volume is
still outstanding.
Currently there is a 20-25% drop in IOPS, and we want that closed before
shared host tags are enabled for the megaraid_sas driver.
Just for my understanding: what will be the next steps on this?

I can validate any new approach/patch for this issue.

Kashyap

>
> >
> > >
> > > >

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-08-04  9:27           ` John Garry
@ 2020-08-04 15:18             ` Don.Brace
  2020-08-05 11:21               ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Don.Brace @ 2020-08-04 15:18 UTC (permalink / raw)
  To: john.garry, hare, don.brace
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

Subject: Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ

Hi Don,

>>> So that should be good to test with for now.
> cloned https://github.com/hisilicon/kernel-dev
>       branch origin/private-topic-blk-mq-shared-tags-rfc-v7
>
> The driver did not load, so I cherry-picked from
>
> git://git.kernel.org/pub/scm/linux/kernel/git/hare/scsi-devel.git
>       branch origin/reserved-tags.v6
ok, great

> but I am getting an extra devices in the list which does not seem to 
> be coming from hpsa driver.
>
>>I assume that you are missing some other patches from that branch, like
>>these:

>>77dcb92c31ae scsi: revamp host device handling
>>6e9884aefe66 scsi: Use dummy inquiry data for the host device
>>a381637f8a6e scsi: use real inquiry data when initialising devices

>>@Hannes, Any plans to get this series going again?

I cherry-picked the following and this resolves the issue.
77dcb92c31ae scsi: revamp host device handling
6e9884aefe66 scsi: Use dummy inquiry data for the host device
a381637f8a6e scsi: use real inquiry data when initialising devices
I'll continue with more I/O stress testing.

Thanks for the patch suggestions,
Don

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-07-28  8:45                                             ` Ming Lei
  2020-07-29  5:25                                               ` Kashyap Desai
@ 2020-08-04 17:00                                               ` John Garry
  2020-08-05  2:56                                                 ` Ming Lei
  1 sibling, 1 reply; 123+ messages in thread
From: John Garry @ 2020-08-04 17:00 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kashyap Desai, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On 28/07/2020 09:45, Ming Lei wrote:
>> OK, so dynamically allocating the sbitmap could be good. I was thinking
>> previously that we still allocate for nr_cpus size, and search a limited
>> range - but this would have heavier runtime overhead.
>>
>> So if you really think that this may have some value, then let me know, so
>> we can look to take it forward.

Hi Ming,

> Forget to mention, the in-tree code has been this shape for long
> time, please see sbitmap_resize() called from blk_mq_map_swqueue().

So after the resize, even if we are only checking a single word and a 
few bits within that word, we still need 2x 64b loads - 1x for .word and 
1x for .cleared. Seems a bit inefficient for caching when we have a 1:1 
mapping or similar. For 1:1 case only, how about a ctx_map per queue for 
all hctx, with a single bit per hctx? I do realize that it makes the 
code more complicated, but it could be more efficient.

Another thing to consider is that for ctx_map, we don't do deferred bit 
clear, so we don't ever really need to check .cleared there. I think.
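
As a rough illustration of the two points above, here is a userspace
sketch in C. It is not the kernel's definition: struct toy_sbitmap_word
and both toy_* functions are hypothetical names, and the struct keeps only
the two fields discussed here (.word and .cleared). The first function
models a check that must honour deferred clears and so reads both fields
per word; the second assumes deferred clear is never used (as suggested
for ctx_map) and reads only .word.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for one sbitmap word; not the kernel's layout. */
struct toy_sbitmap_word {
	unsigned long word;	/* bits that have been set */
	unsigned long cleared;	/* bits whose clear has been deferred */
};

/* General check: a bit counts only if set and not pending a deferred clear. */
bool toy_any_bit_set(const struct toy_sbitmap_word *map, size_t map_nr)
{
	size_t i;

	assert(map != NULL || map_nr == 0);
	for (i = 0; i < map_nr; i++) {
		if (map[i].word & ~map[i].cleared)	/* two loads per word */
			return true;
	}
	return false;
}

/* Variant that assumes deferred clear is never used on this map. */
bool toy_any_bit_set_no_deferred(const struct toy_sbitmap_word *map,
				 size_t map_nr)
{
	size_t i;

	assert(map != NULL || map_nr == 0);
	for (i = 0; i < map_nr; i++) {
		if (map[i].word)			/* one load per word */
			return true;
	}
	return false;
}
```

The two functions only agree when .cleared is always zero, which is the
assumption that would need to hold for ctx_map before dropping the second
load.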

Thanks,
John

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-08-04 17:00                                               ` John Garry
@ 2020-08-05  2:56                                                 ` Ming Lei
  0 siblings, 0 replies; 123+ messages in thread
From: Ming Lei @ 2020-08-05  2:56 UTC (permalink / raw)
  To: John Garry
  Cc: Kashyap Desai, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Tue, Aug 04, 2020 at 06:00:52PM +0100, John Garry wrote:
> On 28/07/2020 09:45, Ming Lei wrote:
> > > OK, so dynamically allocating the sbitmap could be good. I was thinking
> > > previously that we still allocate for nr_cpus size, and search a limited
> > > range - but this would have heavier runtime overhead.
> > > 
> > > So if you really think that this may have some value, then let me know, so
> > > we can look to take it forward.
> 
> Hi Ming,
> 
> > Forget to mention, the in-tree code has been this shape for long
> > time, please see sbitmap_resize() called from blk_mq_map_swqueue().
> 
> So after the resize, even if we are only checking a single word and a few
> bits within that word, we still need 2x 64b loads - 1x for .word and 1x for
> .cleared. Seems a bit inefficient for caching when we have a 1:1 mapping or
> similar. For 1:1 case only, how about a ctx_map per queue for all hctx, with
> a single bit per hctx? I do realize that it makes the code more complicated,
> but it could be more efficient.

IMO, the cost of accessing one bit and one word is basically the same.

> 
> Another thing to consider is that for ctx_map, we don't do deferred bit
> clear, so we don't ever really need to check .cleared there. I think.

That looks true.

However, in the MQ case the normal code path is direct issue, so I am not
sure we need this kind of optimization.

BTW, whether or not hostwide tags are used, the problem is that
scsi_end_request() always runs the queue. As we discussed, re-running the
queue is only needed in case of budget contention.


Thanks,
Ming



* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-08-04  9:27                                                       ` Kashyap Desai
@ 2020-08-05  8:40                                                         ` Ming Lei
  2020-08-06 10:25                                                           ` Kashyap Desai
  0 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-08-05  8:40 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Tue, Aug 04, 2020 at 02:57:52PM +0530, Kashyap Desai wrote:
> > >
> > > > However, it looks a bit
> > > > complicated, and I was thinking if one simpler approach can be
> > > > figured
> > > out.
> > >
> > > I was thinking your original approach is simple, but if you think some
> > > other simple approach I can test as part of these series.
> > > BTW, I am still not getting why you think your original approach is
> > > not good design.
> >
> > It is still not straightforward enough or simple enough for proving its
> > correctness, even though the implementation isn't complicated.
> 
> Ming -
> 
> I noted your comments.
> 
> I have completed testing and this particular latest performance issue on
> Volume is outstanding.
> Currently it is 20-25% performance drop in IOPs and we want that to be
> closed before shared host tag is enabled for <megaraid_sas> driver.
> Just for my understanding - What will be the next steps on this ?
> 
> I can validate any new approach/patch for this issue.
> 

Hello,

What do you think of the following patch?

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c866a4f33871..49f0fc5c7a63 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -552,8 +552,24 @@ static void scsi_run_queue_async(struct scsi_device *sdev)
 	if (scsi_target(sdev)->single_lun ||
 	    !list_empty(&sdev->host->starved_list))
 		kblockd_schedule_work(&sdev->requeue_work);
-	else
-		blk_mq_run_hw_queues(sdev->request_queue, true);
+	else {
+		/*
+		 * smp_mb() implied in either rq->end_io or blk_mq_free_request
+		 * is for ordering writing .device_busy in scsi_device_unbusy()
+		 * and reading sdev->restarts.
+		 */
+		int old = atomic_read(&sdev->restarts);
+
+		if (old) {
+			blk_mq_run_hw_queues(sdev->request_queue, true);
+
+			/*
+			 * ->restarts has to be kept as non-zero if new
+			 * budget contention comes.
+			 */
+			atomic_cmpxchg(&sdev->restarts, old, 0);
+		}
+	}
 }
 
 /* Returns false when no more bytes to process, true if there are more */
@@ -1612,8 +1628,34 @@ static void scsi_mq_put_budget(struct request_queue *q)
 static bool scsi_mq_get_budget(struct request_queue *q)
 {
 	struct scsi_device *sdev = q->queuedata;
+	int ret = scsi_dev_queue_ready(q, sdev);
 
-	return scsi_dev_queue_ready(q, sdev);
+	if (ret)
+		return true;
+
+	/*
+	 * If all in-flight requests originated from this LUN are completed
+	 * before setting .restarts, sdev->device_busy will be observed as
+	 * zero, then blk_mq_delay_run_hw_queues() will dispatch this request
+	 * soon. Otherwise, completion of one of these requests will observe
+	 * the .restarts flag, and the request queue will be run for handling
+	 * this request, see scsi_end_request().
+	 */
+	atomic_inc(&sdev->restarts);
+
+	/*
+	 * Order writing .restarts and reading .device_busy, and make sure
+	 * .restarts is visible to scsi_end_request(). Its pair is implied by
+	 * __blk_mq_end_request() in scsi_end_request() for ordering
+	 * writing .device_busy in scsi_device_unbusy() and reading .restarts.
+	 *
+	 */
+	smp_mb__after_atomic();
+
+	if (unlikely(atomic_read(&sdev->device_busy) == 0 &&
+				!scsi_device_blocked(sdev)))
+		blk_mq_delay_run_hw_queues(sdev->request_queue, SCSI_QUEUE_DELAY);
+	return false;
 }
 
 static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index bc5909033d13..1a5c9a3df6d6 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -109,6 +109,7 @@ struct scsi_device {
 	atomic_t device_busy;		/* commands actually active on LLDD */
 	atomic_t device_blocked;	/* Device returned QUEUE_FULL. */
 
+	atomic_t restarts;
 	spinlock_t list_lock;
 	struct list_head starved_entry;
 	unsigned short queue_depth;	/* How deep of a queue we want */
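
For anyone who wants to play with the idea outside the kernel, here is a
rough userspace model of the handshake in the patch above. It is a sketch
under stated assumptions, not the kernel code: every name (model_sdev,
model_get_budget, model_complete, queue_runs) is hypothetical, C11
seq_cst atomics stand in for the kernel atomics plus barriers, and
"running the queue" is reduced to bumping a counter.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the few scsi_device fields involved. */
struct model_sdev {
	atomic_int device_busy; /* commands in flight */
	atomic_int restarts;    /* non-zero: someone failed to get budget */
	int queue_depth;
	int queue_runs;         /* how many times the "hw queue" was kicked */
};

static void model_init(struct model_sdev *d, int depth)
{
	atomic_init(&d->device_busy, 0);
	atomic_init(&d->restarts, 0);
	d->queue_depth = depth;
	d->queue_runs = 0;
}

/* Submission side: try to take budget, flag contention on failure. */
static bool model_get_budget(struct model_sdev *d)
{
	if (atomic_fetch_add(&d->device_busy, 1) + 1 <= d->queue_depth)
		return true;
	atomic_fetch_sub(&d->device_busy, 1);

	atomic_fetch_add(&d->restarts, 1);
	/* seq_cst ordering: the restarts write is visible before this read */
	if (atomic_load(&d->device_busy) == 0)
		d->queue_runs++; /* nothing in flight, kick the queue ourselves */
	return false;
}

/* Completion side: only re-run the queue if contention was flagged. */
static void model_complete(struct model_sdev *d)
{
	int old;

	atomic_fetch_sub(&d->device_busy, 1);
	old = atomic_load(&d->restarts);
	if (old) {
		d->queue_runs++;
		/* keep ->restarts non-zero if new contention arrived meanwhile */
		atomic_compare_exchange_strong(&d->restarts, &old, 0);
	}
}
```

With queue_depth = 1, a second model_get_budget() fails and sets restarts,
and the following model_complete() is the one that kicks the queue; a
completion with no contention pending does not run the queue at all.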

Thanks, 
Ming



* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-08-04 15:18             ` Don.Brace
@ 2020-08-05 11:21               ` John Garry
  2020-08-14 21:04                 ` Don.Brace
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-08-05 11:21 UTC (permalink / raw)
  To: Don.Brace, hare, don.brace
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 04/08/2020 16:18, Don.Brace@microchip.com wrote:
>> git://git.kernel.org/pub/scm/linux/kernel/git/hare/scsi-devel.git
>>        branch origin/reserved-tags.v6
> ok, great
> 
>> but I am getting an extra devices in the list which does not seem to
>> be coming from hpsa driver.
>>
>>> I assume that you are missing some other patches from that branch, like
>>> these:
>>> 77dcb92c31ae scsi: revamp host device handling
>>> 6e9884aefe66 scsi: Use dummy inquiry data for the host device
>>> a381637f8a6e scsi: use real inquiry data when initialising devices
>>> @Hannes, Any plans to get this series going again?
> I cherry-picked the following and this resolves the issue.
> 77dcb92c31ae scsi: revamp host device handling
> 6e9884aefe66 scsi: Use dummy inquiry data for the host device
> a381637f8a6e scsi: use real inquiry data when initialising devices
> I'll continue with more I/O stress testing.

ok, great. Please let me know about your testing, as I might just add 
that series to the v8 branch.

Cheers,
John


* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-08-05  8:40                                                         ` Ming Lei
@ 2020-08-06 10:25                                                           ` Kashyap Desai
  2020-08-06 13:38                                                             ` Ming Lei
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-08-06 10:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> > Ming -
> >
> > I noted your comments.
> >
> > I have completed testing and this particular latest performance issue
> > on Volume is outstanding.
> > Currently it is 20-25% performance drop in IOPs and we want that to be
> > closed before shared host tag is enabled for <megaraid_sas> driver.
> > Just for my understanding - What will be the next steps on this ?
> >
> > I can validate any new approach/patch for this issue.
> >
>
> Hello,
>
> What do you think of the following patch?

I tested this patch. I still see an IO hang.

>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index
> c866a4f33871..49f0fc5c7a63 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -552,8 +552,24 @@ static void scsi_run_queue_async(struct scsi_device
> *sdev)
>  	if (scsi_target(sdev)->single_lun ||
>  	    !list_empty(&sdev->host->starved_list))
>  		kblockd_schedule_work(&sdev->requeue_work);
> -	else
> -		blk_mq_run_hw_queues(sdev->request_queue, true);
> +	else {
> +		/*
> +		 * smp_mb() implied in either rq->end_io or
> blk_mq_free_request
> +		 * is for ordering writing .device_busy in
scsi_device_unbusy()
> +		 * and reading sdev->restarts.
> +		 */
> +		int old = atomic_read(&sdev->restarts);
> +
> +		if (old) {
> +			blk_mq_run_hw_queues(sdev->request_queue, true);
> +
> +			/*
> +			 * ->restarts has to be kept as non-zero if there
is
> +			 *  new budget contention comes.
> +			 */
> +			atomic_cmpxchg(&sdev->restarts, old, 0);
> +		}
> +	}
>  }
>
>  /* Returns false when no more bytes to process, true if there are more
*/
> @@ -1612,8 +1628,34 @@ static void scsi_mq_put_budget(struct
> request_queue *q)  static bool scsi_mq_get_budget(struct request_queue
*q)
> {
>  	struct scsi_device *sdev = q->queuedata;
> +	int ret = scsi_dev_queue_ready(q, sdev);
>
> -	return scsi_dev_queue_ready(q, sdev);
> +	if (ret)
> +		return true;
> +
> +	/*
> +	 * If all in-flight requests originated from this LUN are
completed
> +	 * before setting .restarts, sdev->device_busy will be observed as
> +	 * zero, then blk_mq_delay_run_hw_queue() will dispatch this
request
> +	 * soon. Otherwise, completion of one of these request will
observe
> +	 * the .restarts flag, and the request queue will be run for
handling
> +	 * this request, see scsi_end_request().
> +	 */
> +	atomic_inc(&sdev->restarts);
> +
> +	/*
> +	 * Order writing .restarts and reading .device_busy, and make sure
> +	 * .restarts is visible to scsi_end_request(). Its pair is implied
by
> +	 * __blk_mq_end_request() in scsi_end_request() for ordering
> +	 * writing .device_busy in scsi_device_unbusy() and reading
.restarts.
> +	 *
> +	 */
> +	smp_mb__after_atomic();
> +
> +	if (unlikely(atomic_read(&sdev->device_busy) == 0 &&
> +				!scsi_device_blocked(sdev)))
> +		blk_mq_delay_run_hw_queues(sdev->request_queue,
> SCSI_QUEUE_DELAY);

Hi Ming -

There is still a race which is not handled. Take the case of an IO that is
not able to get budget and has already set the <restarts> flag.
The <restarts> flag will be seen as non-zero in the completion path, and the
completion path will attempt a h/w queue run (but this particular IO is
still not in the s/w queue).
The attempt to run the h/w queue from the completion path will not flush any
IO, since there is no IO in the s/w queue.

I think the above code was added on the assumption that it handles this
particular case, but it does not help either. If some IO is submitted
directly to the h/w queue in between, then sdev->device_busy will be
non-zero.

If I move the above section of the code into the completion path, the IO hang
is resolved. I also verified performance -

Multi Drive R0, 1 worker per VD: 662K IOPs prior to this patch, now scaling
to 1.1M IOPs (90% improvement).
Multi Drive R0, 4 workers per VD: 1.9M IOPs prior to this patch, now scaling
to 3.1M IOPs (50% improvement).

Here is the modified patch -

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6f50e5c..dcdc5f6 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -594,8 +594,26 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
        if (scsi_target(sdev)->single_lun ||
            !list_empty(&sdev->host->starved_list))
                kblockd_schedule_work(&sdev->requeue_work);
-       else
-               blk_mq_run_hw_queues(q, true);
+       else {
+               /*
+                * smp_mb() implied in either rq->end_io or blk_mq_free_request
+                * is for ordering writing .device_busy in scsi_device_unbusy()
+                * and reading sdev->restarts.
+                */
+               int old = atomic_read(&sdev->restarts);
+
+               if (old) {
+                       blk_mq_run_hw_queues(sdev->request_queue, true);
+
+                       /*
+                        * ->restarts has to be kept as non-zero if there is
+                        *  new budget contention comes.
+                        */
+                       atomic_cmpxchg(&sdev->restarts, old, 0);
+               } else if (unlikely(atomic_read(&sdev->device_busy) == 0 &&
+                                       !scsi_device_blocked(sdev)))
+                       blk_mq_delay_run_hw_queues(sdev->request_queue, SCSI_QUEUE_DELAY);
+       }

        percpu_ref_put(&q->q_usage_counter);
        return false;
@@ -1615,8 +1633,31 @@ static bool scsi_mq_get_budget(struct blk_mq_hw_ctx *hctx)
 {
        struct request_queue *q = hctx->queue;
        struct scsi_device *sdev = q->queuedata;
+       int ret = scsi_dev_queue_ready(q, sdev);
+
+       if (ret)
+               return true;

-       return scsi_dev_queue_ready(q, sdev);
+       /*
+        * If all in-flight requests originated from this LUN are completed
+        * before setting .restarts, sdev->device_busy will be observed as
+        * zero, then blk_mq_delay_run_hw_queue() will dispatch this request
+        * soon. Otherwise, completion of one of these request will observe
+        * the .restarts flag, and the request queue will be run for handling
+        * this request, see scsi_end_request().
+        */
+       atomic_inc(&sdev->restarts);
+
+       /*
+        * Order writing .restarts and reading .device_busy, and make sure
+        * .restarts is visible to scsi_end_request(). Its pair is implied by
+        * __blk_mq_end_request() in scsi_end_request() for ordering
+        * writing .device_busy in scsi_device_unbusy() and reading .restarts.
+        *
+        */
+       smp_mb__after_atomic();
+
+       return false;
 }

 static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index bc59090..ac45058 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -108,7 +108,8 @@ struct scsi_device {

        atomic_t device_busy;           /* commands actually active on LLDD */
        atomic_t device_blocked;        /* Device returned QUEUE_FULL. */
-
+
+       atomic_t restarts;
        spinlock_t list_lock;
        struct list_head starved_entry;
        unsigned short queue_depth;     /* How deep of a queue we want */


* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-08-06 10:25                                                           ` Kashyap Desai
@ 2020-08-06 13:38                                                             ` Ming Lei
  2020-08-06 14:37                                                               ` Kashyap Desai
  0 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-08-06 13:38 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Thu, Aug 06, 2020 at 03:55:50PM +0530, Kashyap Desai wrote:
> > > Ming -
> > >
> > > I noted your comments.
> > >
> > > I have completed testing and this particular latest performance issue
> > > on Volume is outstanding.
> > > Currently it is 20-25% performance drop in IOPs and we want that to be
> > > closed before shared host tag is enabled for <megaraid_sas> driver.
> > > Just for my understanding - What will be the next steps on this ?
> > >
> > > I can validate any new approach/patch for this issue.
> > >
> >
> > Hello,
> >
> > What do you think of the following patch?
> 
> I tested this patch. I still see IO hang.
> 
> >
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index
> > c866a4f33871..49f0fc5c7a63 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -552,8 +552,24 @@ static void scsi_run_queue_async(struct scsi_device
> > *sdev)
> >  	if (scsi_target(sdev)->single_lun ||
> >  	    !list_empty(&sdev->host->starved_list))
> >  		kblockd_schedule_work(&sdev->requeue_work);
> > -	else
> > -		blk_mq_run_hw_queues(sdev->request_queue, true);
> > +	else {
> > +		/*
> > +		 * smp_mb() implied in either rq->end_io or
> > blk_mq_free_request
> > +		 * is for ordering writing .device_busy in
> scsi_device_unbusy()
> > +		 * and reading sdev->restarts.
> > +		 */
> > +		int old = atomic_read(&sdev->restarts);
> > +
> > +		if (old) {
> > +			blk_mq_run_hw_queues(sdev->request_queue, true);
> > +
> > +			/*
> > +			 * ->restarts has to be kept as non-zero if there
> is
> > +			 *  new budget contention comes.
> > +			 */
> > +			atomic_cmpxchg(&sdev->restarts, old, 0);
> > +		}
> > +	}
> >  }
> >
> >  /* Returns false when no more bytes to process, true if there are more
> */
> > @@ -1612,8 +1628,34 @@ static void scsi_mq_put_budget(struct
> > request_queue *q)  static bool scsi_mq_get_budget(struct request_queue
> *q)
> > {
> >  	struct scsi_device *sdev = q->queuedata;
> > +	int ret = scsi_dev_queue_ready(q, sdev);
> >
> > -	return scsi_dev_queue_ready(q, sdev);
> > +	if (ret)
> > +		return true;
> > +
> > +	/*
> > +	 * If all in-flight requests originated from this LUN are
> completed
> > +	 * before setting .restarts, sdev->device_busy will be observed as
> > +	 * zero, then blk_mq_delay_run_hw_queue() will dispatch this
> request
> > +	 * soon. Otherwise, completion of one of these request will
> observe
> > +	 * the .restarts flag, and the request queue will be run for
> handling
> > +	 * this request, see scsi_end_request().
> > +	 */
> > +	atomic_inc(&sdev->restarts);
> > +
> > +	/*
> > +	 * Order writing .restarts and reading .device_busy, and make sure
> > +	 * .restarts is visible to scsi_end_request(). Its pair is implied
> by
> > +	 * __blk_mq_end_request() in scsi_end_request() for ordering
> > +	 * writing .device_busy in scsi_device_unbusy() and reading
> .restarts.
> > +	 *
> > +	 */
> > +	smp_mb__after_atomic();
> > +
> > +	if (unlikely(atomic_read(&sdev->device_busy) == 0 &&
> > +				!scsi_device_blocked(sdev)))
> > +		blk_mq_delay_run_hw_queues(sdev->request_queue,
> > SCSI_QUEUE_DELAY);
> 
> Hi Ming -
> 
> There is still some race which is not handled.  Take a case of IO is not
> able to get budget and it has already marked <restarts> flag.
> <restarts> flag will be seen non-zero in completion path and completion
> path will attempt h/w queue run. (But this particular IO is still not in
> s/w queue.).
> Attempt of running h/w queue from completion path will not flush any IO
> since there is no IO in s/w queue.

Then where is the IO to be submitted when we run out of budget?

For any IO request which is going to be added to hctx->dispatch, the queue
will be re-run by the blk-mq core.

Any IO request being issued directly when running out of budget will be
inserted into hctx->dispatch or the sw/scheduler queue, and the queue will be
run in the submission path.


Thanks, 
Ming



* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-08-06 13:38                                                             ` Ming Lei
@ 2020-08-06 14:37                                                               ` Kashyap Desai
  2020-08-06 15:29                                                                 ` Ming Lei
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-08-06 14:37 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> > Hi Ming -
> >
> > There is still some race which is not handled.  Take a case of IO is
> > not able to get budget and it has already marked <restarts> flag.
> > <restarts> flag will be seen non-zero in completion path and
> > completion path will attempt h/w queue run. (But this particular IO is
> > still not in s/w queue.).
> > Attempt of running h/w queue from completion path will not flush any
> > IO since there is no IO in s/w queue.
>
> Then where is the IO to be submitted in case of running out of budget?

A typical race in your latest patch is as follows (let's consider commands
A, B and C):
Command A did not receive budget. Command B (which was already submitted
earlier) completed at the same time and set sdev->device_busy = 0 from
"scsi_finish_command".
Command B has still not called "scsi_end_request". Command C gets the
budget and sets sdev->device_busy = 1. Now, Command A sets the
sdev->restarts flag but will not run the h/w queue since
sdev->device_busy = 1.
Command B runs the h/w queue (setting sdev->restarts = 0) from the
completion path, but Command A is still not in the s/w queue. Command A is
now in the s/w queue. Command C completes, but it will not run the h/w queue
because sdev->restarts = 0.


>
> Any IO request which is going to be added to hctx->dispatch, the queue
> will be re-run via blk-mq core.
>
> Any IO request being issued directly when running out of budget will be
> insert to hctx->dispatch or sw/scheduler queue, will be run in the
> submission path.

I have *not* included the below change, which we discussed, in my testing.
If I include the below patch, it is correct that the queue will be run in
the submission path (at least the path which is impacted in my testing).
You have already mentioned that most of the submission paths now have a fix
in the latest kernel w.r.t. running the h/w queue from the submission path.
The below path is still missing a h/w queue run from the submission path.

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 54f9015..bcfd33a 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -173,8 +173,10 @@ static int blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
                if (!sbitmap_any_bit_set(&hctx->ctx_map))
                        break;

-               if (!blk_mq_get_dispatch_budget(hctx))
+               if (!blk_mq_get_dispatch_budget(hctx)) {
+                       blk_mq_delay_run_hw_queue(hctx, BLK_MQ_BUDGET_DELAY);
                        break;
+               }

                rq = blk_mq_dequeue_from_ctx(hctx, ctx);
                if (!rq) {

Are you saying the above fix should be included along with your latest patch?

>
>
> Thanks,
> Ming


* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-08-06 14:37                                                               ` Kashyap Desai
@ 2020-08-06 15:29                                                                 ` Ming Lei
  2020-08-08 19:05                                                                   ` Kashyap Desai
  0 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-08-06 15:29 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Thu, Aug 06, 2020 at 08:07:38PM +0530, Kashyap Desai wrote:
> > > Hi Ming -
> > >
> > > There is still some race which is not handled.  Take a case of IO is
> > > not able to get budget and it has already marked <restarts> flag.
> > > <restarts> flag will be seen non-zero in completion path and
> > > completion path will attempt h/w queue run. (But this particular IO is
> > > still not in s/w queue.).
> > > Attempt of running h/w queue from completion path will not flush any
> > > IO since there is no IO in s/w queue.
> >
> > Then where is the IO to be submitted in case of running out of budget?
> 
> Typical race in your latest patch is - (Lets consider command A,B and C)
> Command A did not receive budget. Command B completed  (which was already

Command A doesn't get budget, and A is still in the sw/scheduler queue
because we try to acquire budget before dequeuing a request from the
sw/scheduler queue, see __blk_mq_do_dispatch_sched() and
blk_mq_do_dispatch_ctx().

Direct issue doesn't need to be considered, because the hw queue will be run
explicitly when budget is not obtained, see __blk_mq_try_issue_directly().

Command A being added to hctx->dispatch doesn't need to be considered either,
because blk-mq will re-run the queue, see blk_mq_dispatch_rq_list().


> submitted earlier) at the same time and it make sdev->device_busy = 0 from
> " scsi_finish_command".
> Command B has still not called "scsi_end_request". Command C get the
> budget and it will make sdev->device_busy = 1. Now, Command A set  set
> sdev->restarts flags but will not run h/w queue since sdev->device_busy =
> 1.

Right.

> Command B run h/w queue (make sdev->restart = 0) from completion path, but
> command -A is still not in the s/w queue.

Then you didn't answer my question about where A is, did you?

> Command-A is in now in s/w queue. Command-C completed but it will not run h/w queue because
> sdev->restarts = 0.

Why is command A in the sw queue now?

> 
> 
> >
> > Any IO request which is going to be added to hctx->dispatch, the queue
> will be
> > re-run via blk-mq core.
> >
> > Any IO request being issued directly when running out of budget will be
> insert
> > to hctx->dispatch or sw/scheduler queue, will be run in the submission
> path.
> 
> I have *not* included below changes we discussed in my testing - If I
> include below patch, it is correct that queue will be run in submission
> path (at least the path which is impacted in my testing). You have already
> mentioned that most of the submission path has fix now in latest kernel
> w.r.t running h/w queue from submission path.  Below path is missing for
> running h/w queue from submission path.
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index
> 54f9015..bcfd33a 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -173,8 +173,10 @@ static int blk_mq_do_dispatch_ctx(struct
> blk_mq_hw_ctx *hctx)
>                 if (!sbitmap_any_bit_set(&hctx->ctx_map))
>                         break;
> 
> -               if (!blk_mq_get_dispatch_budget(hctx))
> +               if (!blk_mq_get_dispatch_budget(hctx)) {
> +                       blk_mq_delay_run_hw_queue(hctx,
> + BLK_MQ_BUDGET_DELAY);
>                         break;
> +               }
> 
>                 rq = blk_mq_dequeue_from_ctx(hctx, ctx);
>                 if (!rq) {
> 
> Are you saying above fix should be included along with your latest patch ?

No.


Thanks,
Ming



* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-08-06 15:29                                                                 ` Ming Lei
@ 2020-08-08 19:05                                                                   ` Kashyap Desai
  2020-08-09  2:16                                                                     ` Ming Lei
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-08-08 19:05 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> On Thu, Aug 06, 2020 at 08:07:38PM +0530, Kashyap Desai wrote:
> > > > Hi Ming -
> > > >
> > > > There is still some race which is not handled.  Take a case of IO
> > > > is not able to get budget and it has already marked <restarts>
> > > > flag.
> > > > <restarts> flag will be seen non-zero in completion path and
> > > > completion path will attempt h/w queue run. (But this particular
> > > > IO is still not in s/w queue.).
> > > > Attempt of running h/w queue from completion path will not flush
> > > > any IO since there is no IO in s/w queue.
> > >
> > > Then where is the IO to be submitted in case of running out of
> > > budget?
> >
> > Typical race in your latest patch is - (Lets consider command A,B and
> > C) Command A did not receive budget. Command B completed  (which was
> > already
>
> Command A doesn't get budget, and A is still in sw/scheduler queue
> because we try to acquire budget before dequeuing request from
> sw/scheduler queue, see __blk_mq_do_dispatch_sched() and
> blk_mq_do_dispatch_ctx().
>
> Not consider direct issue, because the hw queue will be run explicitly
> when not getting budget, see __blk_mq_try_issue_directly.
>
> Not consider command A being added to hctx->dispatch too, because blk-mq
> will re-run the queue, see blk_mq_dispatch_rq_list().

Ming -

After going through your comments (noted, and thanks for correcting my
understanding) and the block layer code, I realize that it is a different
race condition. My previous explanation was not accurate.
I debugged further and figured out what is actually happening. Consider the
below scenario/sequence -

Thread-1 - Detected budget contention. Set restarts = 1.
Thread-2 - old restarts = 1. Start hw queue.
Thread-3 - old restarts = 1. Start hw queue.
Thread-2 - Move restarts = 0.
In my testing, I noticed that both thread-2 and thread-3 started the h/w
queue but there was no work for them to do. That is possible because some
other context running the h/w queue might have done that job.
It means the IO of thread-1 is already posted.
Thread-4 - Detected budget contention. Set restarts = 1 (because thread-2
has moved restarts to 0).
Thread-3 - Move restarts = 0 (because this thread saw old value = 1, but the
value has actually been updated one more time by thread-4, and thread-4
actually wanted to run the h/w queues). IO of thread-4 will not be scheduled.

We have to make sure that the completion IO path does the atomic_cmpxchg of
the restarts flag before running the h/w queue. The below code change fixes
the issue (the main fix is the ordering of atomic_cmpxchg and
blk_mq_run_hw_queues).

--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -594,8 +594,27 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
        if (scsi_target(sdev)->single_lun ||
            !list_empty(&sdev->host->starved_list))
                kblockd_schedule_work(&sdev->requeue_work);
-       else
-               blk_mq_run_hw_queues(q, true);
+       else {
+               /*
+                * smp_mb() implied in either rq->end_io or blk_mq_free_request
+                * is for ordering writing .device_busy in scsi_device_unbusy()
+                * and reading sdev->restarts.
+                */
+               int old = atomic_read(&sdev->restarts);
+
+               if (old) {
+                       /*
+                        * ->restarts has to be kept as non-zero if there is
+                        *  new budget contention comes.
+                        */
+                       atomic_cmpxchg(&sdev->restarts, old, 0);
+
+                       /* run the queue after restarts flag is updated
+                        * to avoid race condition with .get_budget
+                        */
+                       blk_mq_run_hw_queues(sdev->request_queue, true);
+               }
+       }

        percpu_ref_put(&q->q_usage_counter);
        return false;
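
To make the thread-1 ... thread-4 sequence easier to follow, here is a minimal
user-space sketch that replays it deterministically in a single thread with
C11 atomics. It is not kernel code: struct sim, contention(), run_queue() and
the replay functions are hypothetical names, and a "queue run" is modelled as
simply dispatching whatever is pending at that moment.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-device state; not kernel code. */
struct sim {
	atomic_int restarts;	/* models sdev->restarts */
	int pending;		/* IOs still waiting for budget */
};

static void sim_init(struct sim *s)
{
	atomic_init(&s->restarts, 0);
	s->pending = 0;
}

/* Submission path: budget contention leaves the IO queued and flags it. */
static void contention(struct sim *s)
{
	s->pending++;
	atomic_fetch_add(&s->restarts, 1);
}

/* A queue run dispatches whatever is pending at that moment. */
static void run_queue(struct sim *s)
{
	s->pending = 0;
}

/* Completion path: clear ->restarts only if it still holds 'old'. */
static bool clear_restarts(struct sim *s, int old)
{
	return atomic_compare_exchange_strong(&s->restarts, &old, 0);
}

/* True when an IO is pending but no flag is left to trigger a queue run. */
static bool stranded(struct sim *s)
{
	return s->pending > 0 && atomic_load(&s->restarts) == 0;
}

/* Racy ordering: run the queue first, clear the flag afterwards. */
static bool replay_run_then_clear(void)
{
	struct sim s;
	int old2, old3;

	sim_init(&s);
	contention(&s);				/* thread-1: restarts = 1 */
	old2 = atomic_load(&s.restarts);	/* thread-2 reads 1 */
	old3 = atomic_load(&s.restarts);	/* thread-3 reads 1 */
	run_queue(&s);				/* thread-2 */
	run_queue(&s);				/* thread-3, no work left */
	clear_restarts(&s, old2);		/* thread-2: 1 -> 0 */
	contention(&s);				/* thread-4: 0 -> 1 */
	clear_restarts(&s, old3);		/* thread-3: 1 -> 0 again */
	return stranded(&s);
}

/* Proposed ordering: clear the flag first, then run the queue. */
static bool replay_clear_then_run(void)
{
	struct sim s;
	int old2, old3;

	sim_init(&s);
	contention(&s);				/* thread-1: restarts = 1 */
	old2 = atomic_load(&s.restarts);	/* thread-2 reads 1 */
	old3 = atomic_load(&s.restarts);	/* thread-3 reads 1 */
	clear_restarts(&s, old2);		/* thread-2: 1 -> 0 */
	run_queue(&s);				/* thread-2 */
	contention(&s);				/* thread-4: 0 -> 1 */
	clear_restarts(&s, old3);		/* thread-3: 1 -> 0 */
	run_queue(&s);				/* thread-3 dispatches thread-4's IO */
	return stranded(&s);
}
```

Under these assumptions the first replay ends with an IO pending and
restarts == 0 (nobody will run the queue for it), while the second replay
leaves nothing stranded, because any contention flagged before the clear is
covered by the queue run that follows it.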

Kashyap

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-08-08 19:05                                                                   ` Kashyap Desai
@ 2020-08-09  2:16                                                                     ` Ming Lei
  2020-08-10 16:38                                                                       ` Kashyap Desai
  0 siblings, 1 reply; 123+ messages in thread
From: Ming Lei @ 2020-08-09  2:16 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Sun, Aug 09, 2020 at 12:35:21AM +0530, Kashyap Desai wrote:
> > On Thu, Aug 06, 2020 at 08:07:38PM +0530, Kashyap Desai wrote:
> > > > > Hi Ming -
> > > > >
> > > > > There is still some race which is not handled.  Take a case of IO
> > > > > is not able to get budget and it has already marked <restarts> flag.
> > > > > <restarts> flag will be seen non-zero in completion path and
> > > > > completion path will attempt h/w queue run. (But this particular
> > > > > IO is still not in s/w queue.).
> > > > > Attempt of running h/w queue from completion path will not flush
> > > > > any IO since there is no IO in s/w queue.
> > > >
> > > > Then where is the IO to be submitted in case of running out of budget?
> > >
> > > Typical race in your latest patch is - (Lets consider command A,B and
> > > C) Command A did not receive budget. Command B completed  (which was
> > > already
> >
> > Command A doesn't get budget, and A is still in sw/scheduler queue because
> > we try to acquire budget before dequeuing request from sw/scheduler queue,
> > see __blk_mq_do_dispatch_sched() and blk_mq_do_dispatch_ctx().
> >
> > Not consider direct issue, because the hw queue will be run explicitly when
> > not getting budget, see __blk_mq_try_issue_directly.
> >
> > Not consider command A being added to hctx->dispatch too, because blk-mq
> > will re-run the queue, see blk_mq_dispatch_rq_list().
> 
> Ming -
> 
> After going through your comment (I noted your comment and thanks for
> correcting my understanding.) and block layer code, I realize that it is a
> different race condition. My previous explanation was not accurate.
> I debug further and figure out what is actually happening - Consider below
> scenario/sequence -
> 
> Thread -1 - Detected budget contention. Set restarts = 1.
> Thread -2 - old restarts = 1. start hw queue.
> Thread -3 - old restarts = 1. start hw queue.
> Thread -2 - move restarts = 0.
> In my testing, I noticed that both thread-2 and thread-3 started h/w queue
> but there was no work for them to do. It is possible because some other
> context of h/w queue run might have done that job.

That should be true, because the queue is also run from elsewhere, such as
blk-mq's restart or delayed run queue.

> It means, IO of thread-1 is already posted.

OK.

> Thread -4 - Detected budget contention. Set restart = 1 (because thread-2
> has move restarts=0).

OK.

> Thread -3 - move restarts = 0 (because this thread see old value = 1 but
> that is actually updated one more time by thread-4 and theread-4 actually
> wanted to run h/w queues). IO of Thread-4 will not be scheduled.

Right.

> 
> We have to make sure that completion IO path do atomic_cmpxchng of
> restarts flag before running the h/w queue.  Below code change - (main fix
> is sequence of atomic_cmpxchg and blk_mq_run_hw_queues) fix the issue.
> 
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -594,8 +594,27 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
>         if (scsi_target(sdev)->single_lun ||
>             !list_empty(&sdev->host->starved_list))
>                 kblockd_schedule_work(&sdev->requeue_work);
> -       else
> -               blk_mq_run_hw_queues(q, true);
> +       else {
> +               /*
> +                * smp_mb() implied in either rq->end_io or blk_mq_free_request
> +                * is for ordering writing .device_busy in scsi_device_unbusy()
> +                * and reading sdev->restarts.
> +                */
> +               int old = atomic_read(&sdev->restarts);
> +
> +               if (old) {
> +                       /*
> +                        * ->restarts has to be kept as non-zero if there is
> +                        *  new budget contention comes.
> +                        */
> +                       atomic_cmpxchg(&sdev->restarts, old, 0);
> +
> +                       /* run the queue after restarts flag is updated
> +                        * to avoid race condition with .get_budget
> +                        */
> +                       blk_mq_run_hw_queues(sdev->request_queue, true);
> +               }
> +       }
> 

I think the above change is right, and this pattern is basically the same as
SCHED_RESTART used in blk_mq_sched_restart().

BTW, could you run your functional & performance tests against the following
new version? Then I can include your test results in the commit log for
moving on.


From 06993ddf5c5dbe0e772cc38342919eb61a57bc50 Mon Sep 17 00:00:00 2001
From: Ming Lei <ming.lei@redhat.com>
Date: Wed, 5 Aug 2020 16:35:53 +0800
Subject: [PATCH] scsi: core: only re-run queue in scsi_end_request() if device
 queue is busy

Currently the request queue is run unconditionally in scsi_end_request() if
both the target queue and the host queue are ready. We should re-run the
request queue only after this device queue has become busy, so that only
this LUN is restarted.

Recently Long Li reported that the cost of running the queue may be very
heavy in the case of high queue depth. So improve this situation by only
running the request queue when this LUN is busy.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ewan D. Milne <emilne@redhat.com>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Long Li <longli@microsoft.com>
Reported-by: Long Li <longli@microsoft.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
V4:
	- not clear .restarts in get_budget(), instead clearing it
	after re-run queue is done; Kashyap figured out we have to
	update ->restarts before re-run queue in scsi_run_queue_async().

V3:
	- add one smp_mb() in scsi_mq_get_budget() and comment

V2:
	- commit log change, no code change
	- add reported-by tag

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/scsi/scsi_lib.c    | 51 +++++++++++++++++++++++++++++++++++---
 include/scsi/scsi_device.h |  1 +
 2 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c866a4f33871..d083250f9518 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -552,8 +552,27 @@ static void scsi_run_queue_async(struct scsi_device *sdev)
 	if (scsi_target(sdev)->single_lun ||
 	    !list_empty(&sdev->host->starved_list))
 		kblockd_schedule_work(&sdev->requeue_work);
-	else
-		blk_mq_run_hw_queues(sdev->request_queue, true);
+	else {
+		/*
+		 * smp_mb() implied in either rq->end_io or blk_mq_free_request
+		 * is for ordering writing .device_busy in scsi_device_unbusy()
+		 * and reading sdev->restarts.
+		 */
+		int old = atomic_read(&sdev->restarts);
+
+		if (old) {
+			/*
+			 * ->restarts has to be kept as non-zero if there is
+			 *  new budget contention comes.
+			 *
+			 *  No need to run queue when either another re-run
+			 *  queue wins in updating ->restarts or one new budget
+			 *  contention comes.
+			 */
+			if (atomic_cmpxchg(&sdev->restarts, old, 0) == old)
+				blk_mq_run_hw_queues(sdev->request_queue, true);
+		}
+	}
 }
 
 /* Returns false when no more bytes to process, true if there are more */
@@ -1612,8 +1631,34 @@ static void scsi_mq_put_budget(struct request_queue *q)
 static bool scsi_mq_get_budget(struct request_queue *q)
 {
 	struct scsi_device *sdev = q->queuedata;
+	int ret = scsi_dev_queue_ready(q, sdev);
+
+	if (ret)
+		return true;
+
+	atomic_inc(&sdev->restarts);
 
-	return scsi_dev_queue_ready(q, sdev);
+	/*
+	 * Order writing .restarts and reading .device_busy, and make sure
+	 * .restarts is visible to scsi_end_request(). Its pair is implied by
+	 * __blk_mq_end_request() in scsi_end_request() for ordering
+	 * writing .device_busy in scsi_device_unbusy() and reading .restarts.
+	 *
+	 */
+	smp_mb__after_atomic();
+
+	/*
+	 * If all in-flight requests originated from this LUN are completed
+	 * before setting .restarts, sdev->device_busy will be observed as
+	 * zero, then blk_mq_delay_run_hw_queues() will dispatch this request
+	 * soon. Otherwise, completion of one of these request will observe
+	 * the .restarts flag, and the request queue will be run for handling
+	 * this request, see scsi_end_request().
+	 */
+	if (unlikely(atomic_read(&sdev->device_busy) == 0 &&
+				!scsi_device_blocked(sdev)))
+		blk_mq_delay_run_hw_queues(sdev->request_queue, SCSI_QUEUE_DELAY);
+	return false;
 }
 
 static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index bc5909033d13..1a5c9a3df6d6 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -109,6 +109,7 @@ struct scsi_device {
 	atomic_t device_busy;		/* commands actually active on LLDD */
 	atomic_t device_blocked;	/* Device returned QUEUE_FULL. */
 
+	atomic_t restarts;
 	spinlock_t list_lock;
 	struct list_head starved_entry;
 	unsigned short queue_depth;	/* How deep of a queue we want */
-- 
2.25.2
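
As a reading aid for the patch above, the following is a simplified user-space
sketch of how the two sides pair up: the submission side bumps a restarts
counter when it fails to get budget, and the completion side only triggers a
queue run if its compare-and-exchange on that counter wins. Everything here is
an assumption made for illustration: struct fake_sdev, get_budget() and
complete_one() are hypothetical names, the budget check is reduced to a plain
queue-depth comparison, and C11 sequentially consistent atomics stand in for
the kernel's atomic_inc() plus smp_mb__after_atomic().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the scsi_device fields involved. */
struct fake_sdev {
	atomic_int device_busy;	/* commands in flight */
	atomic_int restarts;	/* budget contentions not yet handled */
	int queue_depth;
	int delayed_runs;	/* counts the "delayed queue run" fallback */
};

/* Submission side: returns true if the command may be dispatched. */
static bool get_budget(struct fake_sdev *sdev)
{
	int busy = atomic_fetch_add(&sdev->device_busy, 1) + 1;

	if (busy <= sdev->queue_depth)
		return true;
	atomic_fetch_sub(&sdev->device_busy, 1);

	/* Record the contention; the seq_cst RMW orders it before the read. */
	atomic_fetch_add(&sdev->restarts, 1);

	/*
	 * If nothing is in flight any more, no completion will come along
	 * to notice ->restarts, so fall back to a delayed queue run.
	 */
	if (atomic_load(&sdev->device_busy) == 0)
		sdev->delayed_runs++;
	return false;
}

/* Completion side: returns true if the caller should run the hw queues. */
static bool complete_one(struct fake_sdev *sdev)
{
	int old;

	atomic_fetch_sub(&sdev->device_busy, 1);
	old = atomic_load(&sdev->restarts);
	if (!old)
		return false;
	/* Run the queue only if this context won the update of ->restarts. */
	return atomic_compare_exchange_strong(&sdev->restarts, &old, 0);
}
```

With a queue depth of 1, a second get_budget() call fails and leaves
restarts at 1; the next complete_one() then clears it and asks for a queue
run, whereas a completion that finds restarts at 0 skips the run entirely,
which is where the saving in blk_mq_run_hw_queues() calls comes from.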



Thanks,
Ming


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-08-09  2:16                                                                     ` Ming Lei
@ 2020-08-10 16:38                                                                       ` Kashyap Desai
  2020-08-11  8:09                                                                         ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-08-10 16:38 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> On Sun, Aug 09, 2020 at 12:35:21AM +0530, Kashyap Desai wrote:
> > > On Thu, Aug 06, 2020 at 08:07:38PM +0530, Kashyap Desai wrote:
> > > > > > Hi Ming -
> > > > > >
> > > > > > There is still some race which is not handled.  Take a case of
> > > > > > IO is not able to get budget and it has already marked
> > > > > > <restarts> flag.
> > > > > > <restarts> flag will be seen non-zero in completion path and
> > > > > > completion path will attempt h/w queue run. (But this
> > > > > > particular IO is still not in s/w queue.).
> > > > > > Attempt of running h/w queue from completion path will not
> > > > > > flush any IO since there is no IO in s/w queue.
> > > > >
> > > > > Then where is the IO to be submitted in case of running out of
> > > > > budget?
> > > >
> > > > Typical race in your latest patch is - (Lets consider command A,B
> > > > and C) Command A did not receive budget. Command B completed  (which
> > > > was already
> > >
> > > Command A doesn't get budget, and A is still in sw/scheduler queue
> > > because we try to acquire budget before dequeuing request from
> > > sw/scheduler queue,
> > > see __blk_mq_do_dispatch_sched() and blk_mq_do_dispatch_ctx().
> > >
> > > Not consider direct issue, because the hw queue will be run
> > > explicitly when not getting budget, see __blk_mq_try_issue_directly.
> > >
> > > Not consider command A being added to hctx->dispatch too, because
> > > blk-mq will re-run the queue, see blk_mq_dispatch_rq_list().
> >
> > Ming -
> >
> > After going through your comment (I noted your comment and thanks for
> > correcting my understanding.) and block layer code, I realize that it
> > is a different race condition. My previous explanation was not accurate.
> > I debug further and figure out what is actually happening - Consider
> > below scenario/sequence -
> >
> > Thread -1 - Detected budget contention. Set restarts = 1.
> > Thread -2 - old restarts = 1. start hw queue.
> > Thread -3 - old restarts = 1. start hw queue.
> > Thread -2 - move restarts = 0.
> > In my testing, I noticed that both thread-2 and thread-3 started h/w
> > queue but there was no work for them to do. It is possible because
> > some other context of h/w queue run might have done that job.
>
> It should be true, because there is other run queue somewhere, such as
> blk-mq's restart or delay run queue.
>
> > It means, IO of thread-1 is already posted.
>
> OK.
>
> > Thread -4 - Detected budget contention. Set restart = 1 (because
> > thread-2 has move restarts=0).
>
> OK.
>
> > Thread -3 - move restarts = 0 (because this thread see old value = 1
> > but that is actually updated one more time by thread-4 and theread-4
> > actually wanted to run h/w queues). IO of Thread-4 will not be scheduled.
>
> Right.
>
> >
> > We have to make sure that completion IO path do atomic_cmpxchng of
> > restarts flag before running the h/w queue.  Below code change - (main
> > fix is sequence of atomic_cmpxchg and blk_mq_run_hw_queues) fix the issue.
> >
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -594,8 +594,27 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
> >         if (scsi_target(sdev)->single_lun ||
> >             !list_empty(&sdev->host->starved_list))
> >                 kblockd_schedule_work(&sdev->requeue_work);
> > -       else
> > -               blk_mq_run_hw_queues(q, true);
> > +       else {
> > +               /*
> > +                * smp_mb() implied in either rq->end_io or blk_mq_free_request
> > +                * is for ordering writing .device_busy in scsi_device_unbusy()
> > +                * and reading sdev->restarts.
> > +                */
> > +               int old = atomic_read(&sdev->restarts);
> > +
> > +               if (old) {
> > +                       /*
> > +                        * ->restarts has to be kept as non-zero if there is
> > +                        *  new budget contention comes.
> > +                        */
> > +                       atomic_cmpxchg(&sdev->restarts, old, 0);
> > +
> > +                       /* run the queue after restarts flag is updated
> > +                        * to avoid race condition with .get_budget
> > +                        */
> > +                       blk_mq_run_hw_queues(sdev->request_queue, true);
> > +               }
> > +       }
> >
>
> I think the above change is right, and this patter is basically same with
> SCHED_RESTART used in blk_mq_sched_restart().
>
> BTW, could you run your function & performance test against the following
> new version?
> Then I can include your test result in commit log for moving on.

Ming - I completed both the functional and the performance tests.

System used for the test -
Manufacturer: Supermicro
Product Name: X11DPG-QT

lscpu <snippet>
CPU(s):                72
On-line CPU(s) list:   0-71
Thread(s) per core:    2
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Model name:            Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz

Controller used -
MegaRAID 9560-16i

Total 24 SAS drives of model "WDC      WUSTM3240ASS200"

Total 3 VDs created; each VD consists of 8 SAS drives.

Performance testing -

Fio script -
[global]
ioengine=libaio
direct=1
sync=0
ramp_time=20
runtime=60
cpus_allowed=18,19
bs=4k
rw=randread
ioscheduler=none
iodepth=128

[seqprecon]
filename=/dev/sdc
[seqprecon]
filename=/dev/sdd
[seqprecon]
filename=/dev/sde

Without this patch - 602K IOPS. Perf top snippet - (Note - usage of
blk_mq_run_hw_queues -> blk_mq_run_hw_queue is very high. It consumes more
CPU, which leads to lower performance.)

     8.70%  [kernel]        [k] blk_mq_run_hw_queue
     5.24%  [megaraid_sas]  [k] complete_cmd_fusion
     4.65%  [kernel]        [k] sbitmap_any_bit_set
     3.93%  [kernel]        [k] irq_entries_start
     3.58%  perf            [.] __symbols__insert
     2.21%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
     1.91%  [kernel]        [k] blk_mq_run_hw_queues

With this patch - 1110K IOPS. Perf top snippet -

     8.05%  [megaraid_sas]  [k] complete_cmd_fusion
     4.10%  [kernel]        [k] irq_entries_start
     3.71%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
     2.85%  [kernel]        [k] read_tsc
     2.83%  [kernel]        [k] io_submit_one
     2.26%  [kernel]        [k] entry_SYSCALL_64
     2.08%  [megaraid_sas]  [k] megasas_queue_command
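
As a rough cross-check of the two figures above, the relative gain can be
worked out with a small helper; gain_percent() is a hypothetical name used
only for this illustration, and the inputs are the 602K and 1110K IOPS values
reported in this mail.

```c
/* Relative improvement of 'after' over 'before', in percent. */
static double gain_percent(double before, double after)
{
	return (after - before) / before * 100.0;
}

/*
 * gain_percent(602.0, 1110.0) is (1110 - 602) / 602 * 100,
 * i.e. roughly an 84% increase in IOPS with the patch applied.
 */
```

In the "without" profile, blk_mq_run_hw_queue and blk_mq_run_hw_queues
together account for 8.70% + 1.91% = 10.61% of the samples, and neither
symbol appears in the "with" snippet.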


Functional Test -

I covered overnight IO testing using an <fio> script which sends 4K rand
read, read, rand write and write IOs to the 24 SAS JBOD drives.
Some of the JBODs have ioscheduler=none and some of the JBODs have
ioscheduler=mq-deadline.
I used an additional script which changes sdev->queue_depth of each device
within the range 2 to 16 at an interval of 5 seconds.
I used an additional script which toggles between "rq_affinity=1" and
"rq_affinity=2" at an interval of 5 seconds.

I did not notice any IO hang.
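
The helper scripts themselves are not part of this mail. Purely as an
illustration of the kind of knob-twiddling described, a sketch in C could
look like the following. All names are hypothetical, and stepping the depth
sequentially through 2..16 is an assumption (the mail only gives the range
and the 5 second interval).

```c
#include <stdio.h>

#define DEPTH_MIN 2
#define DEPTH_MAX 16

/* Next queue depth to try: walk 2, 3, ..., 16 and wrap back to 2. */
static int next_queue_depth(int cur)
{
	return (cur < DEPTH_MIN || cur >= DEPTH_MAX) ? DEPTH_MIN : cur + 1;
}

/* Flip rq_affinity between the two values used in the test. */
static int toggle_rq_affinity(int cur)
{
	return cur == 1 ? 2 : 1;
}

/* Write an integer to a sysfs attribute; returns 0 on success, -1 on error. */
static int write_sysfs_int(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A driver loop would call write_sysfs_int() on
/sys/block/<dev>/device/queue_depth and /sys/block/<dev>/queue/rq_affinity
for each disk under test, sleep for 5 seconds, and repeat while fio is
running.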

Thanks, Kashyap

>
>
> From 06993ddf5c5dbe0e772cc38342919eb61a57bc50 Mon Sep 17 00:00:00 2001
> From: Ming Lei <ming.lei@redhat.com>
> Date: Wed, 5 Aug 2020 16:35:53 +0800
> Subject: [PATCH] scsi: core: only re-run queue in scsi_end_request() if device
>  queue is busy
>
> Now the request queue is run in scsi_end_request() unconditionally if both
> target queue and host queue is ready. We should have re-run request queue
> only after this device queue becomes busy for restarting this LUN only.
>
> Recently Long Li reported that cost of run queue may be very heavy in
> case of high queue depth. So improve this situation by only running
> the request queue when this LUN is busy.
>
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Ewan D. Milne <emilne@redhat.com>
> Cc: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Damien Le Moal <damien.lemoal@wdc.com>
> Cc: Long Li <longli@microsoft.com>
> Reported-by: Long Li <longli@microsoft.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> V4:
> 	- not clear .restarts in get_budget(), instead clearing it
> 	after re-run queue is done; Kashyap figured out we have to
> 	update ->restarts before re-run queue in scsi_run_queue_async().
>
> V3:
> 	- add one smp_mb() in scsi_mq_get_budget() and comment
>
> V2:
> 	- commit log change, no any code change
> 	- add reported-by tag
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/scsi/scsi_lib.c    | 51 +++++++++++++++++++++++++++++++++++---
>  include/scsi/scsi_device.h |  1 +
>  2 files changed, 49 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index c866a4f33871..d083250f9518 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -552,8 +552,27 @@ static void scsi_run_queue_async(struct scsi_device *sdev)
>  	if (scsi_target(sdev)->single_lun ||
>  	    !list_empty(&sdev->host->starved_list))
>  		kblockd_schedule_work(&sdev->requeue_work);
> -	else
> -		blk_mq_run_hw_queues(sdev->request_queue, true);
> +	else {
> +		/*
> +		 * smp_mb() implied in either rq->end_io or blk_mq_free_request
> +		 * is for ordering writing .device_busy in scsi_device_unbusy()
> +		 * and reading sdev->restarts.
> +		 */
> +		int old = atomic_read(&sdev->restarts);
> +
> +		if (old) {
> +			/*
> +			 * ->restarts has to be kept as non-zero if there is
> +			 *  new budget contention comes.
> +			 *
> +			 *  No need to run queue when either another re-run
> +			 *  queue wins in updating ->restarts or one new budget
> +			 *  contention comes.
> +			 */
> +			if (atomic_cmpxchg(&sdev->restarts, old, 0) == old)
> +				blk_mq_run_hw_queues(sdev->request_queue, true);
> +		}
> +	}
>  }
>
>  /* Returns false when no more bytes to process, true if there are more */
> @@ -1612,8 +1631,34 @@ static void scsi_mq_put_budget(struct request_queue *q)
>  static bool scsi_mq_get_budget(struct request_queue *q)
>  {
>  	struct scsi_device *sdev = q->queuedata;
> +	int ret = scsi_dev_queue_ready(q, sdev);
> +
> +	if (ret)
> +		return true;
> +
> +	atomic_inc(&sdev->restarts);
>
> -	return scsi_dev_queue_ready(q, sdev);
> +	/*
> +	 * Order writing .restarts and reading .device_busy, and make sure
> +	 * .restarts is visible to scsi_end_request(). Its pair is implied by
> +	 * __blk_mq_end_request() in scsi_end_request() for ordering
> +	 * writing .device_busy in scsi_device_unbusy() and reading .restarts.
> +	 *
> +	 */
> +	smp_mb__after_atomic();
> +
> +	/*
> +	 * If all in-flight requests originated from this LUN are completed
> +	 * before setting .restarts, sdev->device_busy will be observed as
> +	 * zero, then blk_mq_delay_run_hw_queues() will dispatch this request
> +	 * soon. Otherwise, completion of one of these request will observe
> +	 * the .restarts flag, and the request queue will be run for handling
> +	 * this request, see scsi_end_request().
> +	 */
> +	if (unlikely(atomic_read(&sdev->device_busy) == 0 &&
> +				!scsi_device_blocked(sdev)))
> +		blk_mq_delay_run_hw_queues(sdev->request_queue, SCSI_QUEUE_DELAY);
> +	return false;
>  }
>
>  static blk_status_t scsi_queue_rq(struct blk_mq_hw_ctx *hctx,
> diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
> index bc5909033d13..1a5c9a3df6d6 100644
> --- a/include/scsi/scsi_device.h
> +++ b/include/scsi/scsi_device.h
> @@ -109,6 +109,7 @@ struct scsi_device {
>  	atomic_t device_busy;		/* commands actually active on LLDD */
>  	atomic_t device_blocked;	/* Device returned QUEUE_FULL. */
>
> +	atomic_t restarts;
>  	spinlock_t list_lock;
>  	struct list_head starved_entry;
>  	unsigned short queue_depth;	/* How deep of a queue we want */
> --
> 2.25.2
>
>
>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-23 14:23         ` Hannes Reinecke
  2020-06-24  8:13           ` Kashyap Desai
@ 2020-08-10 16:51           ` Kashyap Desai
  2020-08-11  8:01             ` John Garry
  1 sibling, 1 reply; 123+ messages in thread
From: Kashyap Desai @ 2020-08-10 16:51 UTC (permalink / raw)
  To: Hannes Reinecke, John Garry, Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, Sumit Saxena,
	bvanassche, hare, hch, Shivasharan Srikanteshwara, linux-block,
	linux-scsi, esc.storagedev, chenxiang66

> > Kashyap, I've also attached an updated patch for the elevator_count
> > patch; if you agree John can include it in the next version.
>
> Hannes - Patch looks good.   Header does not include problem statement.
> How about adding below in header ?
>
> High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
> contention is possible in mq-deadline and bfq io scheduler when
> nr_hw_queues is more than one.
> It is because kblockd work queue can submit IO from all online CPUs (through
> blk_mq_run_hw_queues) even though only one hctx has pending commands.
> Elevator callback "has_work" for mq-deadline and bfq scheduler consider
> pending work if there are any IOs on request queue and it does not account
> hctx context.

Hannes/John - We need one more correction for the patch below -

https://github.com/hisilicon/kernel-dev/commit/ff631eb80aa0449eaeb78a282fd7eff2a9e42f77

I noticed that the elevator_queued count goes negative, mainly because there
are some cases where the IO was submitted from the dispatch queue (not the
scheduler queue) and the request still has the "RQF_ELVPRIV" flag set.
In such cases "dd_finish_request" is called without "dd_insert_request". I
think it is better to decrement the counter once the request is taken out of
the scheduler in the dispatch path. (Ming proposed using the dispatch path
for decrementing the counter, but I somehow did not take that into account,
assuming RQF_ELVPRIV would be set only if the IO is submitted from the
scheduler queue.)

Below is the additional change. Can you merge this?

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 9d75374..bc413dd 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -385,6 +385,8 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)

        spin_lock(&dd->lock);
        rq = __dd_dispatch_request(dd);
+       if (rq)
+               atomic_dec(&rq->mq_hctx->elevator_queued);
        spin_unlock(&dd->lock);

        return rq;
@@ -574,7 +576,6 @@ static void dd_finish_request(struct request *rq)
                        blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
                spin_unlock_irqrestore(&dd->zone_lock, flags);
        }
-       atomic_dec(&rq->mq_hctx->elevator_queued);
 }

 static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
--
2.9.5
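
To illustrate why the counter can go negative, here is a small user-space
model of the accounting. It is only a sketch: struct fake_hctx and the
sched_*() helpers are hypothetical names, and the only assumption carried
over from the description above is that a request which bypassed the
scheduler can still reach the finish hook.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-hctx counter added by the patch. */
struct fake_hctx {
	atomic_int elevator_queued;
};

/* Models the insert hook: the request enters the scheduler queue. */
static void sched_insert(struct fake_hctx *h)
{
	atomic_fetch_add(&h->elevator_queued, 1);
}

/* Models the dispatch hook: the request leaves the scheduler queue. */
static void sched_dispatch(struct fake_hctx *h, bool dec_on_dispatch)
{
	if (dec_on_dispatch)
		atomic_fetch_sub(&h->elevator_queued, 1);
}

/* Models the finish hook, which also runs for bypassing requests. */
static void sched_finish(struct fake_hctx *h, bool dec_on_dispatch)
{
	if (!dec_on_dispatch)
		atomic_fetch_sub(&h->elevator_queued, 1);
}

/*
 * One request goes through the scheduler, a second one is issued from the
 * dispatch list and only ever sees the finish hook. Returns the final count.
 */
static int run_mix(bool dec_on_dispatch)
{
	struct fake_hctx h;

	atomic_init(&h.elevator_queued, 0);

	sched_insert(&h);			/* request 1 */
	sched_dispatch(&h, dec_on_dispatch);
	sched_finish(&h, dec_on_dispatch);

	sched_finish(&h, dec_on_dispatch);	/* request 2, never inserted */

	return atomic_load(&h.elevator_queued);
}
```

In this model, decrementing in the finish hook leaves the counter at -1
after the mixed workload, while decrementing in the dispatch hook keeps every
decrement paired with an insert and the counter returns to 0.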

Kashyap

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-08-10 16:51           ` Kashyap Desai
@ 2020-08-11  8:01             ` John Garry
  2020-08-11 16:34               ` Kashyap Desai
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-08-11  8:01 UTC (permalink / raw)
  To: Kashyap Desai, Hannes Reinecke, Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, Sumit Saxena,
	bvanassche, hare, hch, Shivasharan Srikanteshwara, linux-block,
	linux-scsi, esc.storagedev, chenxiang66

On 10/08/2020 17:51, Kashyap Desai wrote:
>> hctx context.
> Hannes/John - We need one more correction for below patch -
> 
> https://github.com/hisilicon/kernel-dev/commit/ff631eb80aa0449eaeb78a282fd7eff2a9e42f77
> 
> I noticed - that elevator_queued count goes negative mainly because there
> are some case where IO was submitted from dispatch queue(not scheduler
> queue) and request still has "RQF_ELVPRIV" flag set.
> In such cases " dd_finish_request" is called without " dd_insert_request". I
> think it is better to decrement counter once it is taken out from dispatched
> queue. (Ming proposed to use dispatch path for decrementing counter, but I
> somehow did not accounted assuming RQF_ELVPRIV will be set only if IO is
> submitted from scheduler queue.)
> 
> Below is additional change. Can you merge this ?
> 
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 9d75374..bc413dd 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -385,6 +385,8 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> 
>          spin_lock(&dd->lock);
>          rq = __dd_dispatch_request(dd);
> +       if (rq)
> +               atomic_dec(&rq->mq_hctx->elevator_queued);

Is there any reason why this operation could not be taken outside the 
spinlock? I assume raciness is not a problem with this patch...

>          spin_unlock(&dd->lock);
> 
>          return rq;
> @@ -574,7 +576,6 @@ static void dd_finish_request(struct request *rq)
>                          blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
>                  spin_unlock_irqrestore(&dd->zone_lock, flags);
>          }
> -       atomic_dec(&rq->mq_hctx->elevator_queued);
>   }
> 
>   static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
> --
> 2.9.5
> 
> Kashyap


btw, can you provide signed-off-by if you want credit upgraded to 
Co-developed-by?

Thanks,
john

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-08-10 16:38                                                                       ` Kashyap Desai
@ 2020-08-11  8:09                                                                         ` John Garry
  0 siblings, 0 replies; 123+ messages in thread
From: John Garry @ 2020-08-11  8:09 UTC (permalink / raw)
  To: Kashyap Desai, Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, Sumit Saxena,
	bvanassche, hare, hch, Shivasharan Srikanteshwara, linux-block,
	linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

On 10/08/2020 17:38, Kashyap Desai wrote:
> 
> Functional Test -
> 
> I cover overnight IO testing using <fio> script which sends 4K rand read,
> read, rand write and write IOs to the 24 SAS JBOD drives.
> Some of the JBOD has ioscheduler=none and some of the JBOD has
> ioscheduler=mq-deadline
> I used additional script which change sdev->queue_depth of each device
> from 2 to 16 range at the interval of 5 seconds.
> I used additional script which toggle "rq_affinity=1" and "rq_affinity=2"
> at the interval of 5 seconds.
> 
> I did not noticed any IO hang.
> 
> Thanks, Kashyap
> 

Nice work. I think the v8 series can now be prepared, since this final
reported performance issue looks resolved. But I still don't know what's
going on with "[PATCHv6 00/21] scsi: enable reserved commands for LLDDs",
which the current hpsa patch relies on for basic functionality :(

>>
>>
>> From 06993ddf5c5dbe0e772cc38342919eb61a57bc50 Mon Sep 17 00:00:00 2001
>> From: Ming Lei <ming.lei@redhat.com>
>> Date: Wed, 5 Aug 2020 16:35:53 +0800
>> Subject: [PATCH] scsi: core: only re-run queue in scsi_end_request() if device
>> queue is busy

Ming, I assume that you will send this directly to SCSI maintainers when 
the merge window closes.

>>
>> Now the request queue is run in scsi_end_request() unconditionally if both
>> target queue and host queue is ready. We should have re-run request queue
>> only after this device queue becomes busy for restarting this LUN only.
>>
>> Recently Long Li reported that cost of run queue may be very heavy in case of
>> high queue depth. So improve this situation by only running the request queue
>> when this LUN is busy.


^ permalink raw reply	[flat|nested] 123+ messages in thread

* RE: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-08-11  8:01             ` John Garry
@ 2020-08-11 16:34               ` Kashyap Desai
  0 siblings, 0 replies; 123+ messages in thread
From: Kashyap Desai @ 2020-08-11 16:34 UTC (permalink / raw)
  To: John Garry, Hannes Reinecke, Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, Sumit Saxena,
	bvanassche, hare, hch, Shivasharan Srikanteshwara, linux-block,
	linux-scsi, esc.storagedev, chenxiang66

> > diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> > index 9d75374..bc413dd 100644
> > --- a/block/mq-deadline.c
> > +++ b/block/mq-deadline.c
> > @@ -385,6 +385,8 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> >
> >          spin_lock(&dd->lock);
> >          rq = __dd_dispatch_request(dd);
> > +       if (rq)
> > +               atomic_dec(&rq->mq_hctx->elevator_queued);
>
> Is there any reason why this operation could not be taken outside the
> spinlock? I assume raciness is not a problem with this patch...

No issue if we want to move this outside the spinlock.

>
> >          spin_unlock(&dd->lock);
> >
> >          return rq;
> > @@ -574,7 +576,6 @@ static void dd_finish_request(struct request *rq)
> >                          blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
> >                  spin_unlock_irqrestore(&dd->zone_lock, flags);
> >          }
> > -       atomic_dec(&rq->mq_hctx->elevator_queued);
> >   }
> >
> >   static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
> > --
> > 2.9.5
> >
> > Kashyap
> > .#
>
>
> btw, can you provide signed-off-by if you want credit upgraded to Co-
> developed-by?

I will send you merged patch which you can push to your git repo.

Kashyap

>
> Thanks,
> john


* RE: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-08-05 11:21               ` John Garry
@ 2020-08-14 21:04                 ` Don.Brace
  2020-08-17  8:00                   ` John Garry
  0 siblings, 1 reply; 123+ messages in thread
From: Don.Brace @ 2020-08-14 21:04 UTC (permalink / raw)
  To: john.garry, hare, don.brace
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

Subject: Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ

> I'll continue with more I/O stress testing.

>>ok, great. Please let me know about your testing, as I might just add that series to the v8 branch.

>>Cheers,
>>John

I cloned your branch from https://github.com/hisilicon/kernel-dev
  and checkout branch: origin/private-topic-blk-mq-shared-tags-rfc-v7

By themselves, the branch compiled but the driver did not load.

I cherry-picked the following patches from Hannes:
  git://git.kernel.org/pub/scm/linux/kernel/git/hare/scsi-devel.git
  branch: reserved-tags.v6

6a9d1a96ea41 hpsa: move hpsa_hba_inquiry after scsi_add_host()
eeb5cd1fca58 hpsa: use reserved commands
	conflict: removal of function hpsa_get_cmd_index,
               non-functional issue.
7df7d8382902 hpsa: use scsi_host_busy_iter() to traverse outstanding commands
485881d6d8dc hpsa: drop refcount field from CommandList
c4980ad5e5cb scsi: implement reserved command handling
	conflict: drivers/scsi/scsi_lib.c
               minor context issue adding comment,
               non-functional issue.
34d03fa945c0 scsi: add scsi_{get,put}_internal_cmd() helper
	conflict: drivers/scsi/scsi_lib.c
               minor context issue around scsi_get_internal_cmd
               when adding new comment,
               non-functional issue
4556e50450c8 block: add flag for internal commands
138125f74b25 scsi: hpsa: Lift {BIG_,}IOCTL_Command_struct copy{in,out} into hpsa_ioctl()
cb17c1b69b17 scsi: hpsa: Don't bother with vmalloc for BIG_IOCTL_Command_struct
10100ffd5f65 scsi: hpsa: Get rid of compat_alloc_user_space()
06b43f968db5 scsi: hpsa: hpsa_ioctl(): Tidy up a bit
a381637f8a6e scsi: use real inquiry data when initialising devices
6e9884aefe66 scsi: Use dummy inquiry data for the host device
77dcb92c31ae scsi: revamp host device handling

After the above patches were applied, the driver loaded and I ran the following tests:
insmod/rmmod
reboot
Ran an I/O stress test consisting of:
	6 SATA HBA disks
	2 SAS HBA disks
	2 RAID 5 volumes using 3 SAS HDDs
	2 RAID 5 volumes using 3 SAS SSDs (ioaccel enabled)

	1) fio tests to raw disks.
	2) mke2fs tests
	3) mount
	4) fio to file systems
	5) umount
	6) fsck

	And running reset tests in parallel to the above 6 tests using sg_reset

I ran some performance tests to HBA and LOGICAL VOLUMES and did not find a performance regression.

We are also reconsidering changing smartpqi over to use host tags but in some preliminary performance tests, I found a performance regression.
Note: I only used your V7 patches for smartpqi.
      I have not had time to determine what is causing this, but wanted to make note of this.

For hpsa:

With all of the patches noted above,
Tested-by: Don Brace <don.brace@microsemi.com>

For hpsa specific patches:
Reviewed-by: Don Brace <don.brace@microsemi.com>

Thanks for your input and your patches,
Don



* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-08-14 21:04                 ` Don.Brace
@ 2020-08-17  8:00                   ` John Garry
  2020-08-17 18:39                     ` Don.Brace
  0 siblings, 1 reply; 123+ messages in thread
From: John Garry @ 2020-08-17  8:00 UTC (permalink / raw)
  To: Don.Brace, hare, don.brace
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 14/08/2020 22:04, Don.Brace@microchip.com wrote:

Hi Don,

> I cloned your branch from https://github.com/hisilicon/kernel-dev
>    and checkout branch: origin/private-topic-blk-mq-shared-tags-rfc-v7
> 
> By themselves, the branch compiled but the driver did not load.
> 
> I cherry-picked the following patches from Hannes:
>    git://git.kernel.org/pub/scm/linux/kernel/git/hare/scsi-devel.git
>    branch: reserved-tags.v6
> 
> 6a9d1a96ea41 hpsa: move hpsa_hba_inquiry after scsi_add_host()
> eeb5cd1fca58 hpsa: use reserved commands
> 	conflict: removal of function hpsa_get_cmd_index,
>                 non-functional issue.
> 7df7d8382902 hpsa: use scsi_host_busy_iter() to traverse outstanding commands
> 485881d6d8dc hpsa: drop refcount field from CommandList
> c4980ad5e5cb scsi: implement reserved command handling
> 	conflict: drivers/scsi/scsi_lib.c
>                 minor context issue adding comment,
>                 non-functional issue.
> 34d03fa945c0 scsi: add scsi_{get,put}_internal_cmd() helper
> 	conflict: drivers/scsi/scsi_lib.c
>                 minor context issue around scsi_get_internal_cmd
>                 when adding new comment,
>                 non-functional issue
> 4556e50450c8 block: add flag for internal commands
> 138125f74b25 scsi: hpsa: Lift {BIG_,}IOCTL_Command_struct copy{in,out} into hpsa_ioctl()
> cb17c1b69b17 scsi: hpsa: Don't bother with vmalloc for BIG_IOCTL_Command_struct
> 10100ffd5f65 scsi: hpsa: Get rid of compat_alloc_user_space()
> 06b43f968db5 scsi: hpsa: hpsa_ioctl(): Tidy up a bit
> a381637f8a6e scsi: use real inquiry data when initialising devices
> 6e9884aefe66 scsi: Use dummy inquiry data for the host device
> 77dcb92c31ae scsi: revamp host device handling
> 
> After the above patches were applied, the driver loaded and I ran the following tests:
> insmod/rmmod
> reboot
> Ran an I/O stress test consisting of:
> 	6 SATA HBA disks
> 	2 SAS HBA disks
> 	2 RAID 5 volumes using 3 SAS HDDs
> 	2 RAID 5 volumes using 3 SAS SSDs (ioaccel enabled)
> 
> 	1) fio tests to raw disks.
> 	2) mke2fs tests
> 	3) mount
> 	4) fio to file systems
> 	5) umount
> 	6) fsck
> 
> 	And running reset tests in parallel to the above 6 tests using sg_reset
> 
> I ran some performance tests to HBA and LOGICAL VOLUMES and did not find a performance regression.
> 

ok, thanks for this info. I appreciate it.

> We are also reconsidering changing smartpqi over to use host tags but in some preliminary performance tests, I found a performance regression.
> Note: I only used your V7 patches for smartpqi.
>        I have not had time to determine what is causing this, but wanted to make note of this.

Thanks. Please note that we have been looking at many performance
improvements since v7, and these will be included in v8, so maybe I can
still include smartpqi in the v8 series and you can retest if you want.

> 
> For hpsa:
> 
> With all of the patches noted above,
> Tested-by: Don Brace <don.brace@microsemi.com>
> 
> For hpsa specific patches:
> Reviewed-by: Don Brace <don.brace@microsemi.com>

Thanks. Please also note that I want to drop the RFC tag for v8 series, 
so I will just have to note that we still depend on Hannes' work for 
hpsa. We could also change the patch, but let's see how we go.

Cheers,
John



* RE: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-08-17  8:00                   ` John Garry
@ 2020-08-17 18:39                     ` Don.Brace
  2020-08-18  7:14                       ` Hannes Reinecke
  0 siblings, 1 reply; 123+ messages in thread
From: Don.Brace @ 2020-08-17 18:39 UTC (permalink / raw)
  To: john.garry, hare, don.brace
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

Subject: Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ

> We are also reconsidering changing smartpqi over to use host tags but in some preliminary performance tests, I found a performance regression.
> Note: I only used your V7 patches for smartpqi.
>        I have not had time to determine what is causing this, but wanted to make note of this.

>>Thanks. Please note that we have been looking at many performance improvements since v7, and these will be included in v8, so maybe I can still include smartpqi in the v8 series and you can retest if you want.

Sure,
Thanks for your patches
Don

>
> For hpsa:
>
> With all of the patches noted above,
> Tested-by: Don Brace <don.brace@microsemi.com>
>
> For hpsa specific patches:
> Reviewed-by: Don Brace <don.brace@microsemi.com>

>>Thanks. Please also note that I want to drop the RFC tag for v8 series, so I will just have to note that we still depend on Hannes' work for hpsa. We could also change the patch, but let's see how we go.
>>
>>Cheers,
>>John



* Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-08-17 18:39                     ` Don.Brace
@ 2020-08-18  7:14                       ` Hannes Reinecke
  0 siblings, 0 replies; 123+ messages in thread
From: Hannes Reinecke @ 2020-08-18  7:14 UTC (permalink / raw)
  To: Don.Brace, john.garry, don.brace
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 8/17/20 8:39 PM, Don.Brace@microchip.com wrote:
> Subject: Re: [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
> 
>> We are also reconsidering changing smartpqi over to use host tags but in some preliminary performance tests,
>> I found a performance regression.
>> Note: I only used your V7 patches for smartpqi.
>>        I have not had time to determine what is causing this, but wanted to make note of this.
> 
>> Thanks. Please note that we have been looking at many performance improvements since v7, and these will
>> be included in v8, so maybe I can still include smartpqi in the v8 series and you can retest if you want.
> 
> Sure,
> Thanks for your patches
> Don
> 
Well, I had been looking at smartpqi and its tag handling, and found it
no easy match with the blk-mq implementation we have in linux.
(Which is actually quite curious, as both had been developed around the
same time, so I would've thought that they would be similar ...)

Anyhow: for smartpqi each sgl element uses up one slot in the submission
queue, so the total number of SQEs for one command is 1 (for the command
itself) + number of sgl elements.
With that the queue size is actually dynamic, and depends on the size of
the commands being sent.
This doesn't map easily onto blk-mq concepts, where we assume that each
command consumes one SQE.
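
As a rough illustration of the arithmetic above (a sketch, not smartpqi code; the helper names and the example queue size are assumptions made up for this note):

```c
#include <assert.h>

/*
 * Hypothetical helper: submission queue entries (SQEs) consumed by one
 * command under the model described above, i.e. one entry for the command
 * itself plus one per sgl element.
 */
static unsigned int sqes_per_cmd(unsigned int nr_sgl)
{
	return 1 + nr_sgl;
}

/*
 * Hypothetical helper: how many commands of a given size fit into a
 * submission queue with sq_entries slots. The result shrinks as the
 * commands carry more sgl elements, which is why the effective queue
 * depth is dynamic.
 */
static unsigned int cmds_that_fit(unsigned int sq_entries, unsigned int nr_sgl)
{
	assert(sq_entries > 0);
	return sq_entries / sqes_per_cmd(nr_sgl);
}
```

With an assumed 64-entry queue, 64 commands fit when they carry no sgl elements, but only 16 fit when each carries 3, whereas a blk-mq tag set would be sized as if one tag always meant one entry.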

So currently the smartpqi driver has its own heuristics for determining
the queue depth, but I fear that this also will eat up quite some
improvements we might be getting from using host_tagset.
(Especially as the smartpqi driver doesn't actually _have_ a host_tagset,
but rather the mapping within the driver exposes something which looks
like a host tagset ...)

What I really would like to see is to update blk-mq to handle smartpqi
properly; this might even be beneficial to other drivers like mpt3sas
which have a similar concept (called 'chain_tracker' there).
One idea would be to allocate additional tags (one for each sgl) such
that the tag bitmap reflects the status of the submission queue.
Maybe we can update my reserved tags patchset for that ... hmmm.
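
One way to picture that idea, purely as a user-space sketch: a bitmap with one bit per submission queue slot, where a command claims 1 + nr_sgl bits at once. Nothing here is blk-mq or sbitmap API; struct sq_bitmap, sq_alloc() and sq_free() are hypothetical names, the 64-slot size is arbitrary, and the code is single-threaded, whereas a real tag allocator would need atomic operations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SQ_DEPTH 64	/* assumed queue size, one bit per slot */

/* Hypothetical bitmap mirroring submission queue occupancy. */
struct sq_bitmap {
	uint64_t used;
};

/*
 * Claim one slot for the command plus one per sgl element.
 * Returns the mask of claimed slots, or 0 if the queue cannot hold
 * the whole command (nothing is claimed in that case).
 */
static uint64_t sq_alloc(struct sq_bitmap *sq, unsigned int nr_sgl)
{
	unsigned int need = 1 + nr_sgl;
	uint64_t mask = 0;

	assert(sq != NULL);
	for (unsigned int bit = 0; bit < SQ_DEPTH && need; bit++) {
		uint64_t b = UINT64_C(1) << bit;

		if (!(sq->used & b)) {
			mask |= b;
			need--;
		}
	}
	if (need)
		return 0;	/* not enough free slots */

	sq->used |= mask;
	return mask;
}

/* Release every slot a command claimed. */
static void sq_free(struct sq_bitmap *sq, uint64_t mask)
{
	assert(sq != NULL);
	sq->used &= ~mask;
}
```

The point of the sketch is only that the number of set bits then tracks the number of occupied queue entries, rather than the number of commands in flight.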

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		           Kernel Storage Architect
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


* Re: [PATCH RFC v7 11/12] smartpqi: enable host tagset
  2020-07-14 14:02     ` Hannes Reinecke
@ 2020-08-18  8:33       ` John Garry
  0 siblings, 0 replies; 123+ messages in thread
From: John Garry @ 2020-08-18  8:33 UTC (permalink / raw)
  To: Hannes Reinecke, don.brace, hare
  Cc: axboe, jejb, martin.petersen, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 14/07/2020 15:02, Hannes Reinecke wrote:
> On 7/14/20 3:16 PM, John Garry wrote:
>> Hi Hannes,
>>
>>>    static struct pqi_io_request *pqi_alloc_io_request(
>>> -    struct pqi_ctrl_info *ctrl_info)
>>> +    struct pqi_ctrl_info *ctrl_info, struct scsi_cmnd *scmd)
>>>    {
>>>        struct pqi_io_request *io_request;
>>> +    unsigned int limit = PQI_RESERVED_IO_SLOTS;
>>>        u16 i = ctrl_info->next_io_request_slot;    /* benignly racy */
>>>    -    while (1) {
>>> +    if (scmd) {
>>> +        u32 blk_tag = blk_mq_unique_tag(scmd->request);
>>> +
>>> +        i = blk_mq_unique_tag_to_tag(blk_tag) + limit;
>>>            io_request = &ctrl_info->io_request_pool[i];
>>
>> This looks ok
>>
>>> -        if (atomic_inc_return(&io_request->refcount) == 1)
>>> -            break;
>>> -        atomic_dec(&io_request->refcount);
>>> -        i = (i + 1) % ctrl_info->max_io_slots;
>>> +        if (WARN_ON(atomic_inc_return(&io_request->refcount) > 1)) {
>>> +            atomic_dec(&io_request->refcount);
>>> +            return NULL;
>>> +        }
>>> +    } else {
>>> +        while (1) {
>>> +            io_request = &ctrl_info->io_request_pool[i];
>>> +            if (atomic_inc_return(&io_request->refcount) == 1)
>>> +                break;
>>> +            atomic_dec(&io_request->refcount);
>>> +            i = (i + 1) % limit;
>>
>> To me, the range we use here looks incorrect. I would assume we should
>> restrict range to [max_io_slots - PQI_RESERVED_IO_SLOTS, max_io_slots).
>>
>> But then your reserved commands support would solve that.
>>
> This branch of the 'if' condition will only be taken for internal
> commands, for which we only allow up to PQI_RESERVED_IO_SLOTS.
> And we set the 'normal' I/O commands above at an offset, so we're fine here.

Here is the code:

----8<----
	unsigned int limit = PQI_RESERVED_IO_SLOTS;
	u16 i = ctrl_info->next_io_request_slot; /* benignly racy */

	if (scmd) {
		u32 blk_tag = blk_mq_unique_tag(scmd->request);

		i = blk_mq_unique_tag_to_tag(blk_tag) + limit;
		io_request = &ctrl_info->io_request_pool[i];
		if (WARN_ON(atomic_inc_return(&io_request->refcount) > 1)) {
			atomic_dec(&io_request->refcount);
			return NULL;
		}
	} else {
		while (1) {
			io_request = &ctrl_info->io_request_pool[i];
			if (atomic_inc_return(&io_request->refcount) == 1)
				break;
			atomic_dec(&io_request->refcount);
			i = (i + 1) % limit;
		}
	}

	/* benignly racy */
	ctrl_info->next_io_request_slot = (i + 1) % ctrl_info->max_io_slots;

---->8----

Is how we set ctrl_info->next_io_request_slot ok? Should it be:

ctrl_info->next_io_request_slot = (i + 1) % limit;

And also moved into 'else' leg for good measure.
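
To make the suggestion concrete, a user-space model of the allocation (not the driver code) might look like the following. The model_* names and the slot counts are made up for this example, and unlike the while (1) loop in the driver, the reserved scan here gives up after one pass. The hint is only written in the reserved leg and is wrapped with the reserved limit, so a tagged allocation can never push it outside the reserved range.

```c
#include <assert.h>
#include <stdatomic.h>

#define RESERVED_SLOTS 4	/* stands in for PQI_RESERVED_IO_SLOTS */
#define MAX_IO_SLOTS   16	/* stands in for ctrl_info->max_io_slots */

/* Hypothetical model of the controller's request pool. */
struct model_ctrl {
	atomic_int refcount[MAX_IO_SLOTS];
	unsigned int next_slot;	/* hint, used for reserved slots only */
};

static void model_init(struct model_ctrl *c)
{
	for (unsigned int i = 0; i < MAX_IO_SLOTS; i++)
		atomic_init(&c->refcount[i], 0);
	c->next_slot = 0;
}

/*
 * Normal I/O: the slot is derived from the block layer tag, offset past
 * the reserved range. Returns the slot index, or -1 if it is already held.
 */
static int model_alloc_tagged(struct model_ctrl *c, unsigned int tag)
{
	unsigned int i = tag + RESERVED_SLOTS;

	assert(tag < MAX_IO_SLOTS - RESERVED_SLOTS);
	if (atomic_fetch_add(&c->refcount[i], 1) != 0) {
		atomic_fetch_sub(&c->refcount[i], 1);
		return -1;
	}
	return (int)i;	/* the hint is deliberately left untouched */
}

/*
 * Internal command: scan only [0, RESERVED_SLOTS) and update the hint
 * modulo the reserved limit. Returns the slot index, or -1 if all
 * reserved slots are busy.
 */
static int model_alloc_reserved(struct model_ctrl *c)
{
	unsigned int i = c->next_slot % RESERVED_SLOTS;

	for (unsigned int tries = 0; tries < RESERVED_SLOTS; tries++) {
		if (atomic_fetch_add(&c->refcount[i], 1) == 0) {
			c->next_slot = (i + 1) % RESERVED_SLOTS;
			return (int)i;
		}
		atomic_fetch_sub(&c->refcount[i], 1);
		i = (i + 1) % RESERVED_SLOTS;
	}
	return -1;
}
```

By contrast, with the hint computed modulo max_io_slots after a tagged allocation, the next internal command would start its scan at an index in the normal I/O range and could claim a slot that a block layer tag maps to.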

Thanks,
John


Thread overview: 123+ messages
2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
2020-06-10 17:29 ` [PATCH RFC v7 01/12] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
2020-06-10 17:29 ` [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth() John Garry
2020-06-11  2:57   ` Ming Lei
2020-06-11  8:26     ` John Garry
2020-06-23 11:25       ` John Garry
2020-06-23 14:23         ` Hannes Reinecke
2020-06-24  8:13           ` Kashyap Desai
2020-06-29 16:18             ` John Garry
2020-08-10 16:51           ` Kashyap Desai
2020-08-11  8:01             ` John Garry
2020-08-11 16:34               ` Kashyap Desai
2020-06-10 17:29 ` [PATCH RFC v7 03/12] blk-mq: Use pointers for blk_mq_tags bitmap tags John Garry
2020-06-10 17:29 ` [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset John Garry
2020-06-11  3:37   ` Ming Lei
2020-06-11 10:09     ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap John Garry
2020-06-11  4:04   ` Ming Lei
2020-06-11 10:22     ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set " John Garry
2020-06-11 13:16   ` Hannes Reinecke
2020-06-11 14:22     ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a " John Garry
2020-06-11 13:19   ` Hannes Reinecke
2020-06-11 14:33     ` John Garry
2020-06-12  6:06       ` Hannes Reinecke
2020-06-29 15:32         ` About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap) John Garry
2020-06-30  6:33           ` Hannes Reinecke
2020-06-30  7:30             ` John Garry
2020-06-30 11:36               ` John Garry
2020-06-30 14:55           ` Bart Van Assche
2020-07-13  9:41         ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap John Garry
2020-07-13 12:20           ` Hannes Reinecke
2020-06-10 17:29 ` [PATCH RFC v7 08/12] scsi: Add template flag 'host_tagset' John Garry
2020-06-10 17:29 ` [PATCH RFC v7 09/12] scsi: hisi_sas: Switch v3 hw to MQ John Garry
2020-06-10 17:29 ` [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters " John Garry
2020-07-02 10:23   ` Kashyap Desai
2020-07-06  8:23     ` John Garry
2020-07-06  8:45       ` Hannes Reinecke
2020-07-06  9:26         ` John Garry
2020-07-06  9:40           ` Hannes Reinecke
2020-07-06 19:19       ` Kashyap Desai
2020-07-07  7:58         ` John Garry
2020-07-07 14:45           ` Kashyap Desai
2020-07-07 16:17             ` John Garry
2020-07-09 19:01               ` Kashyap Desai
2020-07-10  8:10                 ` John Garry
2020-07-13  7:55                   ` Kashyap Desai
2020-07-13  8:42                     ` John Garry
2020-07-19 19:07                       ` Kashyap Desai
2020-07-20  7:23                       ` Kashyap Desai
2020-07-20  9:18                         ` John Garry
2020-07-21  1:13                         ` Ming Lei
2020-07-21  6:53                           ` Kashyap Desai
2020-07-22  4:12                             ` Ming Lei
2020-07-22  5:30                               ` Kashyap Desai
2020-07-22  8:04                                 ` Ming Lei
2020-07-22  9:32                                   ` John Garry
2020-07-23 14:07                                     ` Ming Lei
2020-07-23 17:29                                       ` John Garry
2020-07-24  2:47                                         ` Ming Lei
2020-07-28  7:54                                           ` John Garry
2020-07-28  8:45                                             ` Ming Lei
2020-07-29  5:25                                               ` Kashyap Desai
2020-07-29 15:36                                                 ` Ming Lei
2020-07-29 18:31                                                   ` Kashyap Desai
2020-08-04  8:36                                                     ` Ming Lei
2020-08-04  9:27                                                       ` Kashyap Desai
2020-08-05  8:40                                                         ` Ming Lei
2020-08-06 10:25                                                           ` Kashyap Desai
2020-08-06 13:38                                                             ` Ming Lei
2020-08-06 14:37                                                               ` Kashyap Desai
2020-08-06 15:29                                                                 ` Ming Lei
2020-08-08 19:05                                                                   ` Kashyap Desai
2020-08-09  2:16                                                                     ` Ming Lei
2020-08-10 16:38                                                                       ` Kashyap Desai
2020-08-11  8:09                                                                         ` John Garry
2020-08-04 17:00                                               ` John Garry
2020-08-05  2:56                                                 ` Ming Lei
2020-07-28  8:01                                   ` Kashyap Desai
2020-07-08 11:31         ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 11/12] smartpqi: enable host tagset John Garry
2020-07-14 13:16   ` John Garry
2020-07-14 13:31     ` John Garry
2020-07-14 18:16       ` Don.Brace
2020-07-15  7:28         ` John Garry
2020-07-14 14:02     ` Hannes Reinecke
2020-08-18  8:33       ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ John Garry
2020-07-14  7:37   ` John Garry
2020-07-14  7:41     ` Hannes Reinecke
2020-07-14  7:52       ` John Garry
2020-07-14  8:06         ` Ming Lei
2020-07-14  9:53           ` John Garry
2020-07-14 10:14             ` Ming Lei
2020-07-14 10:43               ` Hannes Reinecke
2020-07-14 10:19             ` Hannes Reinecke
2020-07-14 10:35               ` John Garry
2020-07-14 10:44               ` Ming Lei
2020-07-14 10:52                 ` John Garry
2020-07-14 12:04                   ` Ming Lei
2020-08-03 20:39         ` Don.Brace
2020-08-04  9:27           ` John Garry
2020-08-04 15:18             ` Don.Brace
2020-08-05 11:21               ` John Garry
2020-08-14 21:04                 ` Don.Brace
2020-08-17  8:00                   ` John Garry
2020-08-17 18:39                     ` Don.Brace
2020-08-18  7:14                       ` Hannes Reinecke
2020-07-16 16:14     ` Don.Brace
2020-07-16 19:45     ` Don.Brace
2020-07-17 10:11       ` John Garry
2020-06-11  3:07 ` [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs Ming Lei
2020-06-11  9:35   ` John Garry
2020-06-12 18:47     ` Kashyap Desai
2020-06-15  2:13       ` Ming Lei
2020-06-15  6:57         ` Kashyap Desai
2020-06-16  1:00           ` Ming Lei
2020-06-17 11:26             ` Kashyap Desai
2020-06-22  6:24               ` Hannes Reinecke
2020-06-23  0:55                 ` Ming Lei
2020-06-23 11:50                   ` Kashyap Desai
2020-06-23 12:11                   ` Kashyap Desai
