Linux-Block Archive on lore.kernel.org
* [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
@ 2020-06-10 17:29 John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 01/12] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
                   ` (12 more replies)
  0 siblings, 13 replies; 45+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

Hi all,

Here is v7 of the patchset.

In this version of the series, we keep the shared sbitmap for driver tags,
and introduce changes to fix up the tag budgeting across request queues
(and associated debugfs changes).

Some performance figures:

Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
though it is not always an appropriate scheduler to use.

Tag depth 		4000 (default)			260**

Baseline:
none sched:		2290K IOPS			894K
mq-deadline sched:	2341K IOPS			2313K

Final, host_tagset=0 in LLDD*
none sched:		2289K IOPS			703K
mq-deadline sched:	2337K IOPS			2291K

Final:
none sched:		2281K IOPS			1101K
mq-deadline sched:	2322K IOPS			1278K

* this is relevant as it shows the performance when supporting, but not
  enabling, the feature
** depth=260 is relevant as a point where we are regularly waiting for
   tags to become available. Figures were a bit unstable here during testing.

A copy of the patches can be found here:
https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7

And to progress this series, we need the following to go in first, when ready:
https://lore.kernel.org/linux-scsi/20200430131904.5847-1-hare@suse.de/

Comments welcome, thanks!

Differences to v6:
- tag budgeting across request queues and associated changes
- add any reviewed tags
- rebase
- I did not include any change related to updating shared sbitmap per-hctx
  wait pointer, based on lack of evidence of performance improvement. This
  was discussed here originally:
  https://lore.kernel.org/linux-scsi/ecaeccf029c6fe377ebd4f30f04df9f1@mail.gmail.com/
  I may revisit.

Differences to v5:
- For now, drop the shared scheduler tags
- Fix megaraid SAS queue selection and rebase
- Omit minor unused arguments patch, which has now been merged
- Add separate patch to introduce sbitmap pointer
- Fixed hctx_tags_bitmap_show() for shared sbitmap
- Various tidying

Hannes Reinecke (5):
  blk-mq: rename blk_mq_update_tag_set_depth()
  scsi: Add template flag 'host_tagset'
  megaraid_sas: switch fusion adapters to MQ
  smartpqi: enable host tagset
  hpsa: enable host_tagset and switch to MQ

John Garry (6):
  blk-mq: Use pointers for blk_mq_tags bitmap tags
  blk-mq: Facilitate a shared sbitmap per tagset
  blk-mq: Record nr_active_requests per queue for when using shared
    sbitmap
  blk-mq: Record active_queues_shared_sbitmap per tag_set for when using
    shared sbitmap
  blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  scsi: hisi_sas: Switch v3 hw to MQ

Ming Lei (1):
  blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED

 block/bfq-iosched.c                         |   4 +-
 block/blk-core.c                            |   2 +
 block/blk-mq-debugfs.c                      | 120 ++++++++++++++-
 block/blk-mq-tag.c                          | 157 ++++++++++++++------
 block/blk-mq-tag.h                          |  21 ++-
 block/blk-mq.c                              |  64 +++++---
 block/blk-mq.h                              |  33 +++-
 block/kyber-iosched.c                       |   4 +-
 drivers/scsi/hisi_sas/hisi_sas.h            |   3 +-
 drivers/scsi/hisi_sas/hisi_sas_main.c       |  36 ++---
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c      |  87 +++++------
 drivers/scsi/hpsa.c                         |  44 +-----
 drivers/scsi/hpsa.h                         |   1 -
 drivers/scsi/megaraid/megaraid_sas.h        |   1 -
 drivers/scsi/megaraid/megaraid_sas_base.c   |  59 +++-----
 drivers/scsi/megaraid/megaraid_sas_fusion.c |  24 +--
 drivers/scsi/scsi_lib.c                     |   2 +
 drivers/scsi/smartpqi/smartpqi_init.c       |  38 +++--
 include/linux/blk-mq.h                      |   9 +-
 include/linux/blkdev.h                      |   3 +
 include/scsi/scsi_host.h                    |   6 +-
 21 files changed, 463 insertions(+), 255 deletions(-)

-- 
2.26.2



* [PATCH RFC v7 01/12] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth() John Garry
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

From: Ming Lei <ming.lei@redhat.com>

BLK_MQ_F_TAG_SHARED actually means that the tags are shared among request
queues, all of which should belong to LUNs attached to the same HBA.

So rename it to make that point explicit.

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-debugfs.c |  2 +-
 block/blk-mq-tag.c     |  2 +-
 block/blk-mq-tag.h     |  4 ++--
 block/blk-mq.c         | 20 ++++++++++----------
 include/linux/blk-mq.h |  2 +-
 5 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 15df3a36e9fa..52d11f8422a7 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -237,7 +237,7 @@ static const char *const alloc_policy_name[] = {
 #define HCTX_FLAG_NAME(name) [ilog2(BLK_MQ_F_##name)] = #name
 static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(SHOULD_MERGE),
-	HCTX_FLAG_NAME(TAG_SHARED),
+	HCTX_FLAG_NAME(TAG_QUEUE_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
 	HCTX_FLAG_NAME(STACKING),
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 96a39d0724a2..85aa1690cbcf 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -65,7 +65,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 {
 	unsigned int depth, users;
 
-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
+	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return true;
 	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
 		return true;
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index d38e48f2a0a4..c810a346db8e 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -56,7 +56,7 @@ extern void __blk_mq_tag_idle(struct blk_mq_hw_ctx *);
 
 static inline bool blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return false;
 
 	return __blk_mq_tag_busy(hctx);
@@ -64,7 +64,7 @@ static inline bool blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 
 static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return;
 
 	__blk_mq_tag_idle(hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9a36ac1c1fa1..d255c485ca5f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -281,7 +281,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		rq->tag = BLK_MQ_NO_TAG;
 		rq->internal_tag = tag;
 	} else {
-		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
+		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
 			rq_flags = RQF_MQ_INFLIGHT;
 			atomic_inc(&data->hctx->nr_active);
 		}
@@ -1116,7 +1116,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 	wait_queue_entry_t *wait;
 	bool ret;
 
-	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED)) {
+	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
 
 		/*
@@ -1282,7 +1282,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 				 * For non-shared tags, the RESTART check
 				 * will suffice.
 				 */
-				if (hctx->flags & BLK_MQ_F_TAG_SHARED)
+				if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
 					no_tag = true;
 				break;
 			}
@@ -2579,7 +2579,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	spin_lock_init(&hctx->lock);
 	INIT_LIST_HEAD(&hctx->dispatch);
 	hctx->queue = q;
-	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
+	hctx->flags = set->flags & ~BLK_MQ_F_TAG_QUEUE_SHARED;
 
 	INIT_LIST_HEAD(&hctx->hctx_list);
 
@@ -2796,9 +2796,9 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (shared)
-			hctx->flags |= BLK_MQ_F_TAG_SHARED;
+			hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		else
-			hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
+			hctx->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 	}
 }
 
@@ -2824,7 +2824,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 	list_del_rcu(&q->tag_set_list);
 	if (list_is_singular(&set->tag_list)) {
 		/* just transitioned to unshared */
-		set->flags &= ~BLK_MQ_F_TAG_SHARED;
+		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
 		blk_mq_update_tag_set_depth(set, false);
 	}
@@ -2841,12 +2841,12 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
 	 */
 	if (!list_empty(&set->tag_list) &&
-	    !(set->flags & BLK_MQ_F_TAG_SHARED)) {
-		set->flags |= BLK_MQ_F_TAG_SHARED;
+	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
+		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
 		blk_mq_update_tag_set_depth(set, true);
 	}
-	if (set->flags & BLK_MQ_F_TAG_SHARED)
+	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
 		queue_set_hctx_shared(q, true);
 	list_add_tail_rcu(&q->tag_set_list, &set->tag_list);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d6fcae17da5a..233209e8030d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -392,7 +392,7 @@ struct blk_mq_ops {
 
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
-	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_TAG_QUEUE_SHARED = 1 << 1,
 	/*
 	 * Set when this device requires underlying blk-mq device for
 	 * completing IO:
-- 
2.26.2



* [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 01/12] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-11  2:57   ` Ming Lei
  2020-06-10 17:29 ` [PATCH RFC v7 03/12] blk-mq: Use pointers for blk_mq_tags bitmap tags John Garry
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, Hannes Reinecke, John Garry

From: Hannes Reinecke <hare@suse.de>

The function does not set the depth, but rather transitions from
shared to non-shared queues and vice versa.
So rename it to blk_mq_update_tag_set_shared() to better reflect
its purpose.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-tag.c | 18 ++++++++++--------
 block/blk-mq.c     |  8 ++++----
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 85aa1690cbcf..bedddf168253 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -454,24 +454,22 @@ static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
 				       node);
 }
 
-static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
-						   int node, int alloc_policy)
+static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
+				   int node, int alloc_policy)
 {
 	unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
 	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
 
 	if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node))
-		goto free_tags;
+		return -ENOMEM;
 	if (bt_alloc(&tags->breserved_tags, tags->nr_reserved_tags, round_robin,
 		     node))
 		goto free_bitmap_tags;
 
-	return tags;
+	return 0;
 free_bitmap_tags:
 	sbitmap_queue_free(&tags->bitmap_tags);
-free_tags:
-	kfree(tags);
-	return NULL;
+	return -ENOMEM;
 }
 
 struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
@@ -492,7 +490,11 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
 
-	return blk_mq_init_bitmap_tags(tags, node, alloc_policy);
+	if (blk_mq_init_bitmap_tags(tags, node, alloc_policy) < 0) {
+		kfree(tags);
+		tags = NULL;
+	}
+	return tags;
 }
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d255c485ca5f..c20d75c851f2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2802,8 +2802,8 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 	}
 }
 
-static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set,
-					bool shared)
+static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
+					 bool shared)
 {
 	struct request_queue *q;
 
@@ -2826,7 +2826,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 		/* just transitioned to unshared */
 		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
-		blk_mq_update_tag_set_depth(set, false);
+		blk_mq_update_tag_set_shared(set, false);
 	}
 	mutex_unlock(&set->tag_list_lock);
 	INIT_LIST_HEAD(&q->tag_set_list);
@@ -2844,7 +2844,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
 		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
-		blk_mq_update_tag_set_depth(set, true);
+		blk_mq_update_tag_set_shared(set, true);
 	}
 	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
 		queue_set_hctx_shared(q, true);
-- 
2.26.2



* [PATCH RFC v7 03/12] blk-mq: Use pointers for blk_mq_tags bitmap tags
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 01/12] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth() John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset John Garry
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
with the goal of later being able to use a common shared tag bitmap across
all HW contexts in a set.
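
To make the indirection concrete, here is a simplified sketch (not the
exact kernel code) of what the new pointers end up referencing. The
default assignment happens at tag init; the redirection to a set-wide
pair only comes with the later shared-sbitmap patch in this series:

	/* default: each blk_mq_tags uses its own embedded sbitmaps */
	tags->bitmap_tags    = &tags->__bitmap_tags;
	tags->breserved_tags = &tags->__breserved_tags;

	/* later, for a shared-sbitmap tag set, they are redirected */
	tags->bitmap_tags    = &set->__bitmap_tags;
	tags->breserved_tags = &set->__breserved_tags;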

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/bfq-iosched.c    |  4 ++--
 block/blk-mq-debugfs.c |  8 ++++----
 block/blk-mq-tag.c     | 41 ++++++++++++++++++++++-------------------
 block/blk-mq-tag.h     |  7 +++++--
 block/blk-mq.c         |  4 ++--
 block/kyber-iosched.c  |  4 ++--
 6 files changed, 37 insertions(+), 31 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 50c8f034c01c..a1123d4d586d 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6372,8 +6372,8 @@ static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
 	struct blk_mq_tags *tags = hctx->sched_tags;
 	unsigned int min_shallow;
 
-	min_shallow = bfq_update_depths(bfqd, &tags->bitmap_tags);
-	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, min_shallow);
+	min_shallow = bfq_update_depths(bfqd, tags->bitmap_tags);
+	sbitmap_queue_min_shallow_depth(tags->bitmap_tags, min_shallow);
 }
 
 static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 52d11f8422a7..a400b6698dff 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -450,11 +450,11 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
 		   atomic_read(&tags->active_queues));
 
 	seq_puts(m, "\nbitmap_tags:\n");
-	sbitmap_queue_show(&tags->bitmap_tags, m);
+	sbitmap_queue_show(tags->bitmap_tags, m);
 
 	if (tags->nr_reserved_tags) {
 		seq_puts(m, "\nbreserved_tags:\n");
-		sbitmap_queue_show(&tags->breserved_tags, m);
+		sbitmap_queue_show(tags->breserved_tags, m);
 	}
 }
 
@@ -485,7 +485,7 @@ static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
 	if (res)
 		goto out;
 	if (hctx->tags)
-		sbitmap_bitmap_show(&hctx->tags->bitmap_tags.sb, m);
+		sbitmap_bitmap_show(&hctx->tags->bitmap_tags->sb, m);
 	mutex_unlock(&q->sysfs_lock);
 
 out:
@@ -519,7 +519,7 @@ static int hctx_sched_tags_bitmap_show(void *data, struct seq_file *m)
 	if (res)
 		goto out;
 	if (hctx->sched_tags)
-		sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags.sb, m);
+		sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags->sb, m);
 	mutex_unlock(&q->sysfs_lock);
 
 out:
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index bedddf168253..be39db3c88d7 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -35,9 +35,9 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
  */
 void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
 {
-	sbitmap_queue_wake_all(&tags->bitmap_tags);
+	sbitmap_queue_wake_all(tags->bitmap_tags);
 	if (include_reserve)
-		sbitmap_queue_wake_all(&tags->breserved_tags);
+		sbitmap_queue_wake_all(tags->breserved_tags);
 }
 
 /*
@@ -113,10 +113,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 			WARN_ON_ONCE(1);
 			return BLK_MQ_NO_TAG;
 		}
-		bt = &tags->breserved_tags;
+		bt = tags->breserved_tags;
 		tag_offset = 0;
 	} else {
-		bt = &tags->bitmap_tags;
+		bt = tags->bitmap_tags;
 		tag_offset = tags->nr_reserved_tags;
 	}
 
@@ -162,9 +162,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 						data->ctx);
 		tags = blk_mq_tags_from_data(data);
 		if (data->flags & BLK_MQ_REQ_RESERVED)
-			bt = &tags->breserved_tags;
+			bt = tags->breserved_tags;
 		else
-			bt = &tags->bitmap_tags;
+			bt = tags->bitmap_tags;
 
 		/*
 		 * If destination hw queue is changed, fake wake up on
@@ -198,10 +198,10 @@ void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
 		const int real_tag = tag - tags->nr_reserved_tags;
 
 		BUG_ON(real_tag >= tags->nr_tags);
-		sbitmap_queue_clear(&tags->bitmap_tags, real_tag, ctx->cpu);
+		sbitmap_queue_clear(tags->bitmap_tags, real_tag, ctx->cpu);
 	} else {
 		BUG_ON(tag >= tags->nr_reserved_tags);
-		sbitmap_queue_clear(&tags->breserved_tags, tag, ctx->cpu);
+		sbitmap_queue_clear(tags->breserved_tags, tag, ctx->cpu);
 	}
 }
 
@@ -325,9 +325,9 @@ static void __blk_mq_all_tag_iter(struct blk_mq_tags *tags,
 	WARN_ON_ONCE(flags & BT_TAG_ITER_RESERVED);
 
 	if (tags->nr_reserved_tags)
-		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv,
+		bt_tags_for_each(tags, tags->breserved_tags, fn, priv,
 				 flags | BT_TAG_ITER_RESERVED);
-	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, flags);
+	bt_tags_for_each(tags, tags->bitmap_tags, fn, priv, flags);
 }
 
 /**
@@ -441,8 +441,8 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 			continue;
 
 		if (tags->nr_reserved_tags)
-			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
-		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
+			bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
+		bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
 	}
 	blk_queue_exit(q);
 }
@@ -460,15 +460,18 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
 	unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
 	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
 
-	if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node))
+	if (bt_alloc(&tags->__bitmap_tags, depth, round_robin, node))
 		return -ENOMEM;
-	if (bt_alloc(&tags->breserved_tags, tags->nr_reserved_tags, round_robin,
-		     node))
+	if (bt_alloc(&tags->__breserved_tags, tags->nr_reserved_tags,
+		     round_robin, node))
 		goto free_bitmap_tags;
 
+	tags->bitmap_tags = &tags->__bitmap_tags;
+	tags->breserved_tags = &tags->__breserved_tags;
+
 	return 0;
 free_bitmap_tags:
-	sbitmap_queue_free(&tags->bitmap_tags);
+	sbitmap_queue_free(&tags->__bitmap_tags);
 	return -ENOMEM;
 }
 
@@ -499,8 +502,8 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
 {
-	sbitmap_queue_free(&tags->bitmap_tags);
-	sbitmap_queue_free(&tags->breserved_tags);
+	sbitmap_queue_free(&tags->__bitmap_tags);
+	sbitmap_queue_free(&tags->__breserved_tags);
 	kfree(tags);
 }
 
@@ -550,7 +553,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 		 * Don't need (or can't) update reserved tags here, they
 		 * remain static and should never need resizing.
 		 */
-		sbitmap_queue_resize(&tags->bitmap_tags,
+		sbitmap_queue_resize(tags->bitmap_tags,
 				tdepth - tags->nr_reserved_tags);
 	}
 
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index c810a346db8e..cebf7a4b280a 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -13,8 +13,11 @@ struct blk_mq_tags {
 
 	atomic_t active_queues;
 
-	struct sbitmap_queue bitmap_tags;
-	struct sbitmap_queue breserved_tags;
+	struct sbitmap_queue *bitmap_tags;
+	struct sbitmap_queue *breserved_tags;
+
+	struct sbitmap_queue __bitmap_tags;
+	struct sbitmap_queue __breserved_tags;
 
 	struct request **rqs;
 	struct request **static_rqs;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c20d75c851f2..90b645c3092c 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1093,7 +1093,7 @@ static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
 		struct sbitmap_queue *sbq;
 
 		list_del_init(&wait->entry);
-		sbq = &hctx->tags->bitmap_tags;
+		sbq = hctx->tags->bitmap_tags;
 		atomic_dec(&sbq->ws_active);
 	}
 	spin_unlock(&hctx->dispatch_wait_lock);
@@ -1111,7 +1111,7 @@ static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
 static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 				 struct request *rq)
 {
-	struct sbitmap_queue *sbq = &hctx->tags->bitmap_tags;
+	struct sbitmap_queue *sbq = hctx->tags->bitmap_tags;
 	struct wait_queue_head *wq;
 	wait_queue_entry_t *wait;
 	bool ret;
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index a38c5ab103d1..075e99c207ef 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -359,7 +359,7 @@ static unsigned int kyber_sched_tags_shift(struct request_queue *q)
 	 * All of the hardware queues have the same depth, so we can just grab
 	 * the shift of the first one.
 	 */
-	return q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift;
+	return q->queue_hw_ctx[0]->sched_tags->bitmap_tags->sb.shift;
 }
 
 static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
@@ -502,7 +502,7 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
 	khd->batching = 0;
 
 	hctx->sched_data = khd;
-	sbitmap_queue_min_shallow_depth(&hctx->sched_tags->bitmap_tags,
+	sbitmap_queue_min_shallow_depth(hctx->sched_tags->bitmap_tags,
 					kqd->async_depth);
 
 	return 0;
-- 
2.26.2



* [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (2 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 03/12] blk-mq: Use pointers for blk_mq_tags bitmap tags John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-11  3:37   ` Ming Lei
  2020-06-10 17:29 ` [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap John Garry
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with single hostwide tags.

In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completion to not be serviced when an
interrupt is shut down. That problem is solved in commit bf0beec0607d
("blk-mq: drain I/O when all CPUs in a hctx are offline").

However, to take advantage of that blk-mq feature, the HBA HW queues are
required to be mapped to the blk-mq hctx's; to do that, the HBA HW queues
need to be exposed to the upper layer.

In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.

Another problem, however, is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queues) commands. In commit 6eb045e092ef ("scsi:
 core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
counter was removed, which would have stopped the LLDD from being sent more
than .can_queue commands; however, it should still be ensured that the block
layer does not issue more than .can_queue commands to the Scsi host.
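
(As a rough worked example with illustrative numbers: a host with
.can_queue = 4096 and 16 HW queues could otherwise see up to
4096 * 16 = 65536 commands outstanding from the block layer, while the
host can only take 4096.)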

To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.

New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.

Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
is still allocated per hctx; the reason is that if tags and requests were
only allocated for a single hctx - like hctx0 - it may break block drivers
which expect a request to be associated with a specific hctx, i.e. not
always hctx0. This does introduce extra memory usage.
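
As a rough illustration of the intended usage (a minimal sketch, not
taken from this series; the ops, queue count and depth are placeholders),
a blk-mq driver would simply request the shared sbitmap when allocating
its tag set:

	static int example_init_tag_set(struct blk_mq_tag_set *set,
					const struct blk_mq_ops *ops)
	{
		memset(set, 0, sizeof(*set));
		set->ops = ops;
		set->nr_hw_queues = 16;   /* expose all HW queues */
		set->queue_depth = 4096;  /* hostwide depth, shared by all hctx's */
		set->numa_node = NUMA_NO_NODE;
		set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_TAG_HCTX_SHARED;

		return blk_mq_alloc_tag_set(set);
	}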

This change is based on work originally from Ming Lei in [1] and from
Bart's suggestion in [2].

[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be

Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-tag.c     | 39 +++++++++++++++++++++++++++++++++++++--
 block/blk-mq-tag.h     | 10 +++++++++-
 block/blk-mq.c         | 24 +++++++++++++++++++++++-
 block/blk-mq.h         |  5 +++++
 include/linux/blk-mq.h |  6 ++++++
 5 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index be39db3c88d7..92843e3e1a2a 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -228,7 +228,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
-	if (rq && rq->q == hctx->queue)
+	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
 		return iter_data->fn(hctx, rq, iter_data->data, reserved);
 	return true;
 }
@@ -466,6 +466,7 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
 		     round_robin, node))
 		goto free_bitmap_tags;
 
+	/* We later overwrite these in case of per-set shared sbitmap */
 	tags->bitmap_tags = &tags->__bitmap_tags;
 	tags->breserved_tags = &tags->__breserved_tags;
 
@@ -475,7 +476,32 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
-struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
+bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set)
+{
+	unsigned int depth = tag_set->queue_depth - tag_set->reserved_tags;
+	int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(tag_set->flags);
+	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
+	int node = tag_set->numa_node;
+
+	if (bt_alloc(&tag_set->__bitmap_tags, depth, round_robin, node))
+		return false;
+	if (bt_alloc(&tag_set->__breserved_tags, tag_set->reserved_tags,
+		     round_robin, node))
+		goto free_bitmap_tags;
+	return true;
+free_bitmap_tags:
+	sbitmap_queue_free(&tag_set->__bitmap_tags);
+	return false;
+}
+
+void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *tag_set)
+{
+	sbitmap_queue_free(&tag_set->__bitmap_tags);
+	sbitmap_queue_free(&tag_set->__breserved_tags);
+}
+
+struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
+				     unsigned int total_tags,
 				     unsigned int reserved_tags,
 				     int node, int alloc_policy)
 {
@@ -502,6 +528,10 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
 {
+	/*
+	 * Do not free tags->{bitmap, breserved}_tags, as this may point to
+	 * shared sbitmap
+	 */
 	sbitmap_queue_free(&tags->__bitmap_tags);
 	sbitmap_queue_free(&tags->__breserved_tags);
 	kfree(tags);
@@ -560,6 +590,11 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 	return 0;
 }
 
+void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set, unsigned int size)
+{
+	sbitmap_queue_resize(&set->__bitmap_tags, size - set->reserved_tags);
+}
+
 /**
  * blk_mq_unique_tag() - return a tag that is unique queue-wide
  * @rq: request for which to compute a unique tag
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index cebf7a4b280a..cf39dd13a24d 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -25,7 +25,12 @@ struct blk_mq_tags {
 };
 
 
-extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
+extern bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set);
+extern void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *tag_set);
+extern struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *tag_set,
+					    unsigned int nr_tags,
+					    unsigned int reserved_tags,
+					    int node, int alloc_policy);
 extern void blk_mq_free_tags(struct blk_mq_tags *tags);
 
 extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
@@ -34,6 +39,9 @@ extern void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
 extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 					struct blk_mq_tags **tags,
 					unsigned int depth, bool can_grow);
+extern void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set,
+					     unsigned int size);
+
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 90b645c3092c..77120dd4e4d5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2229,7 +2229,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 
-	tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
+	tags = blk_mq_init_tags(set, nr_tags, reserved_tags, node,
 				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
 	if (!tags)
 		return NULL;
@@ -3349,11 +3349,28 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (ret)
 		goto out_free_mq_map;
 
+	if (blk_mq_is_sbitmap_shared(set)) {
+		if (!blk_mq_init_shared_sbitmap(set)) {
+			ret = -ENOMEM;
+			goto out_free_mq_rq_maps;
+		}
+
+		for (i = 0; i < set->nr_hw_queues; i++) {
+			struct blk_mq_tags *tags = set->tags[i];
+
+			tags->bitmap_tags = &set->__bitmap_tags;
+			tags->breserved_tags = &set->__breserved_tags;
+		}
+	}
+
 	mutex_init(&set->tag_list_lock);
 	INIT_LIST_HEAD(&set->tag_list);
 
 	return 0;
 
+out_free_mq_rq_maps:
+	for (i = 0; i < set->nr_hw_queues; i++)
+		blk_mq_free_rq_map(set->tags[i]);
 out_free_mq_map:
 	for (i = 0; i < set->nr_maps; i++) {
 		kfree(set->map[i].mq_map);
@@ -3372,6 +3389,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 	for (i = 0; i < set->nr_hw_queues; i++)
 		blk_mq_free_map_and_requests(set, i);
 
+	if (blk_mq_is_sbitmap_shared(set))
+		blk_mq_exit_shared_sbitmap(set);
+
 	for (j = 0; j < set->nr_maps; j++) {
 		kfree(set->map[j].mq_map);
 		set->map[j].mq_map = NULL;
@@ -3408,6 +3428,8 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 		if (!hctx->sched_tags) {
 			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
 							false);
+			if (!ret && blk_mq_is_sbitmap_shared(set))
+				blk_mq_tag_resize_shared_sbitmap(set, nr);
 		} else {
 			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
 							nr, true);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index a139b0631817..1a283c707215 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -158,6 +158,11 @@ struct blk_mq_alloc_data {
 	struct blk_mq_hw_ctx *hctx;
 };
 
+static inline bool blk_mq_is_sbitmap_shared(struct blk_mq_tag_set *tag_set)
+{
+	return tag_set->flags & BLK_MQ_F_TAG_HCTX_SHARED;
+}
+
 static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
 {
 	if (data->flags & BLK_MQ_REQ_INTERNAL)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 233209e8030d..7b31cdb92a71 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -231,6 +231,9 @@ enum hctx_type {
  * @flags:	   Zero or more BLK_MQ_F_* flags.
  * @driver_data:   Pointer to data owned by the block driver that created this
  *		   tag set.
+ * @__bitmap_tags: A shared tags sbitmap, used over all hctx's
+ * @__breserved_tags:
+ *		   A shared reserved tags sbitmap, used over all hctx's
  * @tags:	   Tag sets. One tag set per hardware queue. Has @nr_hw_queues
  *		   elements.
  * @tag_list_lock: Serializes tag_list accesses.
@@ -250,6 +253,8 @@ struct blk_mq_tag_set {
 	unsigned int		flags;
 	void			*driver_data;
 
+	struct sbitmap_queue	__bitmap_tags;
+	struct sbitmap_queue	__breserved_tags;
 	struct blk_mq_tags	**tags;
 
 	struct mutex		tag_list_lock;
@@ -398,6 +403,7 @@ enum {
 	 * completing IO:
 	 */
 	BLK_MQ_F_STACKING	= 1 << 2,
+	BLK_MQ_F_TAG_HCTX_SHARED = 1 << 3,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
-- 
2.26.2



* [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (3 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-11  4:04   ` Ming Lei
  2020-06-10 17:29 ` [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set " John Garry
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

The per-hctx nr_active value can no longer be used to fairly assign a share
of tag depth per request queue when using a shared sbitmap, as it does not
account for the tags being shared over all hctx's.

For this case, record the nr_active_requests per request_queue, and make
the judgment based on that value.

Also introduce a shared-sbitmap version of the per-hctx blk_mq_debugfs_attr
array, omitting hctx_active_show() (as blk_mq_hw_ctx.nr_active is no longer
maintained for the shared sbitmap case) and other entries, which can be
added back later once revised specifically for the shared sbitmap case.

Co-developed-with: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-core.c       |  2 ++
 block/blk-mq-debugfs.c | 23 ++++++++++++++++++++++-
 block/blk-mq-tag.c     | 10 ++++++----
 block/blk-mq.c         |  6 +++---
 block/blk-mq.h         | 28 +++++++++++++++++++++++++++-
 include/linux/blkdev.h |  2 ++
 6 files changed, 62 insertions(+), 9 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 03252af8c82c..c622453c1363 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -529,6 +529,8 @@ struct request_queue *__blk_alloc_queue(int node_id)
 	q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
 	q->node = node_id;
 
+	atomic_set(&q->nr_active_requests_shared_sbitmap, 0);
+
 	timer_setup(&q->backing_dev_info->laptop_mode_wb_timer,
 		    laptop_mode_timer_fn, 0);
 	timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index a400b6698dff..0fa3af41ab65 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -796,6 +796,23 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
 	{},
 };
 
+static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs[] = {
+	{"state", 0400, hctx_state_show},
+	{"flags", 0400, hctx_flags_show},
+	{"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
+	{"busy", 0400, hctx_busy_show},
+	{"ctx_map", 0400, hctx_ctx_map_show},
+	{"sched_tags", 0400, hctx_sched_tags_show},
+	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
+	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
+	{"dispatched", 0600, hctx_dispatched_show, hctx_dispatched_write},
+	{"queued", 0600, hctx_queued_show, hctx_queued_write},
+	{"run", 0600, hctx_run_show, hctx_run_write},
+	{"active", 0400, hctx_active_show},
+	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
+	{}
+};
+
 static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
 	{"default_rq_list", 0400, .seq_ops = &ctx_default_rq_list_seq_ops},
 	{"read_rq_list", 0400, .seq_ops = &ctx_read_rq_list_seq_ops},
@@ -878,13 +895,17 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
 				  struct blk_mq_hw_ctx *hctx)
 {
 	struct blk_mq_ctx *ctx;
+	struct blk_mq_tag_set *set = q->tag_set;
 	char name[20];
 	int i;
 
 	snprintf(name, sizeof(name), "hctx%u", hctx->queue_num);
 	hctx->debugfs_dir = debugfs_create_dir(name, q->debugfs_dir);
 
-	debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs);
+	if (blk_mq_is_sbitmap_shared(set))
+		debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_shared_sbitmap_attrs);
+	else
+		debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs);
 
 	hctx_for_each_ctx(hctx, ctx, i)
 		blk_mq_debugfs_register_ctx(hctx, ctx);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 92843e3e1a2a..7db16e49f6f6 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -60,9 +60,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
  * For shared tag users, we track the number of currently active users
  * and attempt to provide a fair share of the tag depth for each of them.
  */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
+static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
 				  struct sbitmap_queue *bt)
 {
+	struct blk_mq_hw_ctx *hctx = data->hctx;
+	struct request_queue *q = data->q;
 	unsigned int depth, users;
 
 	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
@@ -84,15 +86,15 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 	 * Allow at least some tags
 	 */
 	depth = max((bt->sb.depth + users - 1) / users, 4U);
-	return atomic_read(&hctx->nr_active) < depth;
+	return __blk_mq_active_requests(hctx, q) < depth;
 }
 
 static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 			    struct sbitmap_queue *bt)
 {
 	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
-	    !hctx_may_queue(data->hctx, bt))
-		return BLK_MQ_NO_TAG;
+	    !hctx_may_queue(data, bt))
+		return -1;
 	if (data->shallow_depth)
 		return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
 	else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 77120dd4e4d5..0f7e062a1665 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -283,7 +283,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	} else {
 		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
 			rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
+			__blk_mq_inc_active_requests(data->hctx, data->q);
 		}
 		rq->tag = tag;
 		rq->internal_tag = BLK_MQ_NO_TAG;
@@ -527,7 +527,7 @@ void blk_mq_free_request(struct request *rq)
 
 	ctx->rq_completed[rq_is_sync(rq)]++;
 	if (rq->rq_flags & RQF_MQ_INFLIGHT)
-		atomic_dec(&hctx->nr_active);
+		__blk_mq_dec_active_requests(hctx, q);
 
 	if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
 		laptop_io_completion(q->backing_dev_info);
@@ -1073,7 +1073,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
 	if (rq->tag >= 0) {
 		if (shared) {
 			rq->rq_flags |= RQF_MQ_INFLIGHT;
-			atomic_inc(&data.hctx->nr_active);
+			__blk_mq_inc_active_requests(rq->mq_hctx, rq->q);
 		}
 		data.hctx->tags->rqs[rq->tag] = rq;
 	}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1a283c707215..9c1e612c2298 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -202,6 +202,32 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
+static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx,
+						struct request_queue *q)
+{
+	if (blk_mq_is_sbitmap_shared(q->tag_set))
+		atomic_inc(&q->nr_active_requests_shared_sbitmap);
+	else
+		atomic_inc(&hctx->nr_active);
+}
+
+static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx,
+						struct request_queue *q)
+{
+	if (blk_mq_is_sbitmap_shared(q->tag_set))
+		atomic_dec(&q->nr_active_requests_shared_sbitmap);
+	else
+		atomic_dec(&hctx->nr_active);
+}
+
+static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx,
+					   struct request_queue *q)
+{
+	if (blk_mq_is_sbitmap_shared(q->tag_set))
+		return atomic_read(&q->nr_active_requests_shared_sbitmap);
+	return atomic_read(&hctx->nr_active);
+}
+
 static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
 {
@@ -210,7 +236,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
 		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
-		atomic_dec(&hctx->nr_active);
+		__blk_mq_dec_active_requests(hctx, rq->q);
 	}
 }
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8fd900998b4e..c536278bec9e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -488,6 +488,8 @@ struct request_queue {
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 
+	atomic_t		nr_active_requests_shared_sbitmap;
+
 	struct list_head	icq_list;
 #ifdef CONFIG_BLK_CGROUP
 	DECLARE_BITMAP		(blkcg_pols, BLKCG_MAX_POLS);
-- 
2.26.2



* [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (4 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-11 13:16   ` Hannes Reinecke
  2020-06-10 17:29 ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a " John Garry
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

When using a shared sbitmap, the number of active request queues per hctx
can no longer be relied on when judging how to share the tag bitmap.

Instead maintain the number of active request queues per tag_set, and make
the judgment based on that.
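
As a rough worked example of the fairness calculation (numbers are
illustrative only): with a hostwide sbitmap depth of 4000 and 8 active
request queues, hctx_may_queue() allows each queue roughly
max((4000 + 8 - 1) / 8, 4) = 500 in-flight requests, now judged against
the per-queue nr_active_requests count rather than the per-hctx one.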

And since the blk_mq_tags.active_queues is no longer maintained, do not
show it in debugfs.

Originally-from: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-debugfs.c | 25 ++++++++++++++++++++--
 block/blk-mq-tag.c     | 47 ++++++++++++++++++++++++++++++++----------
 block/blk-mq.c         |  2 ++
 include/linux/blk-mq.h |  1 +
 include/linux/blkdev.h |  1 +
 5 files changed, 63 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 0fa3af41ab65..05b4be0c03d9 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -458,17 +458,37 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
 	}
 }
 
+static void blk_mq_debugfs_tags_shared_sbitmap_show(struct seq_file *m,
+				     struct blk_mq_tags *tags)
+{
+	seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
+	seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
+
+	seq_puts(m, "\nbitmap_tags:\n");
+	sbitmap_queue_show(tags->bitmap_tags, m);
+
+	if (tags->nr_reserved_tags) {
+		seq_puts(m, "\nbreserved_tags:\n");
+		sbitmap_queue_show(tags->breserved_tags, m);
+	}
+}
+
 static int hctx_tags_show(void *data, struct seq_file *m)
 {
 	struct blk_mq_hw_ctx *hctx = data;
 	struct request_queue *q = hctx->queue;
+	struct blk_mq_tag_set *set = q->tag_set;
 	int res;
 
 	res = mutex_lock_interruptible(&q->sysfs_lock);
 	if (res)
 		goto out;
-	if (hctx->tags)
-		blk_mq_debugfs_tags_show(m, hctx->tags);
+	if (hctx->tags) {
+		if (blk_mq_is_sbitmap_shared(set))
+			blk_mq_debugfs_tags_shared_sbitmap_show(m, hctx->tags);
+		else
+			blk_mq_debugfs_tags_show(m, hctx->tags);
+	}
 	mutex_unlock(&q->sysfs_lock);
 
 out:
@@ -802,6 +822,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs
 	{"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
 	{"busy", 0400, hctx_busy_show},
 	{"ctx_map", 0400, hctx_ctx_map_show},
+	{"tags", 0400, hctx_tags_show},
 	{"sched_tags", 0400, hctx_sched_tags_show},
 	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
 	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 7db16e49f6f6..6ca06b1c3a99 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -23,9 +23,19 @@
  */
 bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
-	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
-	    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-		atomic_inc(&hctx->tags->active_queues);
+	struct request_queue *q = hctx->queue;
+	struct blk_mq_tag_set *set = q->tag_set;
+
+	if (blk_mq_is_sbitmap_shared(set)){
+		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) &&
+		    !test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
+			atomic_inc(&set->active_queues_shared_sbitmap);
+
+	} else {
+		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
+		    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+			atomic_inc(&hctx->tags->active_queues);
+	}
 
 	return true;
 }
@@ -47,11 +57,19 @@ void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
 void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
 	struct blk_mq_tags *tags = hctx->tags;
-
-	if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-		return;
-
-	atomic_dec(&tags->active_queues);
+	struct request_queue *q = hctx->queue;
+	struct blk_mq_tag_set *set = q->tag_set;
+
+	if (blk_mq_is_sbitmap_shared(q->tag_set)){
+		if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE,
+					&q->queue_flags))
+			return;
+		atomic_dec(&set->active_queues_shared_sbitmap);
+	} else {
+		if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+			return;
+		atomic_dec(&tags->active_queues);
+	}
 
 	blk_mq_tag_wakeup_all(tags, false);
 }
@@ -65,12 +83,11 @@ static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
 {
 	struct blk_mq_hw_ctx *hctx = data->hctx;
 	struct request_queue *q = data->q;
+	struct blk_mq_tag_set *set = q->tag_set;
 	unsigned int depth, users;
 
 	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return true;
-	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-		return true;
 
 	/*
 	 * Don't try dividing an ant
@@ -78,7 +95,15 @@ static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
 	if (bt->sb.depth == 1)
 		return true;
 
-	users = atomic_read(&hctx->tags->active_queues);
+	if (blk_mq_is_sbitmap_shared(q->tag_set)) {
+		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &q->queue_flags))
+			return true;
+		users = atomic_read(&set->active_queues_shared_sbitmap);
+	} else {
+		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+			return true;
+		users = atomic_read(&hctx->tags->active_queues);
+	}
 	if (!users)
 		return true;
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0f7e062a1665..f73a2f9c58bd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3350,6 +3350,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 		goto out_free_mq_map;
 
 	if (blk_mq_is_sbitmap_shared(set)) {
+		atomic_set(&set->active_queues_shared_sbitmap, 0);
+
 		if (!blk_mq_init_shared_sbitmap(set)) {
 			ret = -ENOMEM;
 			goto out_free_mq_rq_maps;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 7b31cdb92a71..66711c7234db 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -252,6 +252,7 @@ struct blk_mq_tag_set {
 	unsigned int		timeout;
 	unsigned int		flags;
 	void			*driver_data;
+	atomic_t		active_queues_shared_sbitmap;
 
 	struct sbitmap_queue	__bitmap_tags;
 	struct sbitmap_queue	__breserved_tags;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c536278bec9e..1b0087e8d01a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -619,6 +619,7 @@ struct request_queue {
 #define QUEUE_FLAG_PCI_P2PDMA	25	/* device supports PCI p2p requests */
 #define QUEUE_FLAG_ZONE_RESETALL 26	/* supports Zone Reset All */
 #define QUEUE_FLAG_RQ_ALLOC_TIME 27	/* record rq->alloc_time_ns */
+#define QUEUE_FLAG_HCTX_ACTIVE 28	/* at least one blk-mq hctx is active */
 
 #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_SAME_COMP))
-- 
2.26.2



* [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (5 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set " John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-11 13:19   ` Hannes Reinecke
  2020-06-10 17:29 ` [PATCH RFC v7 08/12] scsi: Add template flag 'host_tagset' John Garry
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

When a set-wide shared tag sbitmap is used, it is no longer valid to
examine the per-hctx tags for getting the active requests for a hctx.

As such, add support for the shared sbitmap by using an intermediate
sbitmap per hctx, iterating all active tags in the shared sbitmap and
marking those which belong to the specific hctx.

Originally-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de> #earlier version
Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-debugfs.c | 62 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 05b4be0c03d9..4da7e54adf3b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -495,6 +495,67 @@ static int hctx_tags_show(void *data, struct seq_file *m)
 	return res;
 }
 
+struct hctx_sb_data {
+	struct sbitmap		*sb;	/* output bitmap */
+	struct blk_mq_hw_ctx	*hctx;	/* input hctx */
+};
+
+static bool hctx_filter_fn(struct blk_mq_hw_ctx *hctx, struct request *req,
+			   void *priv, bool reserved)
+{
+	struct hctx_sb_data *hctx_sb_data = priv;
+
+	if (hctx == hctx_sb_data->hctx)
+		sbitmap_set_bit(hctx_sb_data->sb, req->tag);
+
+	return true;
+}
+
+static void hctx_filter_sb(struct sbitmap *sb, struct blk_mq_hw_ctx *hctx)
+{
+	struct hctx_sb_data hctx_sb_data = { .sb = sb, .hctx = hctx };
+
+	blk_mq_queue_tag_busy_iter(hctx->queue, hctx_filter_fn, &hctx_sb_data);
+}
+
+static int hctx_tags_shared_sbitmap_bitmap_show(void *data, struct seq_file *m)
+{
+	struct blk_mq_hw_ctx *hctx = data;
+	struct request_queue *q = hctx->queue;
+	struct blk_mq_tag_set *set = q->tag_set;
+	struct sbitmap shared_sb, *sb;
+	int res;
+
+	if (!set)
+		return 0;
+
+	/*
+	 * We could use the allocated sbitmap for that hctx here, but
+	 * that would mean that we would need to clean it prior to use.
+	 */
+	res = sbitmap_init_node(&shared_sb,
+				set->__bitmap_tags.sb.depth,
+				set->__bitmap_tags.sb.shift,
+				GFP_KERNEL, NUMA_NO_NODE);
+	if (res)
+		return res;
+	sb = &shared_sb;
+
+	res = mutex_lock_interruptible(&q->sysfs_lock);
+	if (res)
+		goto out;
+	if (hctx->tags) {
+		hctx_filter_sb(sb, hctx);
+		sbitmap_bitmap_show(sb, m);
+	}
+
+	mutex_unlock(&q->sysfs_lock);
+
+out:
+	sbitmap_free(&shared_sb);
+	return res;
+}
+
 static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
 {
 	struct blk_mq_hw_ctx *hctx = data;
@@ -823,6 +884,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs
 	{"busy", 0400, hctx_busy_show},
 	{"ctx_map", 0400, hctx_ctx_map_show},
 	{"tags", 0400, hctx_tags_show},
+	{"tags_bitmap", 0400, hctx_tags_shared_sbitmap_bitmap_show},
 	{"sched_tags", 0400, hctx_sched_tags_show},
 	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
 	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
-- 
2.26.2



* [PATCH RFC v7 08/12] scsi: Add template flag 'host_tagset'
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (6 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a " John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 09/12] scsi: hisi_sas: Switch v3 hw to MQ John Garry
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

From: Hannes Reinecke <hare@suse.com>

Add a host template flag 'host_tagset' so that a hostwide tagset can be
shared across multiple reply queues, once the SCSI device's reply queues
are converted to blk-mq hw queues.
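
A minimal sketch of how an LLDD could opt in (illustrative only, not
taken from this series; the name, depth and queuecommand handler are
placeholders):

	static int example_queuecommand(struct Scsi_Host *shost,
					struct scsi_cmnd *scmd);

	static struct scsi_host_template example_sht = {
		.name		= "example_hba",
		.queuecommand	= example_queuecommand,
		.can_queue	= 4096,	/* total depth for the whole host */
		.host_tagset	= 1,	/* share tags across all HW queues */
	};

The driver would then also set shost->nr_hw_queues in its probe path to
expose its hardware queues.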

Signed-off-by: Hannes Reinecke <hare@suse.com>
jpg: Update comment on can_queue
Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/scsi/scsi_lib.c  | 2 ++
 include/scsi/scsi_host.h | 6 +++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0ba7a65e7c8d..0652acdcec22 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1894,6 +1894,8 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 	tag_set->flags |=
 		BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
 	tag_set->driver_data = shost;
+	if (shost->hostt->host_tagset)
+		tag_set->flags |= BLK_MQ_F_TAG_HCTX_SHARED;
 
 	return blk_mq_alloc_tag_set(tag_set);
 }
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 46ef8cccc982..9b7e333a681d 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -436,6 +436,9 @@ struct scsi_host_template {
 	/* True if the controller does not support WRITE SAME */
 	unsigned no_write_same:1;
 
+	/* True if the host uses host-wide tagspace */
+	unsigned host_tagset:1;
+
 	/*
 	 * Countdown for host blocking with no commands outstanding.
 	 */
@@ -603,7 +606,8 @@ struct Scsi_Host {
 	 *
 	 * Note: it is assumed that each hardware queue has a queue depth of
 	 * can_queue. In other words, the total queue depth per host
-	 * is nr_hw_queues * can_queue.
+	 * is nr_hw_queues * can_queue. However, for when host_tagset is set,
+	 * the total queue depth is can_queue.
 	 */
 	unsigned nr_hw_queues;
 	unsigned active_mode:2;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH RFC v7 09/12] scsi: hisi_sas: Switch v3 hw to MQ
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (7 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 08/12] scsi: Add template flag 'host_tagset' John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters " John Garry
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 45+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

Now that the block layer provides shared tags, we can switch the driver
to expose all HW queues.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/scsi/hisi_sas/hisi_sas.h       |  3 +-
 drivers/scsi/hisi_sas/hisi_sas_main.c  | 36 ++++++-----
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 87 +++++++++++---------------
 3 files changed, 56 insertions(+), 70 deletions(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas.h b/drivers/scsi/hisi_sas/hisi_sas.h
index 2bdd64648ef0..e6acbf940712 100644
--- a/drivers/scsi/hisi_sas/hisi_sas.h
+++ b/drivers/scsi/hisi_sas/hisi_sas.h
@@ -8,6 +8,8 @@
 #define _HISI_SAS_H_
 
 #include <linux/acpi.h>
+#include <linux/blk-mq.h>
+#include <linux/blk-mq-pci.h>
 #include <linux/clk.h>
 #include <linux/debugfs.h>
 #include <linux/dmapool.h>
@@ -431,7 +433,6 @@ struct hisi_hba {
 	u32 intr_coal_count;	/* Interrupt count to coalesce */
 
 	int cq_nvecs;
-	unsigned int *reply_map;
 
 	/* bist */
 	enum sas_linkrate debugfs_bist_linkrate;
diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c
index 11caa4b0d797..7ed4eaedb7ca 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_main.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_main.c
@@ -417,6 +417,7 @@ static int hisi_sas_task_prep(struct sas_task *task,
 	struct device *dev = hisi_hba->dev;
 	int dlvry_queue_slot, dlvry_queue, rc, slot_idx;
 	int n_elem = 0, n_elem_dif = 0, n_elem_req = 0;
+	struct scsi_cmnd *scmd = NULL;
 	struct hisi_sas_dq *dq;
 	unsigned long flags;
 	int wr_q_index;
@@ -432,10 +433,23 @@ static int hisi_sas_task_prep(struct sas_task *task,
 		return -ECOMM;
 	}
 
-	if (hisi_hba->reply_map) {
-		int cpu = raw_smp_processor_id();
-		unsigned int dq_index = hisi_hba->reply_map[cpu];
+	if (task->uldd_task) {
+		struct ata_queued_cmd *qc;
 
+		if (dev_is_sata(device)) {
+			qc = task->uldd_task;
+			scmd = qc->scsicmd;
+		} else {
+			scmd = task->uldd_task;
+		}
+	}
+
+	if (scmd) {
+		unsigned int dq_index;
+		u32 blk_tag;
+
+		blk_tag = blk_mq_unique_tag(scmd->request);
+		dq_index = blk_mq_unique_tag_to_hwq(blk_tag);
 		*dq_pointer = dq = &hisi_hba->dq[dq_index];
 	} else {
 		*dq_pointer = dq = sas_dev->dq;
@@ -464,21 +478,9 @@ static int hisi_sas_task_prep(struct sas_task *task,
 
 	if (hisi_hba->hw->slot_index_alloc)
 		rc = hisi_hba->hw->slot_index_alloc(hisi_hba, device);
-	else {
-		struct scsi_cmnd *scsi_cmnd = NULL;
-
-		if (task->uldd_task) {
-			struct ata_queued_cmd *qc;
+	else
+		rc = hisi_sas_slot_index_alloc(hisi_hba, scmd);
 
-			if (dev_is_sata(device)) {
-				qc = task->uldd_task;
-				scsi_cmnd = qc->scsicmd;
-			} else {
-				scsi_cmnd = task->uldd_task;
-			}
-		}
-		rc  = hisi_sas_slot_index_alloc(hisi_hba, scsi_cmnd);
-	}
 	if (rc < 0)
 		goto err_out_dif_dma_unmap;
 
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index 3e6b78a1f993..e22231403bbb 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -2360,68 +2360,36 @@ static irqreturn_t cq_interrupt_v3_hw(int irq_no, void *p)
 	return IRQ_WAKE_THREAD;
 }
 
-static void setup_reply_map_v3_hw(struct hisi_hba *hisi_hba, int nvecs)
+static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
 {
-	const struct cpumask *mask;
-	int queue, cpu;
+	int vectors;
+	int max_msi = HISI_SAS_MSI_COUNT_V3_HW, min_msi;
+	struct Scsi_Host *shost = hisi_hba->shost;
+	struct irq_affinity desc = {
+		.pre_vectors = BASE_VECTORS_V3_HW,
+	};
 
-	for (queue = 0; queue < nvecs; queue++) {
-		struct hisi_sas_cq *cq = &hisi_hba->cq[queue];
+	min_msi = MIN_AFFINE_VECTORS_V3_HW;
+	vectors = pci_alloc_irq_vectors_affinity(hisi_hba->pci_dev,
+						 min_msi, max_msi,
+						 PCI_IRQ_MSI |
+						 PCI_IRQ_AFFINITY,
+						 &desc);
+	if (vectors < 0)
+		return -ENOENT;
 
-		mask = pci_irq_get_affinity(hisi_hba->pci_dev, queue +
-					    BASE_VECTORS_V3_HW);
-		if (!mask)
-			goto fallback;
-		cq->irq_mask = mask;
-		for_each_cpu(cpu, mask)
-			hisi_hba->reply_map[cpu] = queue;
-	}
-	return;
 
-fallback:
-	for_each_possible_cpu(cpu)
-		hisi_hba->reply_map[cpu] = cpu % hisi_hba->queue_count;
-	/* Don't clean all CQ masks */
+	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW;
+	shost->nr_hw_queues = hisi_hba->cq_nvecs;
+
+	return 0;
 }
 
 static int interrupt_init_v3_hw(struct hisi_hba *hisi_hba)
 {
 	struct device *dev = hisi_hba->dev;
 	struct pci_dev *pdev = hisi_hba->pci_dev;
-	int vectors, rc, i;
-	int max_msi = HISI_SAS_MSI_COUNT_V3_HW, min_msi;
-
-	if (auto_affine_msi_experimental) {
-		struct irq_affinity desc = {
-			.pre_vectors = BASE_VECTORS_V3_HW,
-		};
-
-		dev_info(dev, "Enable MSI auto-affinity\n");
-
-		min_msi = MIN_AFFINE_VECTORS_V3_HW;
-
-		hisi_hba->reply_map = devm_kcalloc(dev, nr_cpu_ids,
-						   sizeof(unsigned int),
-						   GFP_KERNEL);
-		if (!hisi_hba->reply_map)
-			return -ENOMEM;
-		vectors = pci_alloc_irq_vectors_affinity(hisi_hba->pci_dev,
-							 min_msi, max_msi,
-							 PCI_IRQ_MSI |
-							 PCI_IRQ_AFFINITY,
-							 &desc);
-		if (vectors < 0)
-			return -ENOENT;
-		setup_reply_map_v3_hw(hisi_hba, vectors - BASE_VECTORS_V3_HW);
-	} else {
-		min_msi = max_msi;
-		vectors = pci_alloc_irq_vectors(hisi_hba->pci_dev, min_msi,
-						max_msi, PCI_IRQ_MSI);
-		if (vectors < 0)
-			return vectors;
-	}
-
-	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW;
+	int rc, i;
 
 	rc = devm_request_irq(dev, pci_irq_vector(pdev, 1),
 			      int_phy_up_down_bcast_v3_hw, 0,
@@ -3070,6 +3038,15 @@ static int debugfs_set_bist_v3_hw(struct hisi_hba *hisi_hba, bool enable)
 	return 0;
 }
 
+static int hisi_sas_map_queues(struct Scsi_Host *shost)
+{
+	struct hisi_hba *hisi_hba = shost_priv(shost);
+	struct blk_mq_queue_map *qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
+
+	return blk_mq_pci_map_queues(qmap, hisi_hba->pci_dev,
+				     BASE_VECTORS_V3_HW);
+}
+
 static struct scsi_host_template sht_v3_hw = {
 	.name			= DRV_NAME,
 	.proc_name		= DRV_NAME,
@@ -3079,6 +3056,7 @@ static struct scsi_host_template sht_v3_hw = {
 	.slave_configure	= hisi_sas_slave_configure,
 	.scan_finished		= hisi_sas_scan_finished,
 	.scan_start		= hisi_sas_scan_start,
+	.map_queues		= hisi_sas_map_queues,
 	.change_queue_depth	= sas_change_queue_depth,
 	.bios_param		= sas_bios_param,
 	.this_id		= -1,
@@ -3095,6 +3073,7 @@ static struct scsi_host_template sht_v3_hw = {
 	.shost_attrs		= host_attrs_v3_hw,
 	.tag_alloc_policy	= BLK_TAG_ALLOC_RR,
 	.host_reset             = hisi_sas_host_reset,
+	.host_tagset		= 1,
 };
 
 static const struct hisi_sas_hw hisi_sas_v3_hw = {
@@ -3266,6 +3245,10 @@ hisi_sas_v3_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (hisi_sas_debugfs_enable)
 		hisi_sas_debugfs_init(hisi_hba);
 
+	rc = interrupt_preinit_v3_hw(hisi_hba);
+	if (rc)
+		goto err_out_ha;
+	dev_err(dev, "%d hw queues\n", shost->nr_hw_queues);
 	rc = scsi_add_host(shost, dev);
 	if (rc)
 		goto err_out_ha;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (8 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 09/12] scsi: hisi_sas: Switch v3 hw to MQ John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-07-02 10:23   ` Kashyap Desai
  2020-06-10 17:29 ` [PATCH RFC v7 11/12] smartpqi: enable host tagset John Garry
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 45+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, John Garry

From: Hannes Reinecke <hare@suse.com>

Fusion adapters can steer completions to individual queues, and
we now have support for shared host-wide tags.
So we can enable multiqueue support for fusion adapters and
drop the hand-crafted interrupt affinity settings.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/scsi/megaraid/megaraid_sas.h        |  1 -
 drivers/scsi/megaraid/megaraid_sas_base.c   | 59 +++++++--------------
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 24 +++++----
 3 files changed, 32 insertions(+), 52 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas.h b/drivers/scsi/megaraid/megaraid_sas.h
index af2c7a2a9565..b27a34a5f5de 100644
--- a/drivers/scsi/megaraid/megaraid_sas.h
+++ b/drivers/scsi/megaraid/megaraid_sas.h
@@ -2261,7 +2261,6 @@ enum MR_PERF_MODE {
 
 struct megasas_instance {
 
-	unsigned int *reply_map;
 	__le32 *producer;
 	dma_addr_t producer_h;
 	__le32 *consumer;
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index 00668335c2af..e6bb2a64d51c 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -37,6 +37,7 @@
 #include <linux/poll.h>
 #include <linux/vmalloc.h>
 #include <linux/irq_poll.h>
+#include <linux/blk-mq-pci.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -3115,6 +3116,19 @@ megasas_bios_param(struct scsi_device *sdev, struct block_device *bdev,
 	return 0;
 }
 
+static int megasas_map_queues(struct Scsi_Host *shost)
+{
+	struct megasas_instance *instance;
+
+	instance = (struct megasas_instance *)shost->hostdata;
+
+	if (!instance->smp_affinity_enable)
+		return 0;
+
+	return blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
+			instance->pdev, instance->low_latency_index_start);
+}
+
 static void megasas_aen_polling(struct work_struct *work);
 
 /**
@@ -3423,8 +3437,10 @@ static struct scsi_host_template megasas_template = {
 	.eh_timed_out = megasas_reset_timer,
 	.shost_attrs = megaraid_host_attrs,
 	.bios_param = megasas_bios_param,
+	.map_queues = megasas_map_queues,
 	.change_queue_depth = scsi_change_queue_depth,
 	.max_segment_size = 0xffffffff,
+	.host_tagset = 1,
 };
 
 /**
@@ -5708,34 +5724,6 @@ megasas_setup_jbod_map(struct megasas_instance *instance)
 		instance->use_seqnum_jbod_fp = false;
 }
 
-static void megasas_setup_reply_map(struct megasas_instance *instance)
-{
-	const struct cpumask *mask;
-	unsigned int queue, cpu, low_latency_index_start;
-
-	low_latency_index_start = instance->low_latency_index_start;
-
-	for (queue = low_latency_index_start; queue < instance->msix_vectors; queue++) {
-		mask = pci_irq_get_affinity(instance->pdev, queue);
-		if (!mask)
-			goto fallback;
-
-		for_each_cpu(cpu, mask)
-			instance->reply_map[cpu] = queue;
-	}
-	return;
-
-fallback:
-	queue = low_latency_index_start;
-	for_each_possible_cpu(cpu) {
-		instance->reply_map[cpu] = queue;
-		if (queue == (instance->msix_vectors - 1))
-			queue = low_latency_index_start;
-		else
-			queue++;
-	}
-}
-
 /**
  * megasas_get_device_list -	Get the PD and LD device list from FW.
  * @instance:			Adapter soft state
@@ -6158,8 +6146,6 @@ static int megasas_init_fw(struct megasas_instance *instance)
 			goto fail_init_adapter;
 	}
 
-	megasas_setup_reply_map(instance);
-
 	dev_info(&instance->pdev->dev,
 		"current msix/online cpus\t: (%d/%d)\n",
 		instance->msix_vectors, (unsigned int)num_online_cpus());
@@ -6793,6 +6779,9 @@ static int megasas_io_attach(struct megasas_instance *instance)
 	host->max_id = MEGASAS_MAX_DEV_PER_CHANNEL;
 	host->max_lun = MEGASAS_MAX_LUN;
 	host->max_cmd_len = 16;
+	if (instance->adapter_type != MFI_SERIES && instance->msix_vectors > 0)
+		host->nr_hw_queues = instance->msix_vectors -
+			instance->low_latency_index_start;
 
 	/*
 	 * Notify the mid-layer about the new controller
@@ -6960,11 +6949,6 @@ static inline int megasas_alloc_mfi_ctrl_mem(struct megasas_instance *instance)
  */
 static int megasas_alloc_ctrl_mem(struct megasas_instance *instance)
 {
-	instance->reply_map = kcalloc(nr_cpu_ids, sizeof(unsigned int),
-				      GFP_KERNEL);
-	if (!instance->reply_map)
-		return -ENOMEM;
-
 	switch (instance->adapter_type) {
 	case MFI_SERIES:
 		if (megasas_alloc_mfi_ctrl_mem(instance))
@@ -6981,8 +6965,6 @@ static int megasas_alloc_ctrl_mem(struct megasas_instance *instance)
 
 	return 0;
  fail:
-	kfree(instance->reply_map);
-	instance->reply_map = NULL;
 	return -ENOMEM;
 }
 
@@ -6995,7 +6977,6 @@ static int megasas_alloc_ctrl_mem(struct megasas_instance *instance)
  */
 static inline void megasas_free_ctrl_mem(struct megasas_instance *instance)
 {
-	kfree(instance->reply_map);
 	if (instance->adapter_type == MFI_SERIES) {
 		if (instance->producer)
 			dma_free_coherent(&instance->pdev->dev, sizeof(u32),
@@ -7683,8 +7664,6 @@ megasas_resume(struct pci_dev *pdev)
 			goto fail_reenable_msix;
 	}
 
-	megasas_setup_reply_map(instance);
-
 	if (instance->adapter_type != MFI_SERIES) {
 		megasas_reset_reply_desc(instance);
 		if (megasas_ioc_init_fusion(instance)) {
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index 319f241da4b6..8e25b700988e 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -373,24 +373,24 @@ megasas_get_msix_index(struct megasas_instance *instance,
 {
 	int sdev_busy;
 
-	/* nr_hw_queue = 1 for MegaRAID */
-	struct blk_mq_hw_ctx *hctx =
-		scmd->device->request_queue->queue_hw_ctx[0];
+	struct blk_mq_hw_ctx *hctx = scmd->request->mq_hctx;
 
 	sdev_busy = atomic_read(&hctx->nr_active);
 
 	if (instance->perf_mode == MR_BALANCED_PERF_MODE &&
-	    sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH))
+	    sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH)) {
 		cmd->request_desc->SCSIIO.MSIxIndex =
 			mega_mod64((atomic64_add_return(1, &instance->high_iops_outstanding) /
 					MR_HIGH_IOPS_BATCH_COUNT), instance->low_latency_index_start);
-	else if (instance->msix_load_balance)
+	} else if (instance->msix_load_balance) {
 		cmd->request_desc->SCSIIO.MSIxIndex =
 			(mega_mod64(atomic64_add_return(1, &instance->total_io_count),
 				instance->msix_vectors));
-	else
-		cmd->request_desc->SCSIIO.MSIxIndex =
-			instance->reply_map[raw_smp_processor_id()];
+	} else {
+		u32 tag = blk_mq_unique_tag(scmd->request);
+
+		cmd->request_desc->SCSIIO.MSIxIndex = blk_mq_unique_tag_to_hwq(tag) + instance->low_latency_index_start;
+	}
 }
 
 /**
@@ -3326,7 +3326,7 @@ megasas_build_and_issue_cmd_fusion(struct megasas_instance *instance,
 {
 	struct megasas_cmd_fusion *cmd, *r1_cmd = NULL;
 	union MEGASAS_REQUEST_DESCRIPTOR_UNION *req_desc;
-	u32 index;
+	u32 index, blk_tag, unique_tag;
 
 	if ((megasas_cmd_type(scmd) == READ_WRITE_LDIO) &&
 		instance->ldio_threshold &&
@@ -3342,7 +3342,9 @@ megasas_build_and_issue_cmd_fusion(struct megasas_instance *instance,
 		return SCSI_MLQUEUE_HOST_BUSY;
 	}
 
-	cmd = megasas_get_cmd_fusion(instance, scmd->request->tag);
+	unique_tag = blk_mq_unique_tag(scmd->request);
+	blk_tag = blk_mq_unique_tag_to_tag(unique_tag);
+	cmd = megasas_get_cmd_fusion(instance, blk_tag);
 
 	if (!cmd) {
 		atomic_dec(&instance->fw_outstanding);
@@ -3383,7 +3385,7 @@ megasas_build_and_issue_cmd_fusion(struct megasas_instance *instance,
 	 */
 	if (cmd->r1_alt_dev_handle != MR_DEVHANDLE_INVALID) {
 		r1_cmd = megasas_get_cmd_fusion(instance,
-				(scmd->request->tag + instance->max_fw_cmds));
+				(blk_tag + instance->max_fw_cmds));
 		megasas_prepare_secondRaid1_IO(instance, cmd, r1_cmd);
 	}
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH RFC v7 11/12] smartpqi: enable host tagset
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (9 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters " John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-10 17:29 ` [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ John Garry
  2020-06-11  3:07 ` [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs Ming Lei
  12 siblings, 0 replies; 45+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, Hannes Reinecke

From: Hannes Reinecke <hare@suse.de>

Enable host tagset for smartpqi; with this we can use the request
tag to look up the command from the pool, avoiding the list iteration
in the hot path.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/scsi/smartpqi/smartpqi_init.c | 38 ++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index cd157f11eb22..1f4de4c2d876 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -575,17 +575,29 @@ static inline void pqi_reinit_io_request(struct pqi_io_request *io_request)
 }
 
 static struct pqi_io_request *pqi_alloc_io_request(
-	struct pqi_ctrl_info *ctrl_info)
+	struct pqi_ctrl_info *ctrl_info, struct scsi_cmnd *scmd)
 {
 	struct pqi_io_request *io_request;
+	unsigned int limit = PQI_RESERVED_IO_SLOTS;
 	u16 i = ctrl_info->next_io_request_slot;	/* benignly racy */
 
-	while (1) {
+	if (scmd) {
+		u32 blk_tag = blk_mq_unique_tag(scmd->request);
+
+		i = blk_mq_unique_tag_to_tag(blk_tag) + limit;
 		io_request = &ctrl_info->io_request_pool[i];
-		if (atomic_inc_return(&io_request->refcount) == 1)
-			break;
-		atomic_dec(&io_request->refcount);
-		i = (i + 1) % ctrl_info->max_io_slots;
+		if (WARN_ON(atomic_inc_return(&io_request->refcount) > 1)) {
+			atomic_dec(&io_request->refcount);
+			return NULL;
+		}
+	} else {
+		while (1) {
+			io_request = &ctrl_info->io_request_pool[i];
+			if (atomic_inc_return(&io_request->refcount) == 1)
+				break;
+			atomic_dec(&io_request->refcount);
+			i = (i + 1) % limit;
+		}
 	}
 
 	/* benignly racy */
@@ -4075,7 +4087,7 @@ static int pqi_submit_raid_request_synchronous(struct pqi_ctrl_info *ctrl_info,
 
 	atomic_inc(&ctrl_info->sync_cmds_outstanding);
 
-	io_request = pqi_alloc_io_request(ctrl_info);
+	io_request = pqi_alloc_io_request(ctrl_info, NULL);
 
 	put_unaligned_le16(io_request->index,
 		&(((struct pqi_raid_path_request *)request)->request_id));
@@ -5032,7 +5044,9 @@ static inline int pqi_raid_submit_scsi_cmd(struct pqi_ctrl_info *ctrl_info,
 {
 	struct pqi_io_request *io_request;
 
-	io_request = pqi_alloc_io_request(ctrl_info);
+	io_request = pqi_alloc_io_request(ctrl_info, scmd);
+	if (!io_request)
+		return SCSI_MLQUEUE_HOST_BUSY;
 
 	return pqi_raid_submit_scsi_cmd_with_io_request(ctrl_info, io_request,
 		device, scmd, queue_group);
@@ -5230,7 +5244,10 @@ static int pqi_aio_submit_io(struct pqi_ctrl_info *ctrl_info,
 	struct pqi_io_request *io_request;
 	struct pqi_aio_path_request *request;
 
-	io_request = pqi_alloc_io_request(ctrl_info);
+	io_request = pqi_alloc_io_request(ctrl_info, scmd);
+	if (!io_request)
+		return SCSI_MLQUEUE_HOST_BUSY;
+
 	io_request->io_complete_callback = pqi_aio_io_complete;
 	io_request->scmd = scmd;
 	io_request->raid_bypass = raid_bypass;
@@ -5657,7 +5674,7 @@ static int pqi_lun_reset(struct pqi_ctrl_info *ctrl_info,
 	DECLARE_COMPLETION_ONSTACK(wait);
 	struct pqi_task_management_request *request;
 
-	io_request = pqi_alloc_io_request(ctrl_info);
+	io_request = pqi_alloc_io_request(ctrl_info, NULL);
 	io_request->io_complete_callback = pqi_lun_reset_complete;
 	io_request->context = &wait;
 
@@ -6504,6 +6521,7 @@ static struct scsi_host_template pqi_driver_template = {
 	.map_queues = pqi_map_queues,
 	.sdev_attrs = pqi_sdev_attrs,
 	.shost_attrs = pqi_shost_attrs,
+	.host_tagset = 1,
 };
 
 static int pqi_register_scsi(struct pqi_ctrl_info *ctrl_info)
-- 
2.26.2


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (10 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 11/12] smartpqi: enable host tagset John Garry
@ 2020-06-10 17:29 ` John Garry
  2020-06-11  3:07 ` [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs Ming Lei
  12 siblings, 0 replies; 45+ messages in thread
From: John Garry @ 2020-06-10 17:29 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, Hannes Reinecke

From: Hannes Reinecke <hare@suse.de>

The Smart Array HBAs can steer interrupt completions, so this
patch switches the implementation to use multiqueue and enables
'host_tagset' as the HBA has a shared host-wide tagset.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/scsi/hpsa.c | 44 +++++++-------------------------------------
 drivers/scsi/hpsa.h |  1 -
 2 files changed, 7 insertions(+), 38 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 1e9302e99d05..f807f9bdae85 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -980,6 +980,7 @@ static struct scsi_host_template hpsa_driver_template = {
 	.shost_attrs = hpsa_shost_attrs,
 	.max_sectors = 2048,
 	.no_write_same = 1,
+	.host_tagset = 1,
 };
 
 static inline u32 next_command(struct ctlr_info *h, u8 q)
@@ -1144,12 +1145,14 @@ static void dial_up_lockup_detection_on_fw_flash_complete(struct ctlr_info *h,
 static void __enqueue_cmd_and_start_io(struct ctlr_info *h,
 	struct CommandList *c, int reply_queue)
 {
+	u32 blk_tag = blk_mq_unique_tag(c->scsi_cmd->request);
+
 	dial_down_lockup_detection_during_fw_flash(h, c);
 	atomic_inc(&h->commands_outstanding);
 	if (c->device)
 		atomic_inc(&c->device->commands_outstanding);
 
-	reply_queue = h->reply_map[raw_smp_processor_id()];
+	reply_queue = blk_mq_unique_tag_to_hwq(blk_tag);
 	switch (c->cmd_type) {
 	case CMD_IOACCEL1:
 		set_ioaccel1_performant_mode(h, c, reply_queue);
@@ -5653,8 +5656,6 @@ static int hpsa_scsi_queue_command(struct Scsi_Host *sh, struct scsi_cmnd *cmd)
 	/* Get the ptr to our adapter structure out of cmd->host. */
 	h = sdev_to_hba(cmd->device);
 
-	BUG_ON(cmd->request->tag < 0);
-
 	dev = cmd->device->hostdata;
 	if (!dev) {
 		cmd->result = DID_NO_CONNECT << 16;
@@ -5830,7 +5831,7 @@ static int hpsa_scsi_host_alloc(struct ctlr_info *h)
 	sh->hostdata[0] = (unsigned long) h;
 	sh->irq = pci_irq_vector(h->pdev, 0);
 	sh->unique_id = sh->irq;
-
+	sh->nr_hw_queues = h->msix_vectors > 0 ? h->msix_vectors : 1;
 	h->scsi_host = sh;
 	return 0;
 }
@@ -5856,7 +5857,8 @@ static int hpsa_scsi_add_host(struct ctlr_info *h)
  */
 static int hpsa_get_cmd_index(struct scsi_cmnd *scmd)
 {
-	int idx = scmd->request->tag;
+	u32 blk_tag = blk_mq_unique_tag(scmd->request);
+	int idx = blk_mq_unique_tag_to_tag(blk_tag);
 
 	if (idx < 0)
 		return idx;
@@ -7456,26 +7458,6 @@ static void hpsa_disable_interrupt_mode(struct ctlr_info *h)
 	h->msix_vectors = 0;
 }
 
-static void hpsa_setup_reply_map(struct ctlr_info *h)
-{
-	const struct cpumask *mask;
-	unsigned int queue, cpu;
-
-	for (queue = 0; queue < h->msix_vectors; queue++) {
-		mask = pci_irq_get_affinity(h->pdev, queue);
-		if (!mask)
-			goto fallback;
-
-		for_each_cpu(cpu, mask)
-			h->reply_map[cpu] = queue;
-	}
-	return;
-
-fallback:
-	for_each_possible_cpu(cpu)
-		h->reply_map[cpu] = 0;
-}
-
 /* If MSI/MSI-X is supported by the kernel we will try to enable it on
  * controllers that are capable. If not, we use legacy INTx mode.
  */
@@ -7872,9 +7854,6 @@ static int hpsa_pci_init(struct ctlr_info *h)
 	if (err)
 		goto clean1;
 
-	/* setup mapping between CPU and reply queue */
-	hpsa_setup_reply_map(h);
-
 	err = hpsa_pci_find_memory_BAR(h->pdev, &h->paddr);
 	if (err)
 		goto clean2;	/* intmode+region, pci */
@@ -8613,7 +8592,6 @@ static struct workqueue_struct *hpsa_create_controller_wq(struct ctlr_info *h,
 
 static void hpda_free_ctlr_info(struct ctlr_info *h)
 {
-	kfree(h->reply_map);
 	kfree(h);
 }
 
@@ -8622,14 +8600,6 @@ static struct ctlr_info *hpda_alloc_ctlr_info(void)
 	struct ctlr_info *h;
 
 	h = kzalloc(sizeof(*h), GFP_KERNEL);
-	if (!h)
-		return NULL;
-
-	h->reply_map = kcalloc(nr_cpu_ids, sizeof(*h->reply_map), GFP_KERNEL);
-	if (!h->reply_map) {
-		kfree(h);
-		return NULL;
-	}
 	return h;
 }
 
diff --git a/drivers/scsi/hpsa.h b/drivers/scsi/hpsa.h
index f8c88fc7b80a..ea4a609e3eb7 100644
--- a/drivers/scsi/hpsa.h
+++ b/drivers/scsi/hpsa.h
@@ -161,7 +161,6 @@ struct bmic_controller_parameters {
 #pragma pack()
 
 struct ctlr_info {
-	unsigned int *reply_map;
 	int	ctlr;
 	char	devname[8];
 	char    *product_name;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-10 17:29 ` [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth() John Garry
@ 2020-06-11  2:57   ` Ming Lei
  2020-06-11  8:26     ` John Garry
  0 siblings, 1 reply; 45+ messages in thread
From: Ming Lei @ 2020-06-11  2:57 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, Hannes Reinecke

On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
> From: Hannes Reinecke <hare@suse.de>
> 
> The function does not set the depth, but rather transitions from
> shared to non-shared queues and vice versa.
> So rename it to blk_mq_update_tag_set_shared() to better reflect
> its purpose.

It is fine to rename it for me, however:

This patch claims to rename blk_mq_update_tag_set_depth(), but it also
changes blk_mq_init_bitmap_tags()'s signature.

So I suggest splitting this patch into two, or noting the
blk_mq_init_bitmap_tags() change in the commit log.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (11 preceding siblings ...)
  2020-06-10 17:29 ` [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ John Garry
@ 2020-06-11  3:07 ` Ming Lei
  2020-06-11  9:35   ` John Garry
  12 siblings, 1 reply; 45+ messages in thread
From: Ming Lei @ 2020-06-11  3:07 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On Thu, Jun 11, 2020 at 01:29:07AM +0800, John Garry wrote:
> Hi all,
> 
> Here is v7 of the patchset.
> 
> In this version of the series, we keep the shared sbitmap for driver tags,
> and introduce changes to fix up the tag budgeting across request queues (
> and associated debugfs changes).
> 
> Some performance figures:
> 
> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
> but it is not always an appropriate scheduler to use.
> 
> Tag depth 		4000 (default)			260**
> 
> Baseline:
> none sched:		2290K IOPS			894K
> mq-deadline sched:	2341K IOPS			2313K
> 
> Final, host_tagset=0 in LLDD*
> none sched:		2289K IOPS			703K
> mq-deadline sched:	2337K IOPS			2291K
> 
> Final:
> none sched:		2281K IOPS			1101K
> mq-deadline sched:	2322K IOPS			1278K
> 
> * this is relevant as this is the performance in supporting but not
>   enabling the feature
> ** depth=260 is relevant as some point where we are regularly waiting for
>    tags to be available. Figures were a bit unstable here for testing.
> 
> A copy of the patches can be found here:
> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7
> 
> And to progress this series, we need the following to go in first, when ready:
> https://lore.kernel.org/linux-scsi/20200430131904.5847-1-hare@suse.de/

I'd suggest adding options to enable shared tags for null_blk & scsi_debug in V8, so
that it is easier to verify the changes without real hardware.
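
For null_blk that could be as simple as a module parameter setting the new
flag when the tag set is initialised (untested sketch; the parameter name
is made up here):

static bool g_host_tags;
module_param_named(host_tags, g_host_tags, bool, 0444);
MODULE_PARM_DESC(host_tags, "Share one hostwide tag set over all hw queues. Default: false");

/* ... and wherever null_blk initialises tag_set->flags: */
	if (g_host_tags)
		set->flags |= BLK_MQ_F_TAG_HCTX_SHARED;

scsi_debug could get a similar parameter that just sets the new
'host_tagset' template flag.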

Thanks,
Ming


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset
  2020-06-10 17:29 ` [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset John Garry
@ 2020-06-11  3:37   ` Ming Lei
  2020-06-11 10:09     ` John Garry
  0 siblings, 1 reply; 45+ messages in thread
From: Ming Lei @ 2020-06-11  3:37 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On Thu, Jun 11, 2020 at 01:29:11AM +0800, John Garry wrote:
> Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
> multiple reply queues with single hostwide tags.
> 
> In addition, these drivers want to use interrupt assignment in
> pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
> CPU hotplug may cause in-flight IO completion to not be serviced when an
> interrupt is shutdown. That problem is solved in commit bf0beec0607d
> ("blk-mq: drain I/O when all CPUs in a hctx are offline").
> 
> However, to take advantage of that blk-mq feature, the HBA HW queues are
> required to be mapped to the blk-mq hctx's; to do that, the HBA HW queues
> need to be exposed to the upper layer.
> 
> In making that transition, the per-SCSI command request tags are no
> longer unique per Scsi host - they are just unique per hctx. As such, the
> HBA LLDD would have to generate this tag internally, which has a certain
> performance overhead.
> 
> However another problem is that blk-mq assumes the host may accept
> (Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e092ef ("scsi:
>  core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
> counter was removed, which would stop the LLDD being sent more than
> .can_queue commands; however, it should still be ensured that the block
> layer does not issue more than .can_queue commands to the Scsi host.
> 
> To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
> which may be requested at init time.
> 
> New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
> tagset to indicate whether the shared sbitmap should be used.
> 
> Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
> are still allocated per hctx; the reason for this is that if tags and
> requests were only allocated for a single hctx - like hctx0 - it may break
> block drivers which expect a request be associated with a specific hctx,
> i.e. not always hctx0. This will introduce extra memory usage.
> 
> This change is based on work originally from Ming Lei in [1] and from
> Bart's suggestion in [2].
> 
> [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> [1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
> [2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
> 
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  block/blk-mq-tag.c     | 39 +++++++++++++++++++++++++++++++++++++--
>  block/blk-mq-tag.h     | 10 +++++++++-
>  block/blk-mq.c         | 24 +++++++++++++++++++++++-
>  block/blk-mq.h         |  5 +++++
>  include/linux/blk-mq.h |  6 ++++++
>  5 files changed, 80 insertions(+), 4 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index be39db3c88d7..92843e3e1a2a 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -228,7 +228,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  	 * We can hit rq == NULL here, because the tagging functions
>  	 * test and set the bit before assigning ->rqs[].
>  	 */
> -	if (rq && rq->q == hctx->queue)
> +	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
>  		return iter_data->fn(hctx, rq, iter_data->data, reserved);
>  	return true;
>  }
> @@ -466,6 +466,7 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
>  		     round_robin, node))
>  		goto free_bitmap_tags;
>  
> +	/* We later overwrite these in case of per-set shared sbitmap */
>  	tags->bitmap_tags = &tags->__bitmap_tags;
>  	tags->breserved_tags = &tags->__breserved_tags;

You may skip allocating anything for the blk_mq_is_sbitmap_shared() case, and
make a similar change for blk_mq_free_tags().
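
For example, a rough and untested sketch (the extra 'shared' parameter is
just illustrative of however the shared/non-shared information actually
gets threaded through):

static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
				   int node, int alloc_policy,
				   bool shared)
{
	unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;

	if (shared)
		return 0;	/* the per-set sbitmap is used instead */

	if (bt_alloc(&tags->__bitmap_tags, depth, round_robin, node))
		return -ENOMEM;
	if (bt_alloc(&tags->__breserved_tags, tags->nr_reserved_tags,
		     round_robin, node))
		goto free_bitmap_tags;

	tags->bitmap_tags = &tags->__bitmap_tags;
	tags->breserved_tags = &tags->__breserved_tags;
	return 0;

free_bitmap_tags:
	sbitmap_queue_free(&tags->__bitmap_tags);
	return -ENOMEM;
}

blk_mq_free_tags() would then need the matching check so that it only frees
what was actually allocated.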

>  
> @@ -475,7 +476,32 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
>  	return -ENOMEM;
>  }
>  
> -struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
> +bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set)
> +{
> +	unsigned int depth = tag_set->queue_depth - tag_set->reserved_tags;
> +	int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(tag_set->flags);
> +	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
> +	int node = tag_set->numa_node;
> +
> +	if (bt_alloc(&tag_set->__bitmap_tags, depth, round_robin, node))
> +		return false;
> +	if (bt_alloc(&tag_set->__breserved_tags, tag_set->reserved_tags,
> +		     round_robin, node))
> +		goto free_bitmap_tags;
> +	return true;
> +free_bitmap_tags:
> +	sbitmap_queue_free(&tag_set->__bitmap_tags);
> +	return false;
> +}
> +
> +void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *tag_set)
> +{
> +	sbitmap_queue_free(&tag_set->__bitmap_tags);
> +	sbitmap_queue_free(&tag_set->__breserved_tags);
> +}
> +
> +struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
> +				     unsigned int total_tags,
>  				     unsigned int reserved_tags,
>  				     int node, int alloc_policy)
>  {
> @@ -502,6 +528,10 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
>  
>  void blk_mq_free_tags(struct blk_mq_tags *tags)
>  {
> +	/*
> +	 * Do not free tags->{bitmap, breserved}_tags, as this may point to
> +	 * shared sbitmap
> +	 */
>  	sbitmap_queue_free(&tags->__bitmap_tags);
>  	sbitmap_queue_free(&tags->__breserved_tags);
>  	kfree(tags);
> @@ -560,6 +590,11 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>  	return 0;
>  }
>  
> +void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set, unsigned int size)
> +{
> +	sbitmap_queue_resize(&set->__bitmap_tags, size - set->reserved_tags);
> +}
> +
>  /**
>   * blk_mq_unique_tag() - return a tag that is unique queue-wide
>   * @rq: request for which to compute a unique tag
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index cebf7a4b280a..cf39dd13a24d 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -25,7 +25,12 @@ struct blk_mq_tags {
>  };
>  
>  
> -extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
> +extern bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set);
> +extern void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *tag_set);
> +extern struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *tag_set,
> +					    unsigned int nr_tags,
> +					    unsigned int reserved_tags,
> +					    int node, int alloc_policy);
>  extern void blk_mq_free_tags(struct blk_mq_tags *tags);
>  
>  extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
> @@ -34,6 +39,9 @@ extern void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
>  extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>  					struct blk_mq_tags **tags,
>  					unsigned int depth, bool can_grow);
> +extern void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set,
> +					     unsigned int size);
> +
>  extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
>  void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>  		void *priv);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 90b645c3092c..77120dd4e4d5 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2229,7 +2229,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
>  	if (node == NUMA_NO_NODE)
>  		node = set->numa_node;
>  
> -	tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
> +	tags = blk_mq_init_tags(set, nr_tags, reserved_tags, node,
>  				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
>  	if (!tags)
>  		return NULL;
> @@ -3349,11 +3349,28 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>  	if (ret)
>  		goto out_free_mq_map;
>  
> +	if (blk_mq_is_sbitmap_shared(set)) {
> +		if (!blk_mq_init_shared_sbitmap(set)) {
> +			ret = -ENOMEM;
> +			goto out_free_mq_rq_maps;
> +		}
> +
> +		for (i = 0; i < set->nr_hw_queues; i++) {
> +			struct blk_mq_tags *tags = set->tags[i];
> +
> +			tags->bitmap_tags = &set->__bitmap_tags;
> +			tags->breserved_tags = &set->__breserved_tags;
> +		}

I am wondering why you don't put ->[bitmap|breserved]_tags initialization into
blk_mq_init_shared_sbitmap().
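
i.e. something roughly like this (untested; it just folds the loop from
blk_mq_alloc_tag_set() into the helper):

bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set)
{
	unsigned int depth = tag_set->queue_depth - tag_set->reserved_tags;
	int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(tag_set->flags);
	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
	int i, node = tag_set->numa_node;

	if (bt_alloc(&tag_set->__bitmap_tags, depth, round_robin, node))
		return false;
	if (bt_alloc(&tag_set->__breserved_tags, tag_set->reserved_tags,
		     round_robin, node))
		goto free_bitmap_tags;

	/* Point each hctx's tags at the per-set sbitmap */
	for (i = 0; i < tag_set->nr_hw_queues; i++) {
		struct blk_mq_tags *tags = tag_set->tags[i];

		tags->bitmap_tags = &tag_set->__bitmap_tags;
		tags->breserved_tags = &tag_set->__breserved_tags;
	}
	return true;

free_bitmap_tags:
	sbitmap_queue_free(&tag_set->__bitmap_tags);
	return false;
}

blk_mq_alloc_tag_set() would then only be left with the error handling.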


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap
  2020-06-10 17:29 ` [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap John Garry
@ 2020-06-11  4:04   ` Ming Lei
  2020-06-11 10:22     ` John Garry
  0 siblings, 1 reply; 45+ messages in thread
From: Ming Lei @ 2020-06-11  4:04 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On Thu, Jun 11, 2020 at 01:29:12AM +0800, John Garry wrote:
> The per-hctx nr_active value can no longer be used to fairly assign a share
> of tag depth per request queue for when using a shared sbitmap, as it does
> not consider that the tags are shared tags over all hctx's.
> 
> For this case, record the nr_active_requests per request_queue, and make
> the judgment based on that value.
> 
> Also introduce a debugfs version of per-hctx blk_mq_debugfs_attr, omitting
> hctx_active_show() (as blk_mq_hw_ctx.nr_active is no longer maintained for
> the case of shared sbitmap) and other entries which we can add which would
> be revised specifically for when using a shared sbitmap.
> 
> Co-developed-with: Kashyap Desai <kashyap.desai@broadcom.com>
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  block/blk-core.c       |  2 ++
>  block/blk-mq-debugfs.c | 23 ++++++++++++++++++++++-
>  block/blk-mq-tag.c     | 10 ++++++----
>  block/blk-mq.c         |  6 +++---
>  block/blk-mq.h         | 28 +++++++++++++++++++++++++++-
>  include/linux/blkdev.h |  2 ++
>  6 files changed, 62 insertions(+), 9 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 03252af8c82c..c622453c1363 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -529,6 +529,8 @@ struct request_queue *__blk_alloc_queue(int node_id)
>  	q->backing_dev_info->capabilities = BDI_CAP_CGROUP_WRITEBACK;
>  	q->node = node_id;
>  
> +	atomic_set(&q->nr_active_requests_shared_sbitmap, 0);
> +
>  	timer_setup(&q->backing_dev_info->laptop_mode_wb_timer,
>  		    laptop_mode_timer_fn, 0);
>  	timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index a400b6698dff..0fa3af41ab65 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -796,6 +796,23 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
>  	{},
>  };
>  
> +static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs[] = {
> +	{"state", 0400, hctx_state_show},
> +	{"flags", 0400, hctx_flags_show},
> +	{"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
> +	{"busy", 0400, hctx_busy_show},
> +	{"ctx_map", 0400, hctx_ctx_map_show},
> +	{"sched_tags", 0400, hctx_sched_tags_show},
> +	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
> +	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
> +	{"dispatched", 0600, hctx_dispatched_show, hctx_dispatched_write},
> +	{"queued", 0600, hctx_queued_show, hctx_queued_write},
> +	{"run", 0600, hctx_run_show, hctx_run_write},
> +	{"active", 0400, hctx_active_show},
> +	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
> +	{}
> +};

You may use a macro or whatever to avoid such duplication.
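
Something like this might work (untested sketch; the macro name is made up,
and the common entries just mirror the list above):

#define BLK_MQ_DEBUGFS_HCTX_COMMON_ATTRS				\
	{"state", 0400, hctx_state_show},				\
	{"flags", 0400, hctx_flags_show},				\
	{"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},		\
	{"busy", 0400, hctx_busy_show},					\
	{"ctx_map", 0400, hctx_ctx_map_show},				\
	{"sched_tags", 0400, hctx_sched_tags_show},			\
	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},	\
	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},	\
	{"dispatched", 0600, hctx_dispatched_show, hctx_dispatched_write}, \
	{"queued", 0600, hctx_queued_show, hctx_queued_write},		\
	{"run", 0600, hctx_run_show, hctx_run_write},			\
	{"active", 0400, hctx_active_show},				\
	{"dispatch_busy", 0400, hctx_dispatch_busy_show}

static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_attrs[] = {
	BLK_MQ_DEBUGFS_HCTX_COMMON_ATTRS,
	{"tags", 0400, hctx_tags_show},
	{"tags_bitmap", 0400, hctx_tags_bitmap_show},
	{},
};

static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs[] = {
	BLK_MQ_DEBUGFS_HCTX_COMMON_ATTRS,
	{},
};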

> +
>  static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
>  	{"default_rq_list", 0400, .seq_ops = &ctx_default_rq_list_seq_ops},
>  	{"read_rq_list", 0400, .seq_ops = &ctx_read_rq_list_seq_ops},
> @@ -878,13 +895,17 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
>  				  struct blk_mq_hw_ctx *hctx)
>  {
>  	struct blk_mq_ctx *ctx;
> +	struct blk_mq_tag_set *set = q->tag_set;
>  	char name[20];
>  	int i;
>  
>  	snprintf(name, sizeof(name), "hctx%u", hctx->queue_num);
>  	hctx->debugfs_dir = debugfs_create_dir(name, q->debugfs_dir);
>  
> -	debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs);
> +	if (blk_mq_is_sbitmap_shared(set))
> +		debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_shared_sbitmap_attrs);
> +	else
> +		debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs);
>  
>  	hctx_for_each_ctx(hctx, ctx, i)
>  		blk_mq_debugfs_register_ctx(hctx, ctx);
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 92843e3e1a2a..7db16e49f6f6 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -60,9 +60,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>   * For shared tag users, we track the number of currently active users
>   * and attempt to provide a fair share of the tag depth for each of them.
>   */
> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> +static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
>  				  struct sbitmap_queue *bt)
>  {
> +	struct blk_mq_hw_ctx *hctx = data->hctx;
> +	struct request_queue *q = data->q;
>  	unsigned int depth, users;
>  
>  	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
> @@ -84,15 +86,15 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
>  	 * Allow at least some tags
>  	 */
>  	depth = max((bt->sb.depth + users - 1) / users, 4U);
> -	return atomic_read(&hctx->nr_active) < depth;
> +	return __blk_mq_active_requests(hctx, q) < depth;

There is a big change to 'users' too:

	users = atomic_read(&hctx->tags->active_queues);

Originally there was a single hctx->tags for these HBAs; now there are many
hctx->tags, so 'users' may become much smaller than before.

Maybe '->active_queues' can be moved to tag_set for blk_mq_is_sbitmap_shared().
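
e.g. a rough, untested sketch (the per-set field name is only a guess at
what patch 6/12 adds):

static inline unsigned int blk_mq_active_queues(struct blk_mq_hw_ctx *hctx)
{
	struct blk_mq_tag_set *set = hctx->queue->tag_set;

	if (blk_mq_is_sbitmap_shared(set))
		return atomic_read(&set->active_queues_shared_sbitmap);
	return atomic_read(&hctx->tags->active_queues);
}

hctx_may_queue() could then derive 'users' from that helper, so the
fair-share depth is computed against all queues of the host again.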

>  }
>  
>  static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
>  			    struct sbitmap_queue *bt)
>  {
>  	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
> -	    !hctx_may_queue(data->hctx, bt))
> -		return BLK_MQ_NO_TAG;
> +	    !hctx_may_queue(data, bt))
> +		return -1;

BLK_MQ_NO_TAG should have been returned.

>  	if (data->shallow_depth)
>  		return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
>  	else
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 77120dd4e4d5..0f7e062a1665 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -283,7 +283,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>  	} else {
>  		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
>  			rq_flags = RQF_MQ_INFLIGHT;
> -			atomic_inc(&data->hctx->nr_active);
> +			__blk_mq_inc_active_requests(data->hctx, data->q);
>  		}
>  		rq->tag = tag;
>  		rq->internal_tag = BLK_MQ_NO_TAG;
> @@ -527,7 +527,7 @@ void blk_mq_free_request(struct request *rq)
>  
>  	ctx->rq_completed[rq_is_sync(rq)]++;
>  	if (rq->rq_flags & RQF_MQ_INFLIGHT)
> -		atomic_dec(&hctx->nr_active);
> +		__blk_mq_dec_active_requests(hctx, q);
>  
>  	if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
>  		laptop_io_completion(q->backing_dev_info);
> @@ -1073,7 +1073,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
>  	if (rq->tag >= 0) {
>  		if (shared) {
>  			rq->rq_flags |= RQF_MQ_INFLIGHT;
> -			atomic_inc(&data.hctx->nr_active);
> +			__blk_mq_inc_active_requests(rq->mq_hctx, rq->q);
>  		}
>  		data.hctx->tags->rqs[rq->tag] = rq;
>  	}
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 1a283c707215..9c1e612c2298 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -202,6 +202,32 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
>  	return true;
>  }
>  
> +static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx,
> +						struct request_queue *q)
> +{
> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
> +		atomic_inc(&q->nr_active_requests_shared_sbitmap);
> +	else
> +		atomic_inc(&hctx->nr_active);
> +}
> +
> +static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx,
> +						struct request_queue *q)
> +{
> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
> +		atomic_dec(&q->nr_active_requests_shared_sbitmap);
> +	else
> +		atomic_dec(&hctx->nr_active);
> +}
> +
> +static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx,
> +					   struct request_queue *q)
> +{
> +	if (blk_mq_is_sbitmap_shared(q->tag_set))

I'd suggest adding an hctx version of blk_mq_is_sbitmap_shared(), since
q->tag_set is seldom used in the fast path, and hctx->flags is more
efficient than tag_set->flags.
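
e.g. something along these lines (untested; helper name made up), which
would also let the request_queue argument be dropped again:

static inline bool blk_mq_hctx_is_sbitmap_shared(struct blk_mq_hw_ctx *hctx)
{
	return hctx->flags & BLK_MQ_F_TAG_HCTX_SHARED;
}

static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx)
{
	if (blk_mq_hctx_is_sbitmap_shared(hctx))
		atomic_inc(&hctx->queue->nr_active_requests_shared_sbitmap);
	else
		atomic_inc(&hctx->nr_active);
}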


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-11  2:57   ` Ming Lei
@ 2020-06-11  8:26     ` John Garry
  2020-06-23 11:25       ` John Garry
  0 siblings, 1 reply; 45+ messages in thread
From: John Garry @ 2020-06-11  8:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl, Hannes Reinecke

On 11/06/2020 03:57, Ming Lei wrote:
> On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
>> From: Hannes Reinecke <hare@suse.de>
>>
>> The function does not set the depth, but rather transitions from
>> shared to non-shared queues and vice versa.
>> So rename it to blk_mq_update_tag_set_shared() to better reflect
>> its purpose.
> 
> It is fine to rename it for me, however:
> 
> This patch claims to rename blk_mq_update_tag_set_shared(), but also
> change blk_mq_init_bitmap_tags's signature.

I was going to update the commit message here, but forgot again...

> 
> So suggest to split this patch into two or add comment log on changing
> blk_mq_init_bitmap_tags().

I think I'll just split into 2x commits.

Thanks,
John

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-11  3:07 ` [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs Ming Lei
@ 2020-06-11  9:35   ` John Garry
  2020-06-12 18:47     ` Kashyap Desai
  0 siblings, 1 reply; 45+ messages in thread
From: John Garry @ 2020-06-11  9:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 11/06/2020 04:07, Ming Lei wrote:
>> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
>> but it is not always an appropriate scheduler to use.
>>
>> Tag depth 		4000 (default)			260**
>>
>> Baseline:
>> none sched:		2290K IOPS			894K
>> mq-deadline sched:	2341K IOPS			2313K
>>
>> Final, host_tagset=0 in LLDD*
>> none sched:		2289K IOPS			703K
>> mq-deadline sched:	2337K IOPS			2291K
>>
>> Final:
>> none sched:		2281K IOPS			1101K
>> mq-deadline sched:	2322K IOPS			1278K
>>
>> * this is relevant as this is the performance in supporting but not
>>    enabling the feature
>> ** depth=260 is relevant as some point where we are regularly waiting for
>>     tags to be available. Figures were are a bit unstable here for testing.
>>
>> A copy of the patches can be found here:
>> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7
>>
>> And to progress this series, we the the following to go in first, when ready:
>> https://lore.kernel.org/linux-scsi/20200430131904.5847-1-hare@suse.de/
> I'd suggest to add options to enable shared tags for null_blk & scsi_debug in V8, so
> that it is easier to verify the changes without real hardware.
> 

ok, fine, I can look at including that. To stop the series getting too 
large, I might spin off the early patches, which are not strictly related.

Thanks,
John

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset
  2020-06-11  3:37   ` Ming Lei
@ 2020-06-11 10:09     ` John Garry
  0 siblings, 0 replies; 45+ messages in thread
From: John Garry @ 2020-06-11 10:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 11/06/2020 04:37, Ming Lei wrote:

Hi Ming,

Thanks for checking this.

>> bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>   	 * We can hit rq == NULL here, because the tagging functions
>>   	 * test and set the bit before assigning ->rqs[].
>>   	 */
>> -	if (rq && rq->q == hctx->queue)
>> +	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
>>   		return iter_data->fn(hctx, rq, iter_data->data, reserved);
>>   	return true;
>>   }
>> @@ -466,6 +466,7 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
>>   		     round_robin, node))
>>   		goto free_bitmap_tags;
>>   
>> +	/* We later overwrite these in case of per-set shared sbitmap */
>>   	tags->bitmap_tags = &tags->__bitmap_tags;
>>   	tags->breserved_tags = &tags->__breserved_tags;
> You may skip to allocate anything for blk_mq_is_sbitmap_shared(), and
> similar change for blk_mq_free_tags().

I did try that, but it breaks scheduler tags allocation - this is common 
code. Maybe I can pass some flag to avoid the allocation for the case of 
shared sbitmap and !sched tags. Same for the free path.

BTW, if you check patch 7/12, I mentioned that we could use this sbitmap 
for iterating to get the per-hctx bitmap, instead of allocating a temp 
sbitmap. Maybe it's better.

> 
>>   
>> @@ -475,7 +476,32 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
>>   	return -ENOMEM;
>>   }
>>   
>> -struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
>> +bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set)
>> +{
>> +	unsigned int depth = tag_set->queue_depth - tag_set->reserved_tags;
>> +	int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(tag_set->flags);
>> +	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
>> +	int node = tag_set->numa_node;
>> +
>> +	if (bt_alloc(&tag_set->__bitmap_tags, depth, round_robin, node))
>> +		return false;
>> +	if (bt_alloc(&tag_set->__breserved_tags, tag_set->reserved_tags,
>> +		     round_robin, node))
>> +		goto free_bitmap_tags;
>> +	return true;
>> +free_bitmap_tags:
>> +	sbitmap_queue_free(&tag_set->__bitmap_tags);
>> +	return false;
>> +}
>> +

[...]

>> index 90b645c3092c..77120dd4e4d5 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -2229,7 +2229,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
>>   	if (node == NUMA_NO_NODE)
>>   		node = set->numa_node;
>>   
>> -	tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
>> +	tags = blk_mq_init_tags(set, nr_tags, reserved_tags, node,
>>   				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
>>   	if (!tags)
>>   		return NULL;
>> @@ -3349,11 +3349,28 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>>   	if (ret)
>>   		goto out_free_mq_map;
>>   
>> +	if (blk_mq_is_sbitmap_shared(set)) {
>> +		if (!blk_mq_init_shared_sbitmap(set)) {
>> +			ret = -ENOMEM;
>> +			goto out_free_mq_rq_maps;
>> +		}
>> +
>> +		for (i = 0; i < set->nr_hw_queues; i++) {
>> +			struct blk_mq_tags *tags = set->tags[i];
>> +
>> +			tags->bitmap_tags = &set->__bitmap_tags;
>> +			tags->breserved_tags = &set->__breserved_tags;
>> +		}
> I am wondering why you don't put ->[bitmap|breserved]_tags initialization into
> blk_mq_init_shared_sbitmap().

I suppose I could.

Thanks,
John

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap
  2020-06-11  4:04   ` Ming Lei
@ 2020-06-11 10:22     ` John Garry
  0 siblings, 0 replies; 45+ messages in thread
From: John Garry @ 2020-06-11 10:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

>> +static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs[] = {
>> +	{"state", 0400, hctx_state_show},
>> +	{"flags", 0400, hctx_flags_show},
>> +	{"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
>> +	{"busy", 0400, hctx_busy_show},
>> +	{"ctx_map", 0400, hctx_ctx_map_show},
>> +	{"sched_tags", 0400, hctx_sched_tags_show},
>> +	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
>> +	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
>> +	{"dispatched", 0600, hctx_dispatched_show, hctx_dispatched_write},
>> +	{"queued", 0600, hctx_queued_show, hctx_queued_write},
>> +	{"run", 0600, hctx_run_show, hctx_run_write},
>> +	{"active", 0400, hctx_active_show},
>> +	{"dispatch_busy", 0400, hctx_dispatch_busy_show},
>> +	{}
>> +};
> 
> You may use macro or whatever to avoid so the duplication.

Let me check alternatives.

> 
>> +
>>   static const struct blk_mq_debugfs_attr blk_mq_debugfs_ctx_attrs[] = {
>>   	{"default_rq_list", 0400, .seq_ops = &ctx_default_rq_list_seq_ops},
>>   	{"read_rq_list", 0400, .seq_ops = &ctx_read_rq_list_seq_ops},
>> @@ -878,13 +895,17 @@ void blk_mq_debugfs_register_hctx(struct request_queue *q,
>>   				  struct blk_mq_hw_ctx *hctx)
>>   {
>>   	struct blk_mq_ctx *ctx;
>> +	struct blk_mq_tag_set *set = q->tag_set;
>>   	char name[20];
>>   	int i;
>>   
>>   	snprintf(name, sizeof(name), "hctx%u", hctx->queue_num);
>>   	hctx->debugfs_dir = debugfs_create_dir(name, q->debugfs_dir);
>>   
>> -	debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs);
>> +	if (blk_mq_is_sbitmap_shared(set))
>> +		debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_shared_sbitmap_attrs);
>> +	else
>> +		debugfs_create_files(hctx->debugfs_dir, hctx, blk_mq_debugfs_hctx_attrs);
>>   
>>   	hctx_for_each_ctx(hctx, ctx, i)
>>   		blk_mq_debugfs_register_ctx(hctx, ctx);
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 92843e3e1a2a..7db16e49f6f6 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -60,9 +60,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>>    * For shared tag users, we track the number of currently active users
>>    * and attempt to provide a fair share of the tag depth for each of them.
>>    */
>> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
>> +static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
>>   				  struct sbitmap_queue *bt)
>>   {
>> +	struct blk_mq_hw_ctx *hctx = data->hctx;
>> +	struct request_queue *q = data->q;
>>   	unsigned int depth, users;
>>   
>>   	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
>> @@ -84,15 +86,15 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
>>   	 * Allow at least some tags
>>   	 */
>>   	depth = max((bt->sb.depth + users - 1) / users, 4U);
>> -	return atomic_read(&hctx->nr_active) < depth;
>> +	return __blk_mq_active_requests(hctx, q) < depth;
> 
> There is big change on 'users' too:
> 
> 	users = atomic_read(&hctx->tags->active_queues);
> 
> Originally there is single hctx->tags for these HBAs, now there are many
> hctx->tags, so 'users' may become much smaller than before.

Can you please check how I handled that in the next patch? There we 
record the number of active request queues per set.

(I will note that I could have combined some of these patches, but I
liked the piecemeal approach, and none of these paths are enabled until
later).

> 
> Maybe '->active_queues' can be moved to tag_set for blk_mq_is_sbitmap_shared().
> 
>>   }
>>   
>>   static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
>>   			    struct sbitmap_queue *bt)
>>   {
>>   	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
>> -	    !hctx_may_queue(data->hctx, bt))
>> -		return BLK_MQ_NO_TAG;
>> +	    !hctx_may_queue(data, bt))
>> +		return -1;
> 
> BLK_MQ_NO_TAG should have been returned.

OK, I missed that in the rebase.

> 
>>   	if (data->shallow_depth)
>>   		return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
>>   	else
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 77120dd4e4d5..0f7e062a1665 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -283,7 +283,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>>   	} else {
>>   		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
>>   			rq_flags = RQF_MQ_INFLIGHT;
>> -			atomic_inc(&data->hctx->nr_active);
>> +			__blk_mq_inc_active_requests(data->hctx, data->q);
>>   		}
>>   		rq->tag = tag;
>>   		rq->internal_tag = BLK_MQ_NO_TAG;
>> @@ -527,7 +527,7 @@ void blk_mq_free_request(struct request *rq)
>>   
>>   	ctx->rq_completed[rq_is_sync(rq)]++;
>>   	if (rq->rq_flags & RQF_MQ_INFLIGHT)
>> -		atomic_dec(&hctx->nr_active);
>> +		__blk_mq_dec_active_requests(hctx, q);
>>   
>>   	if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
>>   		laptop_io_completion(q->backing_dev_info);
>> @@ -1073,7 +1073,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
>>   	if (rq->tag >= 0) {
>>   		if (shared) {
>>   			rq->rq_flags |= RQF_MQ_INFLIGHT;
>> -			atomic_inc(&data.hctx->nr_active);
>> +			__blk_mq_inc_active_requests(rq->mq_hctx, rq->q);
>>   		}
>>   		data.hctx->tags->rqs[rq->tag] = rq;
>>   	}
>> diff --git a/block/blk-mq.h b/block/blk-mq.h
>> index 1a283c707215..9c1e612c2298 100644
>> --- a/block/blk-mq.h
>> +++ b/block/blk-mq.h
>> @@ -202,6 +202,32 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
>>   	return true;
>>   }
>>   
>> +static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx,
>> +						struct request_queue *q)
>> +{
>> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
>> +		atomic_inc(&q->nr_active_requests_shared_sbitmap);
>> +	else
>> +		atomic_inc(&hctx->nr_active);
>> +}
>> +
>> +static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx,
>> +						struct request_queue *q)
>> +{
>> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
>> +		atomic_dec(&q->nr_active_requests_shared_sbitmap);
>> +	else
>> +		atomic_dec(&hctx->nr_active);
>> +}
>> +
>> +static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx,
>> +					   struct request_queue *q)
>> +{
>> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
> 
> I'd suggest to add one hctx version of blk_mq_is_sbitmap_shared() since
> q->tag_set is seldom used in fast path, and hctx->flags is more
> efficient than tag_set->flags.

OK
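
Something like this, I suppose (just a sketch; the flag name here is a
placeholder, and it assumes the shared-sbitmap flag is copied from
set->flags into hctx->flags when the hctx is set up):

static inline bool blk_mq_hctx_is_sbitmap_shared(struct blk_mq_hw_ctx *hctx)
{
	return hctx->flags & BLK_MQ_F_TAG_HCTX_SHARED;
}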

Thanks,
John

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap
  2020-06-10 17:29 ` [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set " John Garry
@ 2020-06-11 13:16   ` Hannes Reinecke
  2020-06-11 14:22     ` John Garry
  0 siblings, 1 reply; 45+ messages in thread
From: Hannes Reinecke @ 2020-06-11 13:16 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	kashyap.desai, sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 6/10/20 7:29 PM, John Garry wrote:
> For when using a shared sbitmap, no longer should the number of active
> request queues per hctx be relied on for when judging how to share the tag
> bitmap.
> 
> Instead maintain the number of active request queues per tag_set, and make
> the judgment based on that.
> 
> And since the blk_mq_tags.active_queues is no longer maintained, do not
> show it in debugfs.
> 
> Originally-from: Kashyap Desai <kashyap.desai@broadcom.com>
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>   block/blk-mq-debugfs.c | 25 ++++++++++++++++++++--
>   block/blk-mq-tag.c     | 47 ++++++++++++++++++++++++++++++++----------
>   block/blk-mq.c         |  2 ++
>   include/linux/blk-mq.h |  1 +
>   include/linux/blkdev.h |  1 +
>   5 files changed, 63 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 0fa3af41ab65..05b4be0c03d9 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -458,17 +458,37 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
>   	}
>   }
>   
> +static void blk_mq_debugfs_tags_shared_sbitmap_show(struct seq_file *m,
> +				     struct blk_mq_tags *tags)
> +{
> +	seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
> +	seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
> +
> +	seq_puts(m, "\nbitmap_tags:\n");
> +	sbitmap_queue_show(tags->bitmap_tags, m);
> +
> +	if (tags->nr_reserved_tags) {
> +		seq_puts(m, "\nbreserved_tags:\n");
> +		sbitmap_queue_show(tags->breserved_tags, m);
> +	}
> +}
> +
>   static int hctx_tags_show(void *data, struct seq_file *m)
>   {
>   	struct blk_mq_hw_ctx *hctx = data;
>   	struct request_queue *q = hctx->queue;
> +	struct blk_mq_tag_set *set = q->tag_set;
>   	int res;
>   
>   	res = mutex_lock_interruptible(&q->sysfs_lock);
>   	if (res)
>   		goto out;
> -	if (hctx->tags)
> -		blk_mq_debugfs_tags_show(m, hctx->tags);
> +	if (hctx->tags) {
> +		if (blk_mq_is_sbitmap_shared(set))
> +			blk_mq_debugfs_tags_shared_sbitmap_show(m, hctx->tags);
> +		else
> +			blk_mq_debugfs_tags_show(m, hctx->tags);
> +	}
>   	mutex_unlock(&q->sysfs_lock);
>   
>   out:
> @@ -802,6 +822,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs
>   	{"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
>   	{"busy", 0400, hctx_busy_show},
>   	{"ctx_map", 0400, hctx_ctx_map_show},
> +	{"tags", 0400, hctx_tags_show},
>   	{"sched_tags", 0400, hctx_sched_tags_show},
>   	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
>   	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},

I had been pondering this, too, when creating v6. Problem is that it'll 
show the tags per hctx, but as they are shared I guess the list looks 
pretty identical per hctx.
So I guess we should filter the tags per hctx to have only those active 
on that hctx displayed. But when doing so we can only print the
in-flight tags; the others are not assigned to a hctx, so we can't tell
which hctx they'll end up on.

> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 7db16e49f6f6..6ca06b1c3a99 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -23,9 +23,19 @@
>    */
>   bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>   {
> -	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
> -	    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -		atomic_inc(&hctx->tags->active_queues);
> +	struct request_queue *q = hctx->queue;
> +	struct blk_mq_tag_set *set = q->tag_set;
> +
> +	if (blk_mq_is_sbitmap_shared(set)){
> +		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) &&
> +		    !test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
> +			atomic_inc(&set->active_queues_shared_sbitmap);
> +
> +	} else {
> +		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
> +		    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +			atomic_inc(&hctx->tags->active_queues);
> +	}
>   
>   	return true;
>   }
At some point someone will need to educate me on what this double
'test_bit' and 'test_and_set_bit' is supposed to achieve.
Other than deliberately injecting a race condition ...

> @@ -47,11 +57,19 @@ void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
>   void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>   {
>   	struct blk_mq_tags *tags = hctx->tags;
> -
> -	if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -		return;
> -
> -	atomic_dec(&tags->active_queues);
> +	struct request_queue *q = hctx->queue;
> +	struct blk_mq_tag_set *set = q->tag_set;
> +
> +	if (blk_mq_is_sbitmap_shared(q->tag_set)){
> +		if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE,
> +					&q->queue_flags))
> +			return;
> +		atomic_dec(&set->active_queues_shared_sbitmap);
> +	} else {
> +		if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +			return;
> +		atomic_dec(&tags->active_queues);
> +	}
>   
>   	blk_mq_tag_wakeup_all(tags, false);
>   }
> @@ -65,12 +83,11 @@ static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
>   {
>   	struct blk_mq_hw_ctx *hctx = data->hctx;
>   	struct request_queue *q = data->q;
> +	struct blk_mq_tag_set *set = q->tag_set;
>   	unsigned int depth, users;
>   
>   	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
>   		return true;
> -	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -		return true;
>   
>   	/*
>   	 * Don't try dividing an ant
> @@ -78,7 +95,15 @@ static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
>   	if (bt->sb.depth == 1)
>   		return true;
>   
> -	users = atomic_read(&hctx->tags->active_queues);
> +	if (blk_mq_is_sbitmap_shared(q->tag_set)) {
> +		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &q->queue_flags))
> +			return true;
> +		users = atomic_read(&set->active_queues_shared_sbitmap);
> +	} else {
> +		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +			return true;
> +		users = atomic_read(&hctx->tags->active_queues);
> +	}
>   	if (!users)
>   		return true;
>   
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 0f7e062a1665..f73a2f9c58bd 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3350,6 +3350,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>   		goto out_free_mq_map;
>   
>   	if (blk_mq_is_sbitmap_shared(set)) {
> +		atomic_set(&set->active_queues_shared_sbitmap, 0);
> +
>   		if (!blk_mq_init_shared_sbitmap(set)) {
>   			ret = -ENOMEM;
>   			goto out_free_mq_rq_maps;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 7b31cdb92a71..66711c7234db 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -252,6 +252,7 @@ struct blk_mq_tag_set {
>   	unsigned int		timeout;
>   	unsigned int		flags;
>   	void			*driver_data;
> +	atomic_t		active_queues_shared_sbitmap;
>   
>   	struct sbitmap_queue	__bitmap_tags;
>   	struct sbitmap_queue	__breserved_tags;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index c536278bec9e..1b0087e8d01a 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -619,6 +619,7 @@ struct request_queue {
>   #define QUEUE_FLAG_PCI_P2PDMA	25	/* device supports PCI p2p requests */
>   #define QUEUE_FLAG_ZONE_RESETALL 26	/* supports Zone Reset All */
>   #define QUEUE_FLAG_RQ_ALLOC_TIME 27	/* record rq->alloc_time_ns */
> +#define QUEUE_FLAG_HCTX_ACTIVE 28	/* at least one blk-mq hctx is active */
>   
>   #define QUEUE_FLAG_MQ_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
>   				 (1 << QUEUE_FLAG_SAME_COMP))
> 
Other than that it looks fine.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  2020-06-10 17:29 ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a " John Garry
@ 2020-06-11 13:19   ` Hannes Reinecke
  2020-06-11 14:33     ` John Garry
  0 siblings, 1 reply; 45+ messages in thread
From: Hannes Reinecke @ 2020-06-11 13:19 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	kashyap.desai, sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 6/10/20 7:29 PM, John Garry wrote:
> Since a set-wide shared tag sbitmap may be used, it is no longer valid to
> examine the per-hctx tagset for getting the active requests for a hctx
> (when a shared sbitmap is used).
> 
> As such, add support for the shared sbitmap by using an intermediate
> sbitmap per hctx, iterating all active tags for the specific hctx in the
> shared sbitmap.
> 
> Originally-by: Bart Van Assche <bvanassche@acm.org>
> Reviewed-by: Hannes Reinecke <hare@suse.de> #earlier version
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>   block/blk-mq-debugfs.c | 62 ++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 62 insertions(+)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 05b4be0c03d9..4da7e54adf3b 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -495,6 +495,67 @@ static int hctx_tags_show(void *data, struct seq_file *m)
>   	return res;
>   }
>   
> +struct hctx_sb_data {
> +	struct sbitmap		*sb;	/* output bitmap */
> +	struct blk_mq_hw_ctx	*hctx;	/* input hctx */
> +};
> +
> +static bool hctx_filter_fn(struct blk_mq_hw_ctx *hctx, struct request *req,
> +			   void *priv, bool reserved)
> +{
> +	struct hctx_sb_data *hctx_sb_data = priv;
> +
> +	if (hctx == hctx_sb_data->hctx)
> +		sbitmap_set_bit(hctx_sb_data->sb, req->tag);
> +
> +	return true;
> +}
> +
> +static void hctx_filter_sb(struct sbitmap *sb, struct blk_mq_hw_ctx *hctx)
> +{
> +	struct hctx_sb_data hctx_sb_data = { .sb = sb, .hctx = hctx };
> +
> +	blk_mq_queue_tag_busy_iter(hctx->queue, hctx_filter_fn, &hctx_sb_data);
> +}
> +
> +static int hctx_tags_shared_sbitmap_bitmap_show(void *data, struct seq_file *m)
> +{
> +	struct blk_mq_hw_ctx *hctx = data;
> +	struct request_queue *q = hctx->queue;
> +	struct blk_mq_tag_set *set = q->tag_set;
> +	struct sbitmap shared_sb, *sb;
> +	int res;
> +
> +	if (!set)
> +		return 0;
> +
> +	/*
> +	 * We could use the allocated sbitmap for that hctx here, but
> +	 * that would mean that we would need to clean it prior to use.
> +	 */
> +	res = sbitmap_init_node(&shared_sb,
> +				set->__bitmap_tags.sb.depth,
> +				set->__bitmap_tags.sb.shift,
> +				GFP_KERNEL, NUMA_NO_NODE);
> +	if (res)
> +		return res;
> +	sb = &shared_sb;
> +
> +	res = mutex_lock_interruptible(&q->sysfs_lock);
> +	if (res)
> +		goto out;
> +	if (hctx->tags) {
> +		hctx_filter_sb(sb, hctx);
> +		sbitmap_bitmap_show(sb, m);
> +	}
> +
> +	mutex_unlock(&q->sysfs_lock);
> +
> +out:
> +	sbitmap_free(&shared_sb);
> +	return res;
> +}
> +
>   static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
>   {
>   	struct blk_mq_hw_ctx *hctx = data;
> @@ -823,6 +884,7 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_hctx_shared_sbitmap_attrs
>   	{"busy", 0400, hctx_busy_show},
>   	{"ctx_map", 0400, hctx_ctx_map_show},
>   	{"tags", 0400, hctx_tags_show},
> +	{"tags_bitmap", 0400, hctx_tags_shared_sbitmap_bitmap_show},
>   	{"sched_tags", 0400, hctx_sched_tags_show},
>   	{"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
>   	{"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
> 
Ah, right. Here it is.

But I don't get it; why do we have to allocate a temporary bitmap and 
can't walk the existing shared sbitmap?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set for when using shared sbitmap
  2020-06-11 13:16   ` Hannes Reinecke
@ 2020-06-11 14:22     ` John Garry
  0 siblings, 0 replies; 45+ messages in thread
From: John Garry @ 2020-06-11 14:22 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, jejb, martin.petersen, don.brace,
	kashyap.desai, sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

>>   out:
>> @@ -802,6 +822,7 @@ static const struct blk_mq_debugfs_attr 
>> blk_mq_debugfs_hctx_shared_sbitmap_attrs
>>       {"dispatch", 0400, .seq_ops = &hctx_dispatch_seq_ops},
>>       {"busy", 0400, hctx_busy_show},
>>       {"ctx_map", 0400, hctx_ctx_map_show},
>> +    {"tags", 0400, hctx_tags_show},
>>       {"sched_tags", 0400, hctx_sched_tags_show},
>>       {"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
>>       {"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
> 
> I had been pondering this, too, when creating v6. Problem is that it'll 
> show the tags per hctx, but as they are shared I guess the list looks 
> pretty identical per hctx.

Right, so my main concern in this series is debugfs, and how to present 
the tag/hctx info. We can hide things under the hood mostly, but not so 
much for debugfs.

> So I guess we should filter the tags per hctx to have only those active 
> on that hctx displayed. But when doing so we can only print the 
> in-flight tags, the others are not assigned to a hctx and as such we 
> can't make a decision on which hctx they'll end up.

So we filter the tags in 7/12, and that should be ok.

But a concern is that we present identical hctx_tags_show() -> 
blk_mq_debugfs_tags_shared_sbitmap_show() -> sbitmap_queue_show() per-hctx
info, which may be inappropriate/misleading/wrong. I need to audit that 
more thoroughly.

> 
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 7db16e49f6f6..6ca06b1c3a99 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -23,9 +23,19 @@
>>    */
>>   bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>>   {
>> -    if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
>> -        !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>> -        atomic_inc(&hctx->tags->active_queues);
>> +    struct request_queue *q = hctx->queue;
>> +    struct blk_mq_tag_set *set = q->tag_set;
>> +
>> +    if (blk_mq_is_sbitmap_shared(set)){
>> +        if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) &&
>> +            !test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
>> +            atomic_inc(&set->active_queues_shared_sbitmap);
>> +
>> +    } else {
>> +        if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
>> +            !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>> +            atomic_inc(&hctx->tags->active_queues);
>> +    }
>>       return true;
>>   }
> At one point someone would need to educate me what this double 
> 'test_bit' and 'test_and_set_bit' is supposed to achieve.
> Other than deliberately injecting a race condition ...

As I see it, it's not a dangerous race, but a performance optimization.

test_bit() is quick compared to test_and_set_bit(), so it is nicer to
avoid always doing a test_and_set_bit().

So if several callers race past the test_bit() check and all attempt the
test_and_set_bit(), only one will succeed and do the increment.
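
To put it schematically (the same hunk as above, just annotated):

	if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) &&	/* plain read, no locked RMW */
	    !test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))	/* atomic RMW, attempted only while the bit looks clear */
		atomic_inc(&set->active_queues_shared_sbitmap);		/* reached by exactly one of the racing setters */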

> 
>> @@ -47,11 +57,19 @@ void blk_mq_tag_wakeup_all(struct blk_mq_tags 
>> *tags, bool include_reserve)
>>   void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>>   {
>>       struct blk_mq_tags *tags = hctx->tags;
>> -
>> -    if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>> -        return;
>> -
>> -    atomic_dec(&tags->active_queues);
>> +    struct request_queue *q = hctx->queue;
>> +    struct blk_mq_tag_set *set = q->tag_set;
>> +
>> +    if (blk_mq_is_sbitmap_shared(q->tag_set)){
>> +        if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE,
>> +                    &q->queue_flags))
>> +            return;
>> +        atomic_dec(&set->active_queues_shared_sbitmap);
>> +    } else {
>> +        if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>> +            return;
>> +        atomic_dec(&tags->active_queues);
>> +    }
>>       blk_mq_tag_wakeup_all(tags, false);
>>   }
>> @@ -65,12 +83,11 @@ static inline bool hctx_may_queue(struct 
>> blk_mq_alloc_data *data,
>>   {
>>       struct blk_mq_hw_ctx *hctx = data->hctx;
>>       struct request_queue *q = data->q;
>> +    struct blk_mq_tag_set *set = q->tag_set;
>>       unsigned int depth, users;
>>       if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
>>           return true;
>> -    if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>> -        return true;
>>       /*
>>        * Don't try dividing an ant
>> @@ -78,7 +95,15 @@ static inline bool hctx_may_queue(struct 
>> blk_mq_alloc_data *data,
>>       if (bt->sb.depth == 1)
>>           return true;
>> -    users = atomic_read(&hctx->tags->active_queues);
>> +    if (blk_mq_is_sbitmap_shared(q->tag_set)) {
>> +        if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &q->queue_flags))
>> +            return true;
>> +        users = atomic_read(&set->active_queues_shared_sbitmap);
>> +    } else {
>> +        if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>> +            return true;
>> +        users = atomic_read(&hctx->tags->active_queues);
>> +    }
>>       if (!users)
>>           return true;
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 0f7e062a1665..f73a2f9c58bd 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -3350,6 +3350,8 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set 
>> *set)
>>           goto out_free_mq_map;
>>       if (blk_mq_is_sbitmap_shared(set)) {
>> +        atomic_set(&set->active_queues_shared_sbitmap, 0);
>> +
>>           if (!blk_mq_init_shared_sbitmap(set)) {
>>               ret = -ENOMEM;
>>               goto out_free_mq_rq_maps;
>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>> index 7b31cdb92a71..66711c7234db 100644
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -252,6 +252,7 @@ struct blk_mq_tag_set {
>>       unsigned int        timeout;
>>       unsigned int        flags;
>>       void            *driver_data;
>> +    atomic_t        active_queues_shared_sbitmap;

I'm not sure if we should present this in debugfs. This series does not
expose it there.

>>       struct sbitmap_queue    __bitmap_tags;
>>       struct sbitmap_queue    __breserved_tags;
>> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
>> index c536278bec9e..1b0087e8d01a 100644
>> --- a/include/linux/blkdev.h
>> +++ b/include/linux/blkdev.h
>> @@ -619,6 +619,7 @@ struct request_queue {
>>   #define QUEUE_FLAG_PCI_P2PDMA    25    /* device supports PCI p2p 
>> requests */
>>   #define QUEUE_FLAG_ZONE_RESETALL 26    /* supports Zone Reset All */
>>   #define QUEUE_FLAG_RQ_ALLOC_TIME 27    /* record rq->alloc_time_ns */
>> +#define QUEUE_FLAG_HCTX_ACTIVE 28    /* at least one blk-mq hctx is 
>> active */
>>   #define QUEUE_FLAG_MQ_DEFAULT    ((1 << QUEUE_FLAG_IO_STAT) |        \
>>                    (1 << QUEUE_FLAG_SAME_COMP))
>>
> Other than that it looks fine.

Cheers

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  2020-06-11 13:19   ` Hannes Reinecke
@ 2020-06-11 14:33     ` John Garry
  2020-06-12  6:06       ` Hannes Reinecke
  0 siblings, 1 reply; 45+ messages in thread
From: John Garry @ 2020-06-11 14:33 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, jejb, martin.petersen, don.brace,
	kashyap.desai, sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

>> +static int hctx_tags_shared_sbitmap_bitmap_show(void *data, struct 
>> seq_file *m)
>> +{
>> +    struct blk_mq_hw_ctx *hctx = data;
>> +    struct request_queue *q = hctx->queue;
>> +    struct blk_mq_tag_set *set = q->tag_set;
>> +    struct sbitmap shared_sb, *sb;
>> +    int res;
>> +
>> +    if (!set)
>> +        return 0;
>> +
>> +    /*
>> +     * We could use the allocated sbitmap for that hctx here, but
>> +     * that would mean that we would need to clean it prior to use.
>> +     */

*

>> +    res = sbitmap_init_node(&shared_sb,
>> +                set->__bitmap_tags.sb.depth,
>> +                set->__bitmap_tags.sb.shift,
>> +                GFP_KERNEL, NUMA_NO_NODE);
>> +    if (res)
>> +        return res;
>> +    sb = &shared_sb;
>> +
>> +    res = mutex_lock_interruptible(&q->sysfs_lock);
>> +    if (res)
>> +        goto out;
>> +    if (hctx->tags) {
>> +        hctx_filter_sb(sb, hctx);
>> +        sbitmap_bitmap_show(sb, m);
>> +    }
>> +
>> +    mutex_unlock(&q->sysfs_lock);
>> +
>> +out:
>> +    sbitmap_free(&shared_sb);
>> +    return res;
>> +}
>> +
>>   static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
>>   {
>>       struct blk_mq_hw_ctx *hctx = data;
>> @@ -823,6 +884,7 @@ static const struct blk_mq_debugfs_attr 
>> blk_mq_debugfs_hctx_shared_sbitmap_attrs
>>       {"busy", 0400, hctx_busy_show},
>>       {"ctx_map", 0400, hctx_ctx_map_show},
>>       {"tags", 0400, hctx_tags_show},
>> +    {"tags_bitmap", 0400, hctx_tags_shared_sbitmap_bitmap_show},
>>       {"sched_tags", 0400, hctx_sched_tags_show},
>>       {"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
>>       {"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
>>
> Ah, right. Here it is.
> 
> But I don't get it; why do we have to allocate a temporary bitmap and 
> can't walk the existing shared sbitmap?

For the bitmap dump - sbitmap_bitmap_show() - it is passed a struct 
sbitmap *. So we have to filter into a temp per-hctx struct sbitmap. We 
could change sbitmap_bitmap_show() to accept a filter iterator - which I 
think you're getting at - but I am not sure it's worth the change. Or 
else use the allocated sbitmap for the hctx, as above*, but I may remove
that (see the 4/12 response).

Cheers,
John

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  2020-06-11 14:33     ` John Garry
@ 2020-06-12  6:06       ` Hannes Reinecke
  2020-06-29 15:32         ` About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap) John Garry
  0 siblings, 1 reply; 45+ messages in thread
From: Hannes Reinecke @ 2020-06-12  6:06 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	kashyap.desai, sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 6/11/20 4:33 PM, John Garry wrote:
>>> +static int hctx_tags_shared_sbitmap_bitmap_show(void *data, struct 
>>> seq_file *m)
>>> +{
>>> +    struct blk_mq_hw_ctx *hctx = data;
>>> +    struct request_queue *q = hctx->queue;
>>> +    struct blk_mq_tag_set *set = q->tag_set;
>>> +    struct sbitmap shared_sb, *sb;
>>> +    int res;
>>> +
>>> +    if (!set)
>>> +        return 0;
>>> +
>>> +    /*
>>> +     * We could use the allocated sbitmap for that hctx here, but
>>> +     * that would mean that we would need to clean it prior to use.
>>> +     */
> 
> *
> 
>>> +    res = sbitmap_init_node(&shared_sb,
>>> +                set->__bitmap_tags.sb.depth,
>>> +                set->__bitmap_tags.sb.shift,
>>> +                GFP_KERNEL, NUMA_NO_NODE);
>>> +    if (res)
>>> +        return res;
>>> +    sb = &shared_sb;
>>> +
>>> +    res = mutex_lock_interruptible(&q->sysfs_lock);
>>> +    if (res)
>>> +        goto out;
>>> +    if (hctx->tags) {
>>> +        hctx_filter_sb(sb, hctx);
>>> +        sbitmap_bitmap_show(sb, m);
>>> +    }
>>> +
>>> +    mutex_unlock(&q->sysfs_lock);
>>> +
>>> +out:
>>> +    sbitmap_free(&shared_sb);
>>> +    return res;
>>> +}
>>> +
>>>   static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
>>>   {
>>>       struct blk_mq_hw_ctx *hctx = data;
>>> @@ -823,6 +884,7 @@ static const struct blk_mq_debugfs_attr 
>>> blk_mq_debugfs_hctx_shared_sbitmap_attrs
>>>       {"busy", 0400, hctx_busy_show},
>>>       {"ctx_map", 0400, hctx_ctx_map_show},
>>>       {"tags", 0400, hctx_tags_show},
>>> +    {"tags_bitmap", 0400, hctx_tags_shared_sbitmap_bitmap_show},
>>>       {"sched_tags", 0400, hctx_sched_tags_show},
>>>       {"sched_tags_bitmap", 0400, hctx_sched_tags_bitmap_show},
>>>       {"io_poll", 0600, hctx_io_poll_show, hctx_io_poll_write},
>>>
>> Ah, right. Here it is.
>>
>> But I don't get it; why do we have to allocate a temporary bitmap and 
>> can't walk the existing shared sbitmap?
> 
> For the bitmap dump - sbitmap_bitmap_show() - it is passed a struct 
> sbitmap *. So we have to filter into a temp per-hctx struct sbitmap. We 
> could change sbitmap_bitmap_show() to accept a filter iterator - which I 
> think you're getting at - but I am not sure it's worth the change. Or 
> else use the allocated sbitmap for the hctx, as above*, but I may be 
> remove that (see 4/12 response).
> 
Yes, I do think I would prefer updating sbitmap_bitmap_show() to accept 
a filter. Especially as Ming Lei has now updated the tag iterators to 
accept a filter, too, so it should be an easy change.
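
Roughly what I have in mind, interface-wise - just a sketch, and the
names here are placeholders rather than anything in the series; the
callback would want to line up with whatever filter the tag iterators
now take:

/* Hypothetical: only bits for which @fn returns true are shown. */
typedef bool (*sbitmap_show_filter_fn)(void *priv, unsigned int bitnr);

void sbitmap_bitmap_show_filtered(struct sbitmap *sb, struct seq_file *m,
				  sbitmap_show_filter_fn fn, void *priv);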

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-11  9:35   ` John Garry
@ 2020-06-12 18:47     ` Kashyap Desai
  2020-06-15  2:13       ` Ming Lei
  0 siblings, 1 reply; 45+ messages in thread
From: Kashyap Desai @ 2020-06-12 18:47 UTC (permalink / raw)
  To: John Garry, Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, Sumit Saxena,
	bvanassche, hare, hch, Shivasharan Srikanteshwara, linux-block,
	linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

> On 11/06/2020 04:07, Ming Lei wrote:
> >> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are
> >> included, but it is not always an appropriate scheduler to use.
> >>
> >> Tag depth 		4000 (default)			260**
> >>
> >> Baseline:
> >> none sched:		2290K IOPS			894K
> >> mq-deadline sched:	2341K IOPS			2313K
> >>
> >> Final, host_tagset=0 in LLDD*
> >> none sched:		2289K IOPS			703K
> >> mq-deadline sched:	2337K IOPS			2291K
> >>
> >> Final:
> >> none sched:		2281K IOPS			1101K
> >> mq-deadline sched:	2322K IOPS			1278K
> >>
> >> * this is relevant as this is the performance in supporting but not
> >>    enabling the feature
> >> ** depth=260 is relevant as some point where we are regularly waiting
> >> for
> >>     tags to be available. Figures were are a bit unstable here for
> >> testing.

John -

I tried the V7 series and debugged further on the mq-deadline interface. This
time I have used another setup, since an HDD-based setup is not readily
available to me.
In fact, I was able to simulate the issue very easily using a single
scsi_device as well. BTW, this is not an issue with this RFC, but a generic
issue. Since I have converted the Broadcom product to nr_hw_queue > 1 using
this RFC, it becomes noticeable now.

Problem - Using the command below, I see heavy CPU utilization in
"native_queued_spin_lock_slowpath". This is because the kblockd work queue is
submitting IO from all the CPUs even though fio is bound to a single CPU.
Lock contention from "dd_dispatch_request" is causing this issue.

numactl -C 13  fio
single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none --numjobs=1
 --cpus_allowed_policy=split --ioscheduler=mq-deadline
--group_reporting --filename=/dev/sdd

While running the above command, ideally we expect only kworker/13 to be
active. But as you can see below, all the CPUs are attempting submission and
a lot of the CPU consumption is due to lock contention.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2726 root       0 -20       0      0      0 R  56.5  0.0   0:53.20
kworker/13:1H-k
 7815 root      20   0  712404  15536   2228 R  43.2  0.0   0:05.03 fio
 2792 root       0 -20       0      0      0 I  26.6  0.0   0:22.19
kworker/18:1H-k
 2791 root       0 -20       0      0      0 I  19.9  0.0   0:17.17
kworker/19:1H-k
 1419 root       0 -20       0      0      0 I  19.6  0.0   0:17.03
kworker/20:1H-k
 2793 root       0 -20       0      0      0 I  18.3  0.0   0:15.64
kworker/21:1H-k
 1424 root       0 -20       0      0      0 I  17.3  0.0   0:14.99
kworker/22:1H-k
 2626 root       0 -20       0      0      0 I  16.9  0.0   0:14.68
kworker/26:1H-k
 2794 root       0 -20       0      0      0 I  16.9  0.0   0:14.87
kworker/23:1H-k
 2795 root       0 -20       0      0      0 I  16.9  0.0   0:14.81
kworker/24:1H-k
 2797 root       0 -20       0      0      0 I  16.9  0.0   0:14.62
kworker/27:1H-k
 1415 root       0 -20       0      0      0 I  16.6  0.0   0:14.44
kworker/30:1H-k
 2669 root       0 -20       0      0      0 I  16.6  0.0   0:14.38
kworker/31:1H-k
 2796 root       0 -20       0      0      0 I  16.6  0.0   0:14.74
kworker/25:1H-k
 2799 root       0 -20       0      0      0 I  16.6  0.0   0:14.56
kworker/28:1H-k
 1425 root       0 -20       0      0      0 I  16.3  0.0   0:14.21
kworker/34:1H-k
 2746 root       0 -20       0      0      0 I  16.3  0.0   0:14.33
kworker/32:1H-k
 2798 root       0 -20       0      0      0 I  16.3  0.0   0:14.50
kworker/29:1H-k
 2800 root       0 -20       0      0      0 I  16.3  0.0   0:14.27
kworker/33:1H-k
 1423 root       0 -20       0      0      0 I  15.9  0.0   0:14.10
kworker/54:1H-k
 1784 root       0 -20       0      0      0 I  15.9  0.0   0:14.03
kworker/55:1H-k
 2801 root       0 -20       0      0      0 I  15.9  0.0   0:14.15
kworker/35:1H-k
 2815 root       0 -20       0      0      0 I  15.9  0.0   0:13.97
kworker/56:1H-k
 1484 root       0 -20       0      0      0 I  15.6  0.0   0:13.90
kworker/57:1H-k
 1485 root       0 -20       0      0      0 I  15.6  0.0   0:13.82
kworker/59:1H-k
 1519 root       0 -20       0      0      0 I  15.6  0.0   0:13.64
kworker/62:1H-k
 2315 root       0 -20       0      0      0 I  15.6  0.0   0:13.87
kworker/58:1H-k
 2627 root       0 -20       0      0      0 I  15.6  0.0   0:13.69
kworker/61:1H-k
 2816 root       0 -20       0      0      0 I  15.6  0.0   0:13.75
kworker/60:1H-k


I root-caused this issue -

The block layer always queues IO on the hctx mapped to CPU-13, but the hw
queue is run from all the hctx contexts.
I noticed in my test that hctx48 has queued all the IOs. No other hctx has
queued IO. But the "run" counter is incremented for all the hctxs.

# cat hctx48/queued
2087058

#cat hctx*/run
151318
30038
83110
50680
69907
60391
111239
18036
33935
91648
34582
22853
61286
19489

The patch below has the fix - "Run the hctx queue for which the request was
completed instead of running all the hardware queues."
If this looks like a valid fix, please include it in V8, OR I can post a
separate patch for it. I just want to have some level of review from this
discussion.

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0652acd..f52118f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -554,6 +554,7 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
        struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
        struct scsi_device *sdev = cmd->device;
        struct request_queue *q = sdev->request_queue;
+       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;

        if (blk_update_request(req, error, bytes))
                return true;
@@ -595,7 +596,8 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
            !list_empty(&sdev->host->starved_list))
                kblockd_schedule_work(&sdev->requeue_work);
        else
-               blk_mq_run_hw_queues(q, true);
+               blk_mq_run_hw_queue(mq_hctx, true);
+               //blk_mq_run_hw_queues(q, true);

        percpu_ref_put(&q->q_usage_counter);
        return false;


After the above patch, only kworker/13 is actively doing submission.

3858 root       0 -20       0      0      0 I  22.9  0.0   3:24.04
kworker/13:1H-k
16768 root      20   0  712008  14968   2180 R  21.6  0.0   0:03.27 fio
16769 root      20   0  712012  14968   2180 R  21.6  0.0   0:03.27 fio

Without the above patch, the driver with 24 SSDs can give 1.5M IOPS; with the
above patch, 3.2M IOPS.

I will continue my testing.

Thanks, Kashyap

> >>
> >> A copy of the patches can be found here:
> >> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-
> >> shared-tags-rfc-v7
> >>
> >> And to progress this series, we the the following to go in first, when
> >> ready:
> >> https://lore.kernel.org/linux-scsi/20200430131904.5847-1-hare@suse.de
> >> /
> > I'd suggest to add options to enable shared tags for null_blk &
> > scsi_debug in V8, so that it is easier to verify the changes without
> > real
> hardware.
> >
>
> ok, fine, I can look at including that. To stop the series getting too
> large, I
> might spin off the early patches, which are not strictly related.
>
> Thanks,
> John

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-12 18:47     ` Kashyap Desai
@ 2020-06-15  2:13       ` Ming Lei
  2020-06-15  6:57         ` Kashyap Desai
  0 siblings, 1 reply; 45+ messages in thread
From: Ming Lei @ 2020-06-15  2:13 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Sat, Jun 13, 2020 at 12:17:37AM +0530, Kashyap Desai wrote:
> > On 11/06/2020 04:07, Ming Lei wrote:
> > >> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are
> > >> included, but it is not always an appropriate scheduler to use.
> > >>
> > >> Tag depth 		4000 (default)			260**
> > >>
> > >> Baseline:
> > >> none sched:		2290K IOPS			894K
> > >> mq-deadline sched:	2341K IOPS			2313K
> > >>
> > >> Final, host_tagset=0 in LLDD*
> > >> none sched:		2289K IOPS			703K
> > >> mq-deadline sched:	2337K IOPS			2291K
> > >>
> > >> Final:
> > >> none sched:		2281K IOPS			1101K
> > >> mq-deadline sched:	2322K IOPS			1278K
> > >>
> > >> * this is relevant as this is the performance in supporting but not
> > >>    enabling the feature
> > >> ** depth=260 is relevant as some point where we are regularly waiting
> > >> for
> > >>     tags to be available. Figures were are a bit unstable here for
> > >> testing.
> 
> John -
> 
> I tried V7 series and debug further on mq-deadline interface. This time I
> have used another setup since HDD based setup is not readily available for
> me.
> In fact, I was able to simulate issue very easily using single scsi_device
> as well. BTW, this is not an issue with this RFC, but generic issue.
> Since I have converted nr_hw_queue > 1 for Broadcom product using this RFC,
> It becomes noticeable now.
> 
> Problem - Using below command  I see heavy CPU utilization on "
> native_queued_spin_lock_slowpath". This is because kblockd work queue is
> submitting IO from all the CPUs even though fio is bound to single CPU.
> Lock contention from " dd_dispatch_request" is causing this issue.
> 
> numactl -C 13  fio
> single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none --numjobs=1
>  --cpus_allowed_policy=split --ioscheduler=mq-deadline
> --group_reporting --filename=/dev/sdd
> 
> While running above command, ideally we expect only kworker/13 to be active.
> But you can see below - All the CPU is attempting submission and lots of CPU
> consumption is due to lock contention.
> 
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  2726 root       0 -20       0      0      0 R  56.5  0.0   0:53.20
> kworker/13:1H-k
>  7815 root      20   0  712404  15536   2228 R  43.2  0.0   0:05.03 fio
>  2792 root       0 -20       0      0      0 I  26.6  0.0   0:22.19
> kworker/18:1H-k
>  2791 root       0 -20       0      0      0 I  19.9  0.0   0:17.17
> kworker/19:1H-k
>  1419 root       0 -20       0      0      0 I  19.6  0.0   0:17.03
> kworker/20:1H-k
>  2793 root       0 -20       0      0      0 I  18.3  0.0   0:15.64
> kworker/21:1H-k
>  1424 root       0 -20       0      0      0 I  17.3  0.0   0:14.99
> kworker/22:1H-k
>  2626 root       0 -20       0      0      0 I  16.9  0.0   0:14.68
> kworker/26:1H-k
>  2794 root       0 -20       0      0      0 I  16.9  0.0   0:14.87
> kworker/23:1H-k
>  2795 root       0 -20       0      0      0 I  16.9  0.0   0:14.81
> kworker/24:1H-k
>  2797 root       0 -20       0      0      0 I  16.9  0.0   0:14.62
> kworker/27:1H-k
>  1415 root       0 -20       0      0      0 I  16.6  0.0   0:14.44
> kworker/30:1H-k
>  2669 root       0 -20       0      0      0 I  16.6  0.0   0:14.38
> kworker/31:1H-k
>  2796 root       0 -20       0      0      0 I  16.6  0.0   0:14.74
> kworker/25:1H-k
>  2799 root       0 -20       0      0      0 I  16.6  0.0   0:14.56
> kworker/28:1H-k
>  1425 root       0 -20       0      0      0 I  16.3  0.0   0:14.21
> kworker/34:1H-k
>  2746 root       0 -20       0      0      0 I  16.3  0.0   0:14.33
> kworker/32:1H-k
>  2798 root       0 -20       0      0      0 I  16.3  0.0   0:14.50
> kworker/29:1H-k
>  2800 root       0 -20       0      0      0 I  16.3  0.0   0:14.27
> kworker/33:1H-k
>  1423 root       0 -20       0      0      0 I  15.9  0.0   0:14.10
> kworker/54:1H-k
>  1784 root       0 -20       0      0      0 I  15.9  0.0   0:14.03
> kworker/55:1H-k
>  2801 root       0 -20       0      0      0 I  15.9  0.0   0:14.15
> kworker/35:1H-k
>  2815 root       0 -20       0      0      0 I  15.9  0.0   0:13.97
> kworker/56:1H-k
>  1484 root       0 -20       0      0      0 I  15.6  0.0   0:13.90
> kworker/57:1H-k
>  1485 root       0 -20       0      0      0 I  15.6  0.0   0:13.82
> kworker/59:1H-k
>  1519 root       0 -20       0      0      0 I  15.6  0.0   0:13.64
> kworker/62:1H-k
>  2315 root       0 -20       0      0      0 I  15.6  0.0   0:13.87
> kworker/58:1H-k
>  2627 root       0 -20       0      0      0 I  15.6  0.0   0:13.69
> kworker/61:1H-k
>  2816 root       0 -20       0      0      0 I  15.6  0.0   0:13.75
> kworker/60:1H-k
> 
> 
> I root cause this issue -
> 
> Block layer always queue IO on hctx context mapped to CPU-13, but hw queue
> run from all the hctx context.
> I noticed in my test hctx48 has queued all the IOs. No other hctx has queued
> IO. But all the hctx is counting for "run".
> 
> # cat hctx48/queued
> 2087058
> 
> #cat hctx*/run
> 151318
> 30038
> 83110
> 50680
> 69907
> 60391
> 111239
> 18036
> 33935
> 91648
> 34582
> 22853
> 61286
> 19489
> 
> Below patch has fix - "Run the hctx queue for which request was completed
> instead of running all the hardware queue."
> If this looks valid fix, please include in V8 OR I can post separate patch
> for this. Just want to have some level of review from this discussion.
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 0652acd..f52118f 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -554,6 +554,7 @@ static bool scsi_end_request(struct request *req,
> blk_status_t error,
>         struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
>         struct scsi_device *sdev = cmd->device;
>         struct request_queue *q = sdev->request_queue;
> +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> 
>         if (blk_update_request(req, error, bytes))
>                 return true;
> @@ -595,7 +596,8 @@ static bool scsi_end_request(struct request *req,
> blk_status_t error,
>             !list_empty(&sdev->host->starved_list))
>                 kblockd_schedule_work(&sdev->requeue_work);
>         else
> -               blk_mq_run_hw_queues(q, true);
> +               blk_mq_run_hw_queue(mq_hctx, true);
> +               //blk_mq_run_hw_queues(q, true);

This way may cause IO hang because ->device_busy is shared by all hctxs.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-15  2:13       ` Ming Lei
@ 2020-06-15  6:57         ` Kashyap Desai
  2020-06-16  1:00           ` Ming Lei
  0 siblings, 1 reply; 45+ messages in thread
From: Kashyap Desai @ 2020-06-15  6:57 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> >
> > John -
> >
> > I tried V7 series and debug further on mq-deadline interface. This
> > time I have used another setup since HDD based setup is not readily
> > available for me.
> > In fact, I was able to simulate issue very easily using single
> > scsi_device as well. BTW, this is not an issue with this RFC, but
generic issue.
> > Since I have converted nr_hw_queue > 1 for Broadcom product using this
> > RFC, It becomes noticeable now.
> >
> > Problem - Using below command  I see heavy CPU utilization on "
> > native_queued_spin_lock_slowpath". This is because kblockd work queue
> > is submitting IO from all the CPUs even though fio is bound to single
CPU.
> > Lock contention from " dd_dispatch_request" is causing this issue.
> >
> > numactl -C 13  fio
> > single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none
> > --numjobs=1  --cpus_allowed_policy=split --ioscheduler=mq-deadline
> > --group_reporting --filename=/dev/sdd
> >
> > While running above command, ideally we expect only kworker/13 to be
> active.
> > But you can see below - All the CPU is attempting submission and lots
> > of CPU consumption is due to lock contention.
> >
> >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
COMMAND
> >  2726 root       0 -20       0      0      0 R  56.5  0.0   0:53.20
> > kworker/13:1H-k
> >  7815 root      20   0  712404  15536   2228 R  43.2  0.0   0:05.03
fio
> >  2792 root       0 -20       0      0      0 I  26.6  0.0   0:22.19
> > kworker/18:1H-k
> >  2791 root       0 -20       0      0      0 I  19.9  0.0   0:17.17
> > kworker/19:1H-k
> >  1419 root       0 -20       0      0      0 I  19.6  0.0   0:17.03
> > kworker/20:1H-k
> >  2793 root       0 -20       0      0      0 I  18.3  0.0   0:15.64
> > kworker/21:1H-k
> >  1424 root       0 -20       0      0      0 I  17.3  0.0   0:14.99
> > kworker/22:1H-k
> >  2626 root       0 -20       0      0      0 I  16.9  0.0   0:14.68
> > kworker/26:1H-k
> >  2794 root       0 -20       0      0      0 I  16.9  0.0   0:14.87
> > kworker/23:1H-k
> >  2795 root       0 -20       0      0      0 I  16.9  0.0   0:14.81
> > kworker/24:1H-k
> >  2797 root       0 -20       0      0      0 I  16.9  0.0   0:14.62
> > kworker/27:1H-k
> >  1415 root       0 -20       0      0      0 I  16.6  0.0   0:14.44
> > kworker/30:1H-k
> >  2669 root       0 -20       0      0      0 I  16.6  0.0   0:14.38
> > kworker/31:1H-k
> >  2796 root       0 -20       0      0      0 I  16.6  0.0   0:14.74
> > kworker/25:1H-k
> >  2799 root       0 -20       0      0      0 I  16.6  0.0   0:14.56
> > kworker/28:1H-k
> >  1425 root       0 -20       0      0      0 I  16.3  0.0   0:14.21
> > kworker/34:1H-k
> >  2746 root       0 -20       0      0      0 I  16.3  0.0   0:14.33
> > kworker/32:1H-k
> >  2798 root       0 -20       0      0      0 I  16.3  0.0   0:14.50
> > kworker/29:1H-k
> >  2800 root       0 -20       0      0      0 I  16.3  0.0   0:14.27
> > kworker/33:1H-k
> >  1423 root       0 -20       0      0      0 I  15.9  0.0   0:14.10
> > kworker/54:1H-k
> >  1784 root       0 -20       0      0      0 I  15.9  0.0   0:14.03
> > kworker/55:1H-k
> >  2801 root       0 -20       0      0      0 I  15.9  0.0   0:14.15
> > kworker/35:1H-k
> >  2815 root       0 -20       0      0      0 I  15.9  0.0   0:13.97
> > kworker/56:1H-k
> >  1484 root       0 -20       0      0      0 I  15.6  0.0   0:13.90
> > kworker/57:1H-k
> >  1485 root       0 -20       0      0      0 I  15.6  0.0   0:13.82
> > kworker/59:1H-k
> >  1519 root       0 -20       0      0      0 I  15.6  0.0   0:13.64
> > kworker/62:1H-k
> >  2315 root       0 -20       0      0      0 I  15.6  0.0   0:13.87
> > kworker/58:1H-k
> >  2627 root       0 -20       0      0      0 I  15.6  0.0   0:13.69
> > kworker/61:1H-k
> >  2816 root       0 -20       0      0      0 I  15.6  0.0   0:13.75
> > kworker/60:1H-k
> >
> >
> > I root cause this issue -
> >
> > Block layer always queue IO on hctx context mapped to CPU-13, but hw
> > queue run from all the hctx context.
> > I noticed in my test hctx48 has queued all the IOs. No other hctx has
> > queued IO. But all the hctx is counting for "run".
> >
> > # cat hctx48/queued
> > 2087058
> >
> > #cat hctx*/run
> > 151318
> > 30038
> > 83110
> > 50680
> > 69907
> > 60391
> > 111239
> > 18036
> > 33935
> > 91648
> > 34582
> > 22853
> > 61286
> > 19489
> >
> > Below patch has fix - "Run the hctx queue for which request was
> > completed instead of running all the hardware queue."
> > If this looks valid fix, please include in V8 OR I can post separate
> > patch for this. Just want to have some level of review from this
discussion.
> >
> > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index
> > 0652acd..f52118f 100644
> > --- a/drivers/scsi/scsi_lib.c
> > +++ b/drivers/scsi/scsi_lib.c
> > @@ -554,6 +554,7 @@ static bool scsi_end_request(struct request *req,
> > blk_status_t error,
> >         struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> >         struct scsi_device *sdev = cmd->device;
> >         struct request_queue *q = sdev->request_queue;
> > +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> >
> >         if (blk_update_request(req, error, bytes))
> >                 return true;
> > @@ -595,7 +596,8 @@ static bool scsi_end_request(struct request *req,
> > blk_status_t error,
> >             !list_empty(&sdev->host->starved_list))
> >                 kblockd_schedule_work(&sdev->requeue_work);
> >         else
> > -               blk_mq_run_hw_queues(q, true);
> > +               blk_mq_run_hw_queue(mq_hctx, true);
> > +               //blk_mq_run_hw_queues(q, true);
>
> This way may cause IO hang because ->device_busy is shared by all hctxs.

From the SCSI stack, if we attempt to run all the h/w queues, isn't it
possible that the block layer actually runs an hw_queue which has not queued
any IO?
Currently, in the case of mq-deadline, IOs are inserted using
"dd_insert_request". This function adds IOs to the elevator data, which is
per request queue and not per hctx.
When there is an attempt to run an hctx, "blk_mq_sched_has_work" checks for
pending work, which is per request queue and not per hctx.
Because of this, IOs queued on only one hctx will be run from all the hctxs,
and this will create unnecessary lock contention.

How about the patch below?

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021f..1d30bd3 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct elevator_queue *e = hctx->queue->elevator;

+       /* If current hctx has not queued any request, there is no need to run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!hctx->queued)
+               return false;
+
        if (e && e->type->ops.has_work)
                return e->type->ops.has_work(hctx);

Kashyap

>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-15  6:57         ` Kashyap Desai
@ 2020-06-16  1:00           ` Ming Lei
  2020-06-17 11:26             ` Kashyap Desai
  0 siblings, 1 reply; 45+ messages in thread
From: Ming Lei @ 2020-06-16  1:00 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On Mon, Jun 15, 2020 at 12:27:30PM +0530, Kashyap Desai wrote:
> > >
> > > John -
> > >
> > > I tried V7 series and debug further on mq-deadline interface. This
> > > time I have used another setup since HDD based setup is not readily
> > > available for me.
> > > In fact, I was able to simulate issue very easily using single
> > > scsi_device as well. BTW, this is not an issue with this RFC, but
> generic issue.
> > > Since I have converted nr_hw_queue > 1 for Broadcom product using this
> > > RFC, It becomes noticeable now.
> > >
> > > Problem - Using below command  I see heavy CPU utilization on "
> > > native_queued_spin_lock_slowpath". This is because kblockd work queue
> > > is submitting IO from all the CPUs even though fio is bound to single
> CPU.
> > > Lock contention from " dd_dispatch_request" is causing this issue.
> > >
> > > numactl -C 13  fio
> > > single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none
> > > --numjobs=1  --cpus_allowed_policy=split --ioscheduler=mq-deadline
> > > --group_reporting --filename=/dev/sdd
> > >
> > > While running above command, ideally we expect only kworker/13 to be
> > active.
> > > But you can see below - All the CPU is attempting submission and lots
> > > of CPU consumption is due to lock contention.
> > >
> > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
> COMMAND
> > >  2726 root       0 -20       0      0      0 R  56.5  0.0   0:53.20
> > > kworker/13:1H-k
> > >  7815 root      20   0  712404  15536   2228 R  43.2  0.0   0:05.03
> fio
> > >  2792 root       0 -20       0      0      0 I  26.6  0.0   0:22.19
> > > kworker/18:1H-k
> > >  2791 root       0 -20       0      0      0 I  19.9  0.0   0:17.17
> > > kworker/19:1H-k
> > >  1419 root       0 -20       0      0      0 I  19.6  0.0   0:17.03
> > > kworker/20:1H-k
> > >  2793 root       0 -20       0      0      0 I  18.3  0.0   0:15.64
> > > kworker/21:1H-k
> > >  1424 root       0 -20       0      0      0 I  17.3  0.0   0:14.99
> > > kworker/22:1H-k
> > >  2626 root       0 -20       0      0      0 I  16.9  0.0   0:14.68
> > > kworker/26:1H-k
> > >  2794 root       0 -20       0      0      0 I  16.9  0.0   0:14.87
> > > kworker/23:1H-k
> > >  2795 root       0 -20       0      0      0 I  16.9  0.0   0:14.81
> > > kworker/24:1H-k
> > >  2797 root       0 -20       0      0      0 I  16.9  0.0   0:14.62
> > > kworker/27:1H-k
> > >  1415 root       0 -20       0      0      0 I  16.6  0.0   0:14.44
> > > kworker/30:1H-k
> > >  2669 root       0 -20       0      0      0 I  16.6  0.0   0:14.38
> > > kworker/31:1H-k
> > >  2796 root       0 -20       0      0      0 I  16.6  0.0   0:14.74
> > > kworker/25:1H-k
> > >  2799 root       0 -20       0      0      0 I  16.6  0.0   0:14.56
> > > kworker/28:1H-k
> > >  1425 root       0 -20       0      0      0 I  16.3  0.0   0:14.21
> > > kworker/34:1H-k
> > >  2746 root       0 -20       0      0      0 I  16.3  0.0   0:14.33
> > > kworker/32:1H-k
> > >  2798 root       0 -20       0      0      0 I  16.3  0.0   0:14.50
> > > kworker/29:1H-k
> > >  2800 root       0 -20       0      0      0 I  16.3  0.0   0:14.27
> > > kworker/33:1H-k
> > >  1423 root       0 -20       0      0      0 I  15.9  0.0   0:14.10
> > > kworker/54:1H-k
> > >  1784 root       0 -20       0      0      0 I  15.9  0.0   0:14.03
> > > kworker/55:1H-k
> > >  2801 root       0 -20       0      0      0 I  15.9  0.0   0:14.15
> > > kworker/35:1H-k
> > >  2815 root       0 -20       0      0      0 I  15.9  0.0   0:13.97
> > > kworker/56:1H-k
> > >  1484 root       0 -20       0      0      0 I  15.6  0.0   0:13.90
> > > kworker/57:1H-k
> > >  1485 root       0 -20       0      0      0 I  15.6  0.0   0:13.82
> > > kworker/59:1H-k
> > >  1519 root       0 -20       0      0      0 I  15.6  0.0   0:13.64
> > > kworker/62:1H-k
> > >  2315 root       0 -20       0      0      0 I  15.6  0.0   0:13.87
> > > kworker/58:1H-k
> > >  2627 root       0 -20       0      0      0 I  15.6  0.0   0:13.69
> > > kworker/61:1H-k
> > >  2816 root       0 -20       0      0      0 I  15.6  0.0   0:13.75
> > > kworker/60:1H-k
> > >
> > >
> > > I root cause this issue -
> > >
> > > Block layer always queue IO on hctx context mapped to CPU-13, but hw
> > > queue run from all the hctx context.
> > > I noticed in my test hctx48 has queued all the IOs. No other hctx has
> > > queued IO. But all the hctx is counting for "run".
> > >
> > > # cat hctx48/queued
> > > 2087058
> > >
> > > #cat hctx*/run
> > > 151318
> > > 30038
> > > 83110
> > > 50680
> > > 69907
> > > 60391
> > > 111239
> > > 18036
> > > 33935
> > > 91648
> > > 34582
> > > 22853
> > > 61286
> > > 19489
> > >
> > > Below patch has fix - "Run the hctx queue for which request was
> > > completed instead of running all the hardware queue."
> > > If this looks valid fix, please include in V8 OR I can post separate
> > > patch for this. Just want to have some level of review from this
> discussion.
> > >
> > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index
> > > 0652acd..f52118f 100644
> > > --- a/drivers/scsi/scsi_lib.c
> > > +++ b/drivers/scsi/scsi_lib.c
> > > @@ -554,6 +554,7 @@ static bool scsi_end_request(struct request *req,
> > > blk_status_t error,
> > >         struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> > >         struct scsi_device *sdev = cmd->device;
> > >         struct request_queue *q = sdev->request_queue;
> > > +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> > >
> > >         if (blk_update_request(req, error, bytes))
> > >                 return true;
> > > @@ -595,7 +596,8 @@ static bool scsi_end_request(struct request *req,
> > > blk_status_t error,
> > >             !list_empty(&sdev->host->starved_list))
> > >                 kblockd_schedule_work(&sdev->requeue_work);
> > >         else
> > > -               blk_mq_run_hw_queues(q, true);
> > > +               blk_mq_run_hw_queue(mq_hctx, true);
> > > +               //blk_mq_run_hw_queues(q, true);
> >
> > This way may cause IO hang because ->device_busy is shared by all hctxs.
> 
> From SCSI stack, if we attempt to run all h/w queue, is it possible that
> block layer actually run hw_queue which has really not queued any IO.
> Currently, in case of mq-deadline, IOS are inserted using
> "dd_insert_request". This function will add IOs on elevator data which is
> per request queue and not per hctx.
> When there is an attempt to run hctx, "blk_mq_sched_has_work" will check
> pending work which is per request queue and not per hctx.
> Because of this, IOs queued on only one hctx will be run from all the hctx
> and this will create unnecessary lock contention.

Deadline is supposed to be for HDDs / slow disks, so the lock contention
shouldn't have been a problem, given there was seldom an MQ HDD before this
patchset.

Another related issue is the default scheduler; I guess deadline should still
have been the default io sched for HDDs attached to this kind of HBA with
multiple reply queues and a single submission queue.

> 
> How about below patch - ?
> 
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 126021f..1d30bd3 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
> blk_mq_hw_ctx *hctx)
>  {
>         struct elevator_queue *e = hctx->queue->elevator;
> 
> +       /* If current hctx has not queued any request, there is no need to
> run.
> +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> +        * running specific hctx.
> +        */
> +       if (!hctx->queued)
> +               return false;
> +
>         if (e && e->type->ops.has_work)
>                 return e->type->ops.has_work(hctx);

->queued is only ever increased and never decreased (it is just for debug
purposes so far), so it can't be relied on for this purpose.

One approach is to add a similar counter, and maintain it via the scheduler's
insert/dispatch callbacks.
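For instance, something along these lines (untested, just to sketch the
idea; the field name here is made up):

	/* in struct blk_mq_hw_ctx: requests currently held by the scheduler */
	atomic_t		sched_queued;

atomic_inc() it in the scheduler's ->insert_requests() callback, atomic_dec()
it when the request is completed/freed, and let blk_mq_sched_has_work() bail
out early when it reads zero.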

Thanks,
Ming


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-16  1:00           ` Ming Lei
@ 2020-06-17 11:26             ` Kashyap Desai
  2020-06-22  6:24               ` Hannes Reinecke
  0 siblings, 1 reply; 45+ messages in thread
From: Kashyap Desai @ 2020-06-17 11:26 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> On Mon, Jun 15, 2020 at 12:27:30PM +0530, Kashyap Desai wrote:
> > > >
> > > > John -
> > > >
> > > > I tried V7 series and debug further on mq-deadline interface. This
> > > > time I have used another setup since HDD based setup is not
> > > > readily available for me.
> > > > In fact, I was able to simulate issue very easily using single
> > > > scsi_device as well. BTW, this is not an issue with this RFC, but
> > generic issue.
> > > > Since I have converted nr_hw_queue > 1 for Broadcom product using
> > > > this RFC, It becomes noticeable now.
> > > >
> > > > Problem - Using below command  I see heavy CPU utilization on "
> > > > native_queued_spin_lock_slowpath". This is because kblockd work
> > > > queue is submitting IO from all the CPUs even though fio is bound
> > > > to single
> > CPU.
> > > > Lock contention from " dd_dispatch_request" is causing this issue.
> > > >
> > > > numactl -C 13  fio
> > > > single.fio --iodepth=32 --bs=4k --rw=randread --ioscheduler=none
> > > > --numjobs=1  --cpus_allowed_policy=split --ioscheduler=mq-deadline
> > > > --group_reporting --filename=/dev/sdd
> > > >
> > > > While running above command, ideally we expect only kworker/13 to
> > > > be
> > > active.
> > > > But you can see below - All the CPU is attempting submission and
> > > > lots of CPU consumption is due to lock contention.
> > > >
> > > >   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM
TIME+
> > COMMAND
> > > >  2726 root       0 -20       0      0      0 R  56.5  0.0
0:53.20
> > > > kworker/13:1H-k
> > > >  7815 root      20   0  712404  15536   2228 R  43.2  0.0
0:05.03
> > fio
> > > >  2792 root       0 -20       0      0      0 I  26.6  0.0
0:22.19
> > > > kworker/18:1H-k
> > > >  2791 root       0 -20       0      0      0 I  19.9  0.0
0:17.17
> > > > kworker/19:1H-k
> > > >  1419 root       0 -20       0      0      0 I  19.6  0.0
0:17.03
> > > > kworker/20:1H-k
> > > >  2793 root       0 -20       0      0      0 I  18.3  0.0
0:15.64
> > > > kworker/21:1H-k
> > > >  1424 root       0 -20       0      0      0 I  17.3  0.0
0:14.99
> > > > kworker/22:1H-k
> > > >  2626 root       0 -20       0      0      0 I  16.9  0.0
0:14.68
> > > > kworker/26:1H-k
> > > >  2794 root       0 -20       0      0      0 I  16.9  0.0
0:14.87
> > > > kworker/23:1H-k
> > > >  2795 root       0 -20       0      0      0 I  16.9  0.0
0:14.81
> > > > kworker/24:1H-k
> > > >  2797 root       0 -20       0      0      0 I  16.9  0.0
0:14.62
> > > > kworker/27:1H-k
> > > >  1415 root       0 -20       0      0      0 I  16.6  0.0
0:14.44
> > > > kworker/30:1H-k
> > > >  2669 root       0 -20       0      0      0 I  16.6  0.0
0:14.38
> > > > kworker/31:1H-k
> > > >  2796 root       0 -20       0      0      0 I  16.6  0.0
0:14.74
> > > > kworker/25:1H-k
> > > >  2799 root       0 -20       0      0      0 I  16.6  0.0
0:14.56
> > > > kworker/28:1H-k
> > > >  1425 root       0 -20       0      0      0 I  16.3  0.0
0:14.21
> > > > kworker/34:1H-k
> > > >  2746 root       0 -20       0      0      0 I  16.3  0.0
0:14.33
> > > > kworker/32:1H-k
> > > >  2798 root       0 -20       0      0      0 I  16.3  0.0
0:14.50
> > > > kworker/29:1H-k
> > > >  2800 root       0 -20       0      0      0 I  16.3  0.0
0:14.27
> > > > kworker/33:1H-k
> > > >  1423 root       0 -20       0      0      0 I  15.9  0.0
0:14.10
> > > > kworker/54:1H-k
> > > >  1784 root       0 -20       0      0      0 I  15.9  0.0
0:14.03
> > > > kworker/55:1H-k
> > > >  2801 root       0 -20       0      0      0 I  15.9  0.0
0:14.15
> > > > kworker/35:1H-k
> > > >  2815 root       0 -20       0      0      0 I  15.9  0.0
0:13.97
> > > > kworker/56:1H-k
> > > >  1484 root       0 -20       0      0      0 I  15.6  0.0
0:13.90
> > > > kworker/57:1H-k
> > > >  1485 root       0 -20       0      0      0 I  15.6  0.0
0:13.82
> > > > kworker/59:1H-k
> > > >  1519 root       0 -20       0      0      0 I  15.6  0.0
0:13.64
> > > > kworker/62:1H-k
> > > >  2315 root       0 -20       0      0      0 I  15.6  0.0
0:13.87
> > > > kworker/58:1H-k
> > > >  2627 root       0 -20       0      0      0 I  15.6  0.0
0:13.69
> > > > kworker/61:1H-k
> > > >  2816 root       0 -20       0      0      0 I  15.6  0.0
0:13.75
> > > > kworker/60:1H-k
> > > >
> > > >
> > > > I root cause this issue -
> > > >
> > > > Block layer always queue IO on hctx context mapped to CPU-13, but
> > > > hw queue run from all the hctx context.
> > > > I noticed in my test hctx48 has queued all the IOs. No other hctx
> > > > has queued IO. But all the hctx is counting for "run".
> > > >
> > > > # cat hctx48/queued
> > > > 2087058
> > > >
> > > > #cat hctx*/run
> > > > 151318
> > > > 30038
> > > > 83110
> > > > 50680
> > > > 69907
> > > > 60391
> > > > 111239
> > > > 18036
> > > > 33935
> > > > 91648
> > > > 34582
> > > > 22853
> > > > 61286
> > > > 19489
> > > >
> > > > Below patch has fix - "Run the hctx queue for which request was
> > > > completed instead of running all the hardware queue."
> > > > If this looks valid fix, please include in V8 OR I can post
> > > > separate patch for this. Just want to have some level of review
> > > > from this
> > discussion.
> > > >
> > > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> > > > index 0652acd..f52118f 100644
> > > > --- a/drivers/scsi/scsi_lib.c
> > > > +++ b/drivers/scsi/scsi_lib.c
> > > > @@ -554,6 +554,7 @@ static bool scsi_end_request(struct request
> > > > *req, blk_status_t error,
> > > >         struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> > > >         struct scsi_device *sdev = cmd->device;
> > > >         struct request_queue *q = sdev->request_queue;
> > > > +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> > > >
> > > >         if (blk_update_request(req, error, bytes))
> > > >                 return true;
> > > > @@ -595,7 +596,8 @@ static bool scsi_end_request(struct request
> > > > *req, blk_status_t error,
> > > >             !list_empty(&sdev->host->starved_list))
> > > >                 kblockd_schedule_work(&sdev->requeue_work);
> > > >         else
> > > > -               blk_mq_run_hw_queues(q, true);
> > > > +               blk_mq_run_hw_queue(mq_hctx, true);
> > > > +               //blk_mq_run_hw_queues(q, true);
> > >
> > > This way may cause IO hang because ->device_busy is shared by all
hctxs.
> >
> > From SCSI stack, if we attempt to run all h/w queue, is it possible
> > that block layer actually run hw_queue which has really not queued any
IO.
> > Currently, in case of mq-deadline, IOS are inserted using
> > "dd_insert_request". This function will add IOs on elevator data which
> > is per request queue and not per hctx.
> > When there is an attempt to run hctx, "blk_mq_sched_has_work" will
> > check pending work which is per request queue and not per hctx.
> > Because of this, IOs queued on only one hctx will be run from all the
> > hctx and this will create unnecessary lock contention.
>
> Deadline is supposed for HDD. slow disks, so the lock contention
shouldn't
> have been one problem given there is seldom MQ HDD. before this
patchset.
>
> Another related issue is default scheduler, I guess deadline still
should have
> been the default io sched for HDDs. attached to this kind HBA with
multiple
> reply queues and single submission queue.
>
> >
> > How about below patch - ?
> >
> > diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h index
> > 126021f..1d30bd3 100644
> > --- a/block/blk-mq-sched.h
> > +++ b/block/blk-mq-sched.h
> > @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
> > blk_mq_hw_ctx *hctx)  {
> >         struct elevator_queue *e = hctx->queue->elevator;
> >
> > +       /* If current hctx has not queued any request, there is no
> > + need to
> > run.
> > +        * blk_mq_run_hw_queue() on hctx which has queued IO will
handle
> > +        * running specific hctx.
> > +        */
> > +       if (!hctx->queued)
> > +               return false;
> > +
> >         if (e && e->type->ops.has_work)
> >                 return e->type->ops.has_work(hctx);
>
> ->queued is increased only and not decreased just for debug purpose so
> ->far, so
> it can't be relied for this purpose.

Thanks. I overlooked that it is only an incremental counter.

>
> One approach is to add one similar counter, and maintain it by
scheduler's
> insert/dispatch callback.

I tried the below and I see that performance is in the expected range.

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index fdcc2c1..ea201d0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -485,6 +485,7 @@ void blk_mq_sched_insert_request(struct request *rq,
bool at_head,

                list_add(&rq->queuelist, &list);
                e->type->ops.insert_requests(hctx, &list, at_head);
+               atomic_inc(&hctx->elevator_queued);
        } else {
                spin_lock(&ctx->lock);
                __blk_mq_insert_request(hctx, rq, at_head);
@@ -511,8 +512,10 @@ void blk_mq_sched_insert_requests(struct
blk_mq_hw_ctx *hctx,
        percpu_ref_get(&q->q_usage_counter);

        e = hctx->queue->elevator;
-       if (e && e->type->ops.insert_requests)
+       if (e && e->type->ops.insert_requests) {
                e->type->ops.insert_requests(hctx, list, false);
+               atomic_inc(&hctx->elevator_queued);
+       }
        else {
                /*
                 * try to issue requests directly if the hw queue isn't
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021f..946b47a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
blk_mq_hw_ctx *hctx)
 {
        struct elevator_queue *e = hctx->queue->elevator;

+       /* If current hctx has not queued any request, there is no need to
run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
+
        if (e && e->type->ops.has_work)
                return e->type->ops.has_work(hctx);

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f73a2f9..48f1824 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -517,8 +517,10 @@ void blk_mq_free_request(struct request *rq)
        struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

        if (rq->rq_flags & RQF_ELVPRIV) {
-               if (e && e->type->ops.finish_request)
+               if (e && e->type->ops.finish_request) {
                        e->type->ops.finish_request(rq);
+                       atomic_dec(&hctx->elevator_queued);
+               }
                if (rq->elv.icq) {
                        put_io_context(rq->elv.icq->ioc);
                        rq->elv.icq = NULL;
@@ -2571,6 +2573,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct
blk_mq_tag_set *set,
                goto free_hctx;

        atomic_set(&hctx->nr_active, 0);
+       atomic_set(&hctx->elevator_queued, 0);
        if (node == NUMA_NO_NODE)
                node = set->numa_node;
        hctx->numa_node = node;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 66711c7..ea1ddb1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
         * shared across request queues.
         */
        atomic_t                nr_active;
+       /**
+        * @elevator_queued: Number of queued requests on hctx.
+        */
+       atomic_t                elevator_queued;

        /** @cpuhp_online: List to store request if CPU is going to die */
        struct hlist_node       cpuhp_online;



>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-17 11:26             ` Kashyap Desai
@ 2020-06-22  6:24               ` Hannes Reinecke
  2020-06-23  0:55                 ` Ming Lei
  0 siblings, 1 reply; 45+ messages in thread
From: Hannes Reinecke @ 2020-06-22  6:24 UTC (permalink / raw)
  To: Kashyap Desai, Ming Lei
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

On 6/17/20 1:26 PM, Kashyap Desai wrote:
>>
>> ->queued is increased only and not decreased just for debug purpose so
>> ->far, so
>> it can't be relied for this purpose.
> 
> Thanks. I overlooked that that it is only incremental counter.
> 
>>
>> One approach is to add one similar counter, and maintain it by
> scheduler's
>> insert/dispatch callback.
> 
> I tried below  and I see performance is on expected range.
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index fdcc2c1..ea201d0 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -485,6 +485,7 @@ void blk_mq_sched_insert_request(struct request *rq,
> bool at_head,
> 
>                  list_add(&rq->queuelist, &list);
>                  e->type->ops.insert_requests(hctx, &list, at_head);
> +               atomic_inc(&hctx->elevator_queued);
>          } else {
>                  spin_lock(&ctx->lock);
>                  __blk_mq_insert_request(hctx, rq, at_head);
> @@ -511,8 +512,10 @@ void blk_mq_sched_insert_requests(struct
> blk_mq_hw_ctx *hctx,
>          percpu_ref_get(&q->q_usage_counter);
> 
>          e = hctx->queue->elevator;
> -       if (e && e->type->ops.insert_requests)
> +       if (e && e->type->ops.insert_requests) {
>                  e->type->ops.insert_requests(hctx, list, false);
> +               atomic_inc(&hctx->elevator_queued);
> +       }
>          else {
>                  /*
>                   * try to issue requests directly if the hw queue isn't
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 126021f..946b47a 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
> blk_mq_hw_ctx *hctx)
>   {
>          struct elevator_queue *e = hctx->queue->elevator;
> 
> +       /* If current hctx has not queued any request, there is no need to
> run.
> +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> +        * running specific hctx.
> +        */
> +       if (!atomic_read(&hctx->elevator_queued))
> +               return false;
> +
>          if (e && e->type->ops.has_work)
>                  return e->type->ops.has_work(hctx);
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f73a2f9..48f1824 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -517,8 +517,10 @@ void blk_mq_free_request(struct request *rq)
>          struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> 
>          if (rq->rq_flags & RQF_ELVPRIV) {
> -               if (e && e->type->ops.finish_request)
> +               if (e && e->type->ops.finish_request) {
>                          e->type->ops.finish_request(rq);
> +                       atomic_dec(&hctx->elevator_queued);
> +               }
>                  if (rq->elv.icq) {
>                          put_io_context(rq->elv.icq->ioc);
>                          rq->elv.icq = NULL;
> @@ -2571,6 +2573,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct
> blk_mq_tag_set *set,
>                  goto free_hctx;
> 
>          atomic_set(&hctx->nr_active, 0);
> +       atomic_set(&hctx->elevator_queued, 0);
>          if (node == NUMA_NO_NODE)
>                  node = set->numa_node;
>          hctx->numa_node = node;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 66711c7..ea1ddb1 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
>           * shared across request queues.
>           */
>          atomic_t                nr_active;
> +       /**
> +        * @elevator_queued: Number of queued requests on hctx.
> +        */
> +       atomic_t                elevator_queued;
> 
>          /** @cpuhp_online: List to store request if CPU is going to die */
>          struct hlist_node       cpuhp_online;
> 
> 
> 
Would it make sense to move it into the elevator itself?
It's a bit odd having a value 'elevator_queued' with no direct reference 
to the elevator...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-22  6:24               ` Hannes Reinecke
@ 2020-06-23  0:55                 ` Ming Lei
  2020-06-23 11:50                   ` Kashyap Desai
  2020-06-23 12:11                   ` Kashyap Desai
  0 siblings, 2 replies; 45+ messages in thread
From: Ming Lei @ 2020-06-23  0:55 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Kashyap Desai, John Garry, axboe, jejb, martin.petersen,
	don.brace, Sumit Saxena, bvanassche, hare, hch,
	Shivasharan Srikanteshwara, linux-block, linux-scsi,
	esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

On Mon, Jun 22, 2020 at 08:24:39AM +0200, Hannes Reinecke wrote:
> On 6/17/20 1:26 PM, Kashyap Desai wrote:
> > > 
> > > ->queued is increased only and not decreased just for debug purpose so
> > > ->far, so
> > > it can't be relied for this purpose.
> > 
> > Thanks. I overlooked that that it is only incremental counter.
> > 
> > > 
> > > One approach is to add one similar counter, and maintain it by
> > scheduler's
> > > insert/dispatch callback.
> > 
> > I tried below  and I see performance is on expected range.
> > 
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index fdcc2c1..ea201d0 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -485,6 +485,7 @@ void blk_mq_sched_insert_request(struct request *rq,
> > bool at_head,
> > 
> >                  list_add(&rq->queuelist, &list);
> >                  e->type->ops.insert_requests(hctx, &list, at_head);
> > +               atomic_inc(&hctx->elevator_queued);
> >          } else {
> >                  spin_lock(&ctx->lock);
> >                  __blk_mq_insert_request(hctx, rq, at_head);
> > @@ -511,8 +512,10 @@ void blk_mq_sched_insert_requests(struct
> > blk_mq_hw_ctx *hctx,
> >          percpu_ref_get(&q->q_usage_counter);
> > 
> >          e = hctx->queue->elevator;
> > -       if (e && e->type->ops.insert_requests)
> > +       if (e && e->type->ops.insert_requests) {
> >                  e->type->ops.insert_requests(hctx, list, false);
> > +               atomic_inc(&hctx->elevator_queued);
> > +       }
> >          else {
> >                  /*
> >                   * try to issue requests directly if the hw queue isn't
> > diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> > index 126021f..946b47a 100644
> > --- a/block/blk-mq-sched.h
> > +++ b/block/blk-mq-sched.h
> > @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
> > blk_mq_hw_ctx *hctx)
> >   {
> >          struct elevator_queue *e = hctx->queue->elevator;
> > 
> > +       /* If current hctx has not queued any request, there is no need to
> > run.
> > +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> > +        * running specific hctx.
> > +        */
> > +       if (!atomic_read(&hctx->elevator_queued))
> > +               return false;
> > +
> >          if (e && e->type->ops.has_work)
> >                  return e->type->ops.has_work(hctx);
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index f73a2f9..48f1824 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -517,8 +517,10 @@ void blk_mq_free_request(struct request *rq)
> >          struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> > 
> >          if (rq->rq_flags & RQF_ELVPRIV) {
> > -               if (e && e->type->ops.finish_request)
> > +               if (e && e->type->ops.finish_request) {
> >                          e->type->ops.finish_request(rq);
> > +                       atomic_dec(&hctx->elevator_queued);
> > +               }
> >                  if (rq->elv.icq) {
> >                          put_io_context(rq->elv.icq->ioc);
> >                          rq->elv.icq = NULL;
> > @@ -2571,6 +2573,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct
> > blk_mq_tag_set *set,
> >                  goto free_hctx;
> > 
> >          atomic_set(&hctx->nr_active, 0);
> > +       atomic_set(&hctx->elevator_queued, 0);
> >          if (node == NUMA_NO_NODE)
> >                  node = set->numa_node;
> >          hctx->numa_node = node;
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index 66711c7..ea1ddb1 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
> >           * shared across request queues.
> >           */
> >          atomic_t                nr_active;
> > +       /**
> > +        * @elevator_queued: Number of queued requests on hctx.
> > +        */
> > +       atomic_t                elevator_queued;
> > 
> >          /** @cpuhp_online: List to store request if CPU is going to die */
> >          struct hlist_node       cpuhp_online;
> > 
> > 
> > 
> Would it make sense to move it into the elevator itself?

That is my initial suggestion: the counter is only maintained for bfq &
mq-deadline, so we needn't pay the cost for the others.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-11  8:26     ` John Garry
@ 2020-06-23 11:25       ` John Garry
  2020-06-23 14:23         ` Hannes Reinecke
  0 siblings, 1 reply; 45+ messages in thread
From: John Garry @ 2020-06-23 11:25 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	megaraidlinux.pdl

On 11/06/2020 09:26, John Garry wrote:
> On 11/06/2020 03:57, Ming Lei wrote:
>> On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
>>> From: Hannes Reinecke <hare@suse.de>
>>>
>>> The function does not set the depth, but rather transitions from
>>> shared to non-shared queues and vice versa.
>>> So rename it to blk_mq_update_tag_set_shared() to better reflect
>>> its purpose.
>>
>> It is fine to rename it for me, however:
>>
>> This patch claims to rename blk_mq_update_tag_set_shared(), but also
>> change blk_mq_init_bitmap_tags's signature.
> 
> I was going to update the commit message here, but forgot again...
> 
>>
>> So suggest to split this patch into two or add comment log on changing
>> blk_mq_init_bitmap_tags().
> 
> I think I'll just split into 2x commits.

Hi Hannes,

Do you have any issue with splitting the undocumented changes into 
another patch as so:

-------------------->8---------------------

 From db3f8ec1295efbf53273ffc137d348857cbd411e Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@suse.de>
Date: Tue, 23 Jun 2020 12:07:33 +0100
Subject: [PATCH] blk-mq: Free tags in blk_mq_init_tags() upon error

Since the tags are allocated in blk_mq_init_tags(), it's better practice
to free them in that same function upon error, rather than in a callee
whose job is to init the bitmap tags - blk_mq_init_bitmap_tags().

Signed-off-by: Hannes Reinecke <hare@suse.de>
[jpg: split from an earlier patch with a new commit message, minor mod 
to return NULL directly for error]
Signed-off-by: John Garry <john.garry@huawei.com>

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 1085dc9848f3..b8972014cd90 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -487,24 +487,22 @@ static int bt_alloc(struct sbitmap_queue *bt, 
unsigned int depth,
  				       node);
  }

-static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct blk_mq_tags 
*tags,
-						   int node, int alloc_policy)
+static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
+				   int node, int alloc_policy)
  {
  	unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
  	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;

  	if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node))
-		goto free_tags;
+		return -ENOMEM;
  	if (bt_alloc(&tags->breserved_tags, tags->nr_reserved_tags, round_robin,
  		     node))
  		goto free_bitmap_tags;

-	return tags;
+	return 0;
  free_bitmap_tags:
  	sbitmap_queue_free(&tags->bitmap_tags);
-free_tags:
-	kfree(tags);
-	return NULL;
+	return -ENOMEM;
  }

  struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
@@ -525,7 +523,11 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int 
total_tags,
  	tags->nr_tags = total_tags;
  	tags->nr_reserved_tags = reserved_tags;

-	return blk_mq_init_bitmap_tags(tags, node, alloc_policy);
+	if (blk_mq_init_bitmap_tags(tags, node, alloc_policy) < 0) {
+		kfree(tags);
+		return NULL;
+	}
+	return tags;
  }

  void blk_mq_free_tags(struct blk_mq_tags *tags)

--------------------8<---------------------

Thanks

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-23  0:55                 ` Ming Lei
@ 2020-06-23 11:50                   ` Kashyap Desai
  2020-06-23 12:11                   ` Kashyap Desai
  1 sibling, 0 replies; 45+ messages in thread
From: Kashyap Desai @ 2020-06-23 11:50 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

>
> On Mon, Jun 22, 2020 at 08:24:39AM +0200, Hannes Reinecke wrote:
> > On 6/17/20 1:26 PM, Kashyap Desai wrote:
> > > >
> > > > ->queued is increased only and not decreased just for debug
> > > > ->purpose so far, so
> > > > it can't be relied for this purpose.
> > >
> > > Thanks. I overlooked that that it is only incremental counter.
> > >
> > > >
> > > > One approach is to add one similar counter, and maintain it by
> > > scheduler's
> > > > insert/dispatch callback.
> > >
> > > I tried below  and I see performance is on expected range.
> > >
> > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index
> > > fdcc2c1..ea201d0 100644
> > > --- a/block/blk-mq-sched.c
> > > +++ b/block/blk-mq-sched.c
> > > @@ -485,6 +485,7 @@ void blk_mq_sched_insert_request(struct request
> > > *rq, bool at_head,
> > >
> > >                  list_add(&rq->queuelist, &list);
> > >                  e->type->ops.insert_requests(hctx, &list, at_head);
> > > +               atomic_inc(&hctx->elevator_queued);
> > >          } else {
> > >                  spin_lock(&ctx->lock);
> > >                  __blk_mq_insert_request(hctx, rq, at_head); @@
> > > -511,8 +512,10 @@ void blk_mq_sched_insert_requests(struct
> > > blk_mq_hw_ctx *hctx,
> > >          percpu_ref_get(&q->q_usage_counter);
> > >
> > >          e = hctx->queue->elevator;
> > > -       if (e && e->type->ops.insert_requests)
> > > +       if (e && e->type->ops.insert_requests) {
> > >                  e->type->ops.insert_requests(hctx, list, false);
> > > +               atomic_inc(&hctx->elevator_queued);
> > > +       }
> > >          else {
> > >                  /*
> > >                   * try to issue requests directly if the hw queue
> > > isn't diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h index
> > > 126021f..946b47a 100644
> > > --- a/block/blk-mq-sched.h
> > > +++ b/block/blk-mq-sched.h
> > > @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
> > > blk_mq_hw_ctx *hctx)
> > >   {
> > >          struct elevator_queue *e = hctx->queue->elevator;
> > >
> > > +       /* If current hctx has not queued any request, there is no
> > > + need to
> > > run.
> > > +        * blk_mq_run_hw_queue() on hctx which has queued IO will
handle
> > > +        * running specific hctx.
> > > +        */
> > > +       if (!atomic_read(&hctx->elevator_queued))
> > > +               return false;
> > > +
> > >          if (e && e->type->ops.has_work)
> > >                  return e->type->ops.has_work(hctx);
> > >
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c index f73a2f9..48f1824
> > > 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -517,8 +517,10 @@ void blk_mq_free_request(struct request *rq)
> > >          struct blk_mq_hw_ctx *hctx = rq->mq_hctx;
> > >
> > >          if (rq->rq_flags & RQF_ELVPRIV) {
> > > -               if (e && e->type->ops.finish_request)
> > > +               if (e && e->type->ops.finish_request) {
> > >                          e->type->ops.finish_request(rq);
> > > +                       atomic_dec(&hctx->elevator_queued);
> > > +               }
> > >                  if (rq->elv.icq) {
> > >                          put_io_context(rq->elv.icq->ioc);
> > >                          rq->elv.icq = NULL; @@ -2571,6 +2573,7 @@
> > > blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set
> > > *set,
> > >                  goto free_hctx;
> > >
> > >          atomic_set(&hctx->nr_active, 0);
> > > +       atomic_set(&hctx->elevator_queued, 0);
> > >          if (node == NUMA_NO_NODE)
> > >                  node = set->numa_node;
> > >          hctx->numa_node = node;
> > > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index
> > > 66711c7..ea1ddb1 100644
> > > --- a/include/linux/blk-mq.h
> > > +++ b/include/linux/blk-mq.h
> > > @@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
> > >           * shared across request queues.
> > >           */
> > >          atomic_t                nr_active;
> > > +       /**
> > > +        * @elevator_queued: Number of queued requests on hctx.
> > > +        */
> > > +       atomic_t                elevator_queued;
> > >
> > >          /** @cpuhp_online: List to store request if CPU is going to
die */
> > >          struct hlist_node       cpuhp_online;
> > >
> > >
> > >
> > Would it make sense to move it into the elevator itself?

I am not sure where exactly I should add this counter, since I need a
counter per hctx. Elevator data is per request queue.
Please suggest.

>
> That is my initial suggestion, and the counter is just done for bfq &
mq-
> deadline, then we needn't to pay the cost for others.

I have updated patch -

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a1123d4..3e0005c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4640,6 +4640,12 @@ static bool bfq_has_work(struct blk_mq_hw_ctx
*hctx)
 {
        struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;

+       /* If current hctx has not queued any request, there is no need to
run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
        /*
         * Avoiding lock: a race on bfqd->busy_queues should cause at
         * most a call to dispatch for nothing
@@ -5554,6 +5561,7 @@ static void bfq_insert_requests(struct blk_mq_hw_ctx
*hctx,
                rq = list_first_entry(list, struct request, queuelist);
                list_del_init(&rq->queuelist);
                bfq_insert_request(hctx, rq, at_head);
+              atomic_inc(&hctx->elevator_queued);
        }
 }

@@ -5925,6 +5933,7 @@ static void bfq_finish_requeue_request(struct
request *rq)

        if (likely(rq->rq_flags & RQF_STARTED)) {
                unsigned long flags;
+              struct blk_mq_hw_ctx *mq_hctx = rq->mq_hctx;

                spin_lock_irqsave(&bfqd->lock, flags);

@@ -5934,6 +5943,7 @@ static void bfq_finish_requeue_request(struct
request *rq)
                bfq_completed_request(bfqq, bfqd);
                bfq_finish_requeue_request_body(bfqq);

+              atomic_dec(&mq_hctx->elevator_queued);
                spin_unlock_irqrestore(&bfqd->lock, flags);
        } else {
                /*
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021f..946b47a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
blk_mq_hw_ctx *hctx)
 {
        struct elevator_queue *e = hctx->queue->elevator;

+       /* If current hctx has not queued any request, there is no need to
run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
+
        if (e && e->type->ops.has_work)
                return e->type->ops.has_work(hctx);

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f73a2f9..82dd152 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2571,6 +2571,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct
blk_mq_tag_set *set,
                goto free_hctx;

        atomic_set(&hctx->nr_active, 0);
+      atomic_set(&hctx->elevator_queued, 0);
        if (node == NUMA_NO_NODE)
                node = set->numa_node;
        hctx->numa_node = node;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b57470e..703ac55 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -533,6 +533,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx
*hctx,
                rq = list_first_entry(list, struct request, queuelist);
                list_del_init(&rq->queuelist);
                dd_insert_request(hctx, rq, at_head);
+              atomic_inc(&hctx->elevator_queued);
        }
        spin_unlock(&dd->lock);
 }
@@ -562,6 +563,7 @@ static void dd_prepare_request(struct request *rq)
 static void dd_finish_request(struct request *rq)
 {
        struct request_queue *q = rq->q;
+      struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

        if (blk_queue_is_zoned(q)) {
                struct deadline_data *dd = q->elevator->elevator_data;
@@ -570,15 +572,23 @@ static void dd_finish_request(struct request *rq)
                spin_lock_irqsave(&dd->zone_lock, flags);
                blk_req_zone_write_unlock(rq);
                if (!list_empty(&dd->fifo_list[WRITE]))
-                       blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
+                       blk_mq_sched_mark_restart_hctx(hctx);
                spin_unlock_irqrestore(&dd->zone_lock, flags);
        }
+       atomic_dec(&hctx->elevator_queued);
 }

 static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct deadline_data *dd = hctx->queue->elevator->elevator_data;

+       /* If current hctx has not queued any request, there is no need to
run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
+
        return !list_empty_careful(&dd->dispatch) ||
                !list_empty_careful(&dd->fifo_list[0]) ||
                !list_empty_careful(&dd->fifo_list[1]);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 66711c7..ea1ddb1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
         * shared across request queues.
         */
        atomic_t                nr_active;
+       /**
+        * @elevator_queued: Number of queued requests on hctx.
+        */
+       atomic_t                elevator_queued;

        /** @cpuhp_online: List to store request if CPU is going to die */
        struct hlist_node       cpuhp_online;


>
> Thanks,
> Ming

^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
  2020-06-23  0:55                 ` Ming Lei
  2020-06-23 11:50                   ` Kashyap Desai
@ 2020-06-23 12:11                   ` Kashyap Desai
  1 sibling, 0 replies; 45+ messages in thread
From: Kashyap Desai @ 2020-06-23 12:11 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, bvanassche, hare, hch, Shivasharan Srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	PDL,MEGARAIDLINUX

> > > >
> > > Would it make sense to move it into the elevator itself?
>
> I am not sure where exactly I should add this counter since I need
counter per
> hctx. Elevator data is per request object.
> Please suggest.
>
> >
> > That is my initial suggestion, and the counter is just done for bfq &
> > mq- deadline, then we needn't to pay the cost for others.
>
> I have updated patch -
>
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index
a1123d4..3e0005c
> 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -4640,6 +4640,12 @@ static bool bfq_has_work(struct blk_mq_hw_ctx
> *hctx)  {
>         struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
>
> +       /* If current hctx has not queued any request, there is no need
to run.
> +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> +        * running specific hctx.
> +        */
> +       if (!atomic_read(&hctx->elevator_queued))
> +               return false;
>         /*
>          * Avoiding lock: a race on bfqd->busy_queues should cause at
>          * most a call to dispatch for nothing @@ -5554,6 +5561,7 @@
static void
> bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
>                 rq = list_first_entry(list, struct request, queuelist);
>                 list_del_init(&rq->queuelist);
>                 bfq_insert_request(hctx, rq, at_head);
> +              atomic_inc(&hctx->elevator_queued);
>         }
>  }
>
> @@ -5925,6 +5933,7 @@ static void bfq_finish_requeue_request(struct
> request *rq)
>
>         if (likely(rq->rq_flags & RQF_STARTED)) {
>                 unsigned long flags;
> +              struct blk_mq_hw_ctx *mq_hctx = rq->mq_hctx;
>
>                 spin_lock_irqsave(&bfqd->lock, flags);
>
> @@ -5934,6 +5943,7 @@ static void bfq_finish_requeue_request(struct
> request *rq)
>                 bfq_completed_request(bfqq, bfqd);
>                 bfq_finish_requeue_request_body(bfqq);
>
> +              atomic_dec(&mq_hctx->elevator_queued);
>                 spin_unlock_irqrestore(&bfqd->lock, flags);
>         } else {
>                 /*
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h index
> 126021f..946b47a 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct
> blk_mq_hw_ctx *hctx)  {
>         struct elevator_queue *e = hctx->queue->elevator;
>
> +       /* If current hctx has not queued any request, there is no need
to run.
> +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> +        * running specific hctx.
> +        */
> +       if (!atomic_read(&hctx->elevator_queued))
> +               return false;
> +

I missed this. I will remove the above code since it is now handled
within the mq-deadline and bfq-iosched *has_work* callbacks.

>         if (e && e->type->ops.has_work)
>                 return e->type->ops.has_work(hctx);
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-23 11:25       ` John Garry
@ 2020-06-23 14:23         ` Hannes Reinecke
  2020-06-24  8:13           ` Kashyap Desai
  0 siblings, 1 reply; 45+ messages in thread
From: Hannes Reinecke @ 2020-06-23 14:23 UTC (permalink / raw)
  To: John Garry, Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-block, linux-scsi, esc.storagedev, chenxiang66,
	Kashyap Desai


[-- Attachment #1: Type: text/plain, Size: 1385 bytes --]

On 6/23/20 1:25 PM, John Garry wrote:
> On 11/06/2020 09:26, John Garry wrote:
>> On 11/06/2020 03:57, Ming Lei wrote:
>>> On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
>>>> From: Hannes Reinecke <hare@suse.de>
>>>>
>>>> The function does not set the depth, but rather transitions from
>>>> shared to non-shared queues and vice versa.
>>>> So rename it to blk_mq_update_tag_set_shared() to better reflect
>>>> its purpose.
>>>
>>> It is fine to rename it for me, however:
>>>
>>> This patch claims to rename blk_mq_update_tag_set_shared(), but also
>>> change blk_mq_init_bitmap_tags's signature.
>>
>> I was going to update the commit message here, but forgot again...
>>
>>>
>>> So suggest to split this patch into two or add comment log on changing
>>> blk_mq_init_bitmap_tags().
>>
>> I think I'll just split into 2x commits.
> 
> Hi Hannes,
> 
> Do you have any issue with splitting the undocumented changes into 
> another patch as so:
> 
No, that's perfectly fine.

Kashyap, I've also attached an updated patch for the elevator_count 
patch; if you agree John can include it in the next version.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

[-- Attachment #2: 0001-elevator-count-requests-per-hctx-to-improve-performa.patch --]
[-- Type: text/x-patch, Size: 3484 bytes --]

From d50b5f773713070208c405f7c7056eb1afed896a Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@suse.de>
Date: Tue, 23 Jun 2020 16:18:40 +0200
Subject: [PATCH] elevator: count requests per hctx to improve performance

Add an 'elevator_queued' count to the hctx to avoid triggering
the elevator even though there are no requests queued.

Suggested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 block/bfq-iosched.c    | 5 +++++
 block/blk-mq.c         | 1 +
 block/mq-deadline.c    | 5 +++++
 include/linux/blk-mq.h | 4 ++++
 4 files changed, 15 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a1123d4d586d..3d63b35f6121 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4640,6 +4640,9 @@ static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
 {
 	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
 
+	if (!atomic_read(&hctx->elevator_queued))
+		return false;
+
 	/*
 	 * Avoiding lock: a race on bfqd->busy_queues should cause at
 	 * most a call to dispatch for nothing
@@ -5554,6 +5557,7 @@ static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
 		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		bfq_insert_request(hctx, rq, at_head);
+		atomic_inc(&hctx->elevator_queued);
 	}
 }
 
@@ -5933,6 +5937,7 @@ static void bfq_finish_requeue_request(struct request *rq)
 
 		bfq_completed_request(bfqq, bfqd);
 		bfq_finish_requeue_request_body(bfqq);
+		atomic_dec(&rq->mq_hctx->elevator_queued);
 
 		spin_unlock_irqrestore(&bfqd->lock, flags);
 	} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e06e8c9f326f..f5403fc97572 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2542,6 +2542,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 		goto free_hctx;
 
 	atomic_set(&hctx->nr_active, 0);
+	atomic_set(&hctx->elevator_queued, 0);
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 	hctx->numa_node = node;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b57470e154c8..9d753745e6be 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -533,6 +533,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		dd_insert_request(hctx, rq, at_head);
+		atomic_inc(&hctx->elevator_queued);
 	}
 	spin_unlock(&dd->lock);
 }
@@ -573,12 +574,16 @@ static void dd_finish_request(struct request *rq)
 			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
 		spin_unlock_irqrestore(&dd->zone_lock, flags);
 	}
+	atomic_dec(&rq->mq_hctx->elevator_queued);
 }
 
 static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
 
+	if (!atomic_read(&hctx->elevator_queued))
+		return false;
+
 	return !list_empty_careful(&dd->dispatch) ||
 		!list_empty_careful(&dd->fifo_list[0]) ||
 		!list_empty_careful(&dd->fifo_list[1]);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 66711c7234db..a18c506b14e7 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
 	 * shared across request queues.
 	 */
 	atomic_t		nr_active;
+	/**
+	 * @elevator_queued: Number of queued requests on hctx.
+	 */
+	atomic_t                elevator_queued;
 
 	/** @cpuhp_online: List to store request if CPU is going to die */
 	struct hlist_node	cpuhp_online;
-- 
2.26.2


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-23 14:23         ` Hannes Reinecke
@ 2020-06-24  8:13           ` Kashyap Desai
  2020-06-29 16:18             ` John Garry
  0 siblings, 1 reply; 45+ messages in thread
From: Kashyap Desai @ 2020-06-24  8:13 UTC (permalink / raw)
  To: Hannes Reinecke, John Garry, Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, Sumit Saxena,
	bvanassche, hare, hch, Shivasharan Srikanteshwara, linux-block,
	linux-scsi, esc.storagedev, chenxiang66

>
> On 6/23/20 1:25 PM, John Garry wrote:
> > On 11/06/2020 09:26, John Garry wrote:
> >> On 11/06/2020 03:57, Ming Lei wrote:
> >>> On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
> >>>> From: Hannes Reinecke <hare@suse.de>
> >>>>
> >>>> The function does not set the depth, but rather transitions from
> >>>> shared to non-shared queues and vice versa.
> >>>> So rename it to blk_mq_update_tag_set_shared() to better reflect
> >>>> its purpose.
> >>>
> >>> It is fine to rename it for me, however:
> >>>
> >>> This patch claims to rename blk_mq_update_tag_set_shared(), but also
> >>> change blk_mq_init_bitmap_tags's signature.
> >>
> >> I was going to update the commit message here, but forgot again...
> >>
> >>>
> >>> So suggest to split this patch into two or add comment log on
> >>> changing blk_mq_init_bitmap_tags().
> >>
> >> I think I'll just split into 2x commits.
> >
> > Hi Hannes,
> >
> > Do you have any issue with splitting the undocumented changes into
> > another patch as so:
> >
> No, that's perfectly fine.
>
> Kashyap, I've also attached an updated patch for the elevator_count patch;
> if
> you agree John can include it in the next version.

Hannes - the patch looks good. The header does not include a problem
statement. How about adding the below to the header?

High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
contention is possible in the mq-deadline and bfq io schedulers when
nr_hw_queues is more than one.
This is because the kblockd workqueue can submit IO from all online CPUs
(through blk_mq_run_hw_queues) even though only one hctx has pending
commands.
The elevator callback "has_work" for the mq-deadline and bfq schedulers
reports pending work if there are any IOs on the request queue; it does not
account for the hctx context.

I have not seen a performance drop after this patch, but I will continue
further testing.

John - one more thing: I am working on the megaraid_sas driver to provide
both host_tagset = 1 and 0 options through a module parameter.

Kashyap

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke            Teamlead Storage & Networking
> hare@suse.de                               +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809
> (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap)
  2020-06-12  6:06       ` Hannes Reinecke
@ 2020-06-29 15:32         ` John Garry
  2020-06-30  6:33           ` Hannes Reinecke
  2020-06-30 14:55           ` Bart Van Assche
  0 siblings, 2 replies; 45+ messages in thread
From: John Garry @ 2020-06-29 15:32 UTC (permalink / raw)
  To: axboe, linux-block
  Cc: Hannes Reinecke, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, bvanassche, hare, hch,
	shivasharan.srikanteshwara, linux-scsi, esc.storagedev,
	chenxiang66, megaraidlinux.pdl

Hi all,

I noticed that sbitmap_bitmap_show() only shows set bits and does not 
consider cleared bits. Is that the proper thing to do?

I ask because, in trying to support sbitmap_bitmap_show() for the hostwide
shared tags feature, we currently use blk_mq_queue_tag_busy_iter() to find
the active requests (and associated tags/bits) for a particular hctx.
So, AFAICT, this would give a change in behavior for sbitmap_bitmap_show(),
in that it would effectively show set and not cleared bits.

Any thoughts on this?

Thanks!

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-06-24  8:13           ` Kashyap Desai
@ 2020-06-29 16:18             ` John Garry
  0 siblings, 0 replies; 45+ messages in thread
From: John Garry @ 2020-06-29 16:18 UTC (permalink / raw)
  To: Kashyap Desai, Hannes Reinecke, Ming Lei
  Cc: axboe, jejb, martin.petersen, don.brace, Sumit Saxena,
	bvanassche, hare, hch, Shivasharan Srikanteshwara, linux-block,
	linux-scsi, esc.storagedev, chenxiang66

On 24/06/2020 09:13, Kashyap Desai wrote:
>>> Hi Hannes,
>>>
>>> Do you have any issue with splitting the undocumented changes into
>>> another patch as so:
>>>
>> No, that's perfectly fine.
>>
>> Kashyap, I've also attached an updated patch for the elevator_count patch;
>> if
>> you agree John can include it in the next version.

ok, but I need to check it myself.

> Hannes - Patch looks good.   Header does not include problem statement. How
> about adding below in header ?
> 
> High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
> contention is possible in mq-deadline and bfq io scheduler when nr_hw_queues
> is more than one.
> It is because kblockd work queue can submit IO from all online CPUs (through
> blk_mq_run_hw_queues) even though only one hctx has pending commands.
> Elevator callback "has_work" for mq-deadline and bfq scheduler consider
> pending work if there are any IOs on request queue and it does not account
> hctx context.
> 
> I have not seen performance drop after this patch, but I will continue
> further testing.
> 
> John - One more thing, I am working on megaraid_sas driver to provide both
> host_tagset = 1 and 0 option through module parameter.

I was hoping that we wouldn't have this, and have host_tagset = 1 
always. Or maybe host_tagset = 1 by default, and allow module param to 
set = 0. Your choice, though.
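
Something like the below would be enough, I guess (a rough sketch only, the
parameter name is made up):

	static bool host_tagset_enable = true;
	module_param(host_tagset_enable, bool, 0444);
	MODULE_PARM_DESC(host_tagset_enable,
			 "Shared host tagset enable/disable. Default: enabled (1)");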

Thanks,
John


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap)
  2020-06-29 15:32         ` About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap) John Garry
@ 2020-06-30  6:33           ` Hannes Reinecke
  2020-06-30  7:30             ` John Garry
  2020-06-30 14:55           ` Bart Van Assche
  1 sibling, 1 reply; 45+ messages in thread
From: Hannes Reinecke @ 2020-06-30  6:33 UTC (permalink / raw)
  To: John Garry, axboe, linux-block
  Cc: jejb, martin.petersen, don.brace, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 6/29/20 5:32 PM, John Garry wrote:
> Hi all,
> 
> I noticed that sbitmap_bitmap_show() only shows set bits and does not 
> consider cleared bits. Is that the proper thing to do?
> 
> I ask, as from trying to support sbitmap_bitmap_show() for hostwide 
> shared tags feature, we currently use blk_mq_queue_tag_busy_iter() to 
> find active requests (and associated tags/bits) for a particular hctx. 
> So, AFAICT, would give a change in behavior for sbitmap_bitmap_show(), 
> in that it would effectively show set and not cleared bits.
> 
Why would you need to do this?
What would be the point of traversing cleared bits?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap)
  2020-06-30  6:33           ` Hannes Reinecke
@ 2020-06-30  7:30             ` John Garry
  2020-06-30 11:36               ` John Garry
  0 siblings, 1 reply; 45+ messages in thread
From: John Garry @ 2020-06-30  7:30 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, linux-block
  Cc: jejb, martin.petersen, don.brace, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 30/06/2020 07:33, Hannes Reinecke wrote:
> On 6/29/20 5:32 PM, John Garry wrote:
>> Hi all,
>>
>> I noticed that sbitmap_bitmap_show() only shows set bits and does not 
>> consider cleared bits. Is that the proper thing to do?
>>
>> I ask, as from trying to support sbitmap_bitmap_show() for hostwide 
>> shared tags feature, we currently use blk_mq_queue_tag_busy_iter() to 
>> find active requests (and associated tags/bits) for a particular hctx. 
>> So, AFAICT, would give a change in behavior for sbitmap_bitmap_show(), 
>> in that it would effectively show set and not cleared bits.
>>
> Why would you need to do this?
> Where would be the point traversing cleared bits?

I'm not talking about traversing cleared bits specifically. I just think
that today sbitmap_bitmap_show() only showing the bits in
sbitmap_word.word may not be useful, or may even be misleading, as in
reality the "set" bits are sbitmap_word.word & ~sbitmap_word.cleared.

And for the hostwide shared tags feature, iterating the busy rqs to find
the per-hctx tags/bits would effectively give us the "set" bits above, so
there would be a difference.
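
As a rough sketch of the per-word accounting I mean (not proposing an
interface, just illustrating the arithmetic, with sb being the struct
sbitmap in question):

	unsigned int i, weight = 0;

	for (i = 0; i < sb->map_nr; i++) {
		unsigned long busy = READ_ONCE(sb->map[i].word) &
				     ~READ_ONCE(sb->map[i].cleared);

		/* bits allocated and not yet batch-cleared in this word */
		weight += bitmap_weight(&busy, sb->map[i].depth);
	}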

Thanks,
John

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap)
  2020-06-30  7:30             ` John Garry
@ 2020-06-30 11:36               ` John Garry
  0 siblings, 0 replies; 45+ messages in thread
From: John Garry @ 2020-06-30 11:36 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, linux-block
  Cc: jejb, martin.petersen, don.brace, kashyap.desai, sumit.saxena,
	ming.lei, bvanassche, hare, hch, shivasharan.srikanteshwara,
	linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 30/06/2020 08:30, John Garry wrote:
> On 30/06/2020 07:33, Hannes Reinecke wrote:
>> On 6/29/20 5:32 PM, John Garry wrote:
>>> Hi all,
>>>
>>> I noticed that sbitmap_bitmap_show() only shows set bits and does not 
>>> consider cleared bits. Is that the proper thing to do?
>>>
>>> I ask, as from trying to support sbitmap_bitmap_show() for hostwide 
>>> shared tags feature, we currently use blk_mq_queue_tag_busy_iter() to 
>>> find active requests (and associated tags/bits) for a particular 
>>> hctx. So, AFAICT, would give a change in behavior for 
>>> sbitmap_bitmap_show(), in that it would effectively show set and not 
>>> cleared bits.
>>>
>> Why would you need to do this?
>> Where would be the point traversing cleared bits?
> 
> I'm not talking about traversing cleared bits specifically. I just think 
> that today sbitmap_bitmap_show() only showing the bits in 
> sbitmap_word.word may not be useful or even misleading, as in reality 
> the "set" bits are sbitmap_word.word & ~sbitmap_word.cleared.
> 
> And for hostwide shared tags feature, iterating the busy rqs to find the 
> per-hctx tags/bits would effectively give us the "set" bits, above, so 
> there would be a difference.
> 

As an example, here's a sample tags_bitmap output:

00000000: 00f0 0fff 03c0 0000 0000 0000 efff fdff
00000010: 0000 c0f7 7fff ffff 0000 00e0 fef7 ffff
00000020: 0000 0000 f0ff ffff 0000 ffff 01d0 ffff
00000030: 0f80

And here's what we would have taking cleared bits into account:

00000000: 00f0 0fff 03c0 0000 0000 0000 0000 0000 (20 bits set)
00000010: 0000 0000 0000 0000 0000 0000 0000 0000
00000020: 0000 0000 0000 0000 0000 f8ff 0110 8000 (16 bits set)
00000030: 0f00					  (1 bit set)

Here's tags file output:

nr_tags=400
nr_reserved_tags=0
active_queues=2

bitmap_tags:
depth=400
busy=40
cleared=182
bits_per_word=64
map_nr=7
alloc_hint={22, 0, 0, 0, 103, 389, 231, 57, 377, 167, 0, 0, 69, 24, 44, 
50, 54,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0
, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
wake_batch=8
wake_index=0

[snip]

20+16+1=37 more closely matches busy=40.

So it seems sensible to go this way whether hostwide tags are used or not.

thanks,
John

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap)
  2020-06-29 15:32         ` About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap) John Garry
  2020-06-30  6:33           ` Hannes Reinecke
@ 2020-06-30 14:55           ` Bart Van Assche
  1 sibling, 0 replies; 45+ messages in thread
From: Bart Van Assche @ 2020-06-30 14:55 UTC (permalink / raw)
  To: John Garry, axboe, linux-block
  Cc: Hannes Reinecke, jejb, martin.petersen, don.brace, kashyap.desai,
	sumit.saxena, ming.lei, hare, hch, shivasharan.srikanteshwara,
	linux-scsi, esc.storagedev, chenxiang66, megaraidlinux.pdl

On 2020-06-29 08:32, John Garry wrote:
> I noticed that sbitmap_bitmap_show() only shows set bits and does not
> consider cleared bits. Is that the proper thing to do?
> 
> I ask, as from trying to support sbitmap_bitmap_show() for hostwide
> shared tags feature, we currently use blk_mq_queue_tag_busy_iter() to
> find active requests (and associated tags/bits) for a particular hctx.
> So, AFAICT, would give a change in behavior for sbitmap_bitmap_show(),
> in that it would effectively show set and not cleared bits.
> 
> Any thoughts on this?

Probably this is something that got overlooked when the cleared bits
were introduced? See e.g. 8c2def893afc ("sbitmap: fix
sbitmap_for_each_set()").

Bart.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* RE: [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters to MQ
  2020-06-10 17:29 ` [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters " John Garry
@ 2020-07-02 10:23   ` Kashyap Desai
  0 siblings, 0 replies; 45+ messages in thread
From: Kashyap Desai @ 2020-07-02 10:23 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, don.brace,
	Sumit Saxena, ming.lei, bvanassche, hare, hch,
	Shivasharan Srikanteshwara
  Cc: linux-block, linux-scsi, esc.storagedev, chenxiang66, PDL,MEGARAIDLINUX

>
> From: Hannes Reinecke <hare@suse.com>
>
> Fusion adapters can steer completions to individual queues, and we now
> have support for shared host-wide tags.
> So we can enable multiqueue support for fusion adapters and drop the
> hand-crafted interrupt affinity settings.

Shared host tags were primarily introduced for CPU hotplug completeness, as
discussed earlier -
https://lwn.net/Articles/819419/

How shall I test CPU hotplug on the megaraid_sas driver? My understanding
is that this RFC plus the patch set from the above link is required for it.
I could not see that the above series has been committed.
Am I missing anything?
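
(For the test itself, I assume something along the lines of running fio
against the adapter while offlining and onlining CPUs, e.g.:

	echo 0 > /sys/devices/system/cpu/cpu<N>/online

and checking that in-flight IO completes and no timeouts occur.)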

We do not want to move completely to shared host tags. Shared host tag
support will be the default, but the user should have the choice to go back
to the legacy path.
We will move completely to the shared host tag path once it is stable and
no more field issues have been observed over a period of time.

The updated <megaraid_sas> patch will look like this -

diff --git a/megaraid_sas_base.c b/megaraid_sas_base.c
index 0066833..3b503cb 100644
--- a/megaraid_sas_base.c
+++ b/megaraid_sas_base.c
@@ -37,6 +37,7 @@
 #include <linux/poll.h>
 #include <linux/vmalloc.h>
 #include <linux/irq_poll.h>
+#include <linux/blk-mq-pci.h>

 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -113,6 +114,10 @@ unsigned int enable_sdev_max_qd;
 module_param(enable_sdev_max_qd, int, 0444);
 MODULE_PARM_DESC(enable_sdev_max_qd, "Enable sdev max qd as can_queue. Default: 0");

+int host_tagset_disabled = 0;
+module_param(host_tagset_disabled, int, 0444);
+MODULE_PARM_DESC(host_tagset_disabled, "Shared host tagset enable/disable Default: enable(1)");
+
 MODULE_LICENSE("GPL");
 MODULE_VERSION(MEGASAS_VERSION);
 MODULE_AUTHOR("megaraidlinux.pdl@broadcom.com");
@@ -3115,6 +3120,18 @@ megasas_bios_param(struct scsi_device *sdev, struct block_device *bdev,
        return 0;
 }

+static int megasas_map_queues(struct Scsi_Host *shost)
+{
+       struct megasas_instance *instance;
+       instance = (struct megasas_instance *)shost->hostdata;
+
+       if (instance->host->nr_hw_queues == 1)
+               return 0;
+
+       return blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
+                       instance->pdev, instance->low_latency_index_start);
+}
+
 static void megasas_aen_polling(struct work_struct *work);

 /**
@@ -3423,8 +3440,10 @@ static struct scsi_host_template megasas_template = {
        .eh_timed_out = megasas_reset_timer,
        .shost_attrs = megaraid_host_attrs,
        .bios_param = megasas_bios_param,
+       .map_queues = megasas_map_queues,
        .change_queue_depth = scsi_change_queue_depth,
        .max_segment_size = 0xffffffff,
+       .host_tagset = 1,
 };

 /**
@@ -6793,7 +6812,21 @@ static int megasas_io_attach(struct megasas_instance *instance)
        host->max_id = MEGASAS_MAX_DEV_PER_CHANNEL;
        host->max_lun = MEGASAS_MAX_LUN;
        host->max_cmd_len = 16;
+       host->nr_hw_queues = 1;

+       /* Use shared host tagset only for fusion adaptors
+        * if there are more than one managed interrupts.
+        */
+       if ((instance->adapter_type != MFI_SERIES) &&
+               (instance->msix_vectors > 0) &&
+               !host_tagset_disabled &&
+               instance->smp_affinity_enable)
+               host->nr_hw_queues = instance->msix_vectors -
+                       instance->low_latency_index_start;
+
+       dev_info(&instance->pdev->dev, "Max firmware commands: %d"
+               " for nr_hw_queues = %d\n", instance->max_fw_cmds,
+               host->nr_hw_queues);
        /*
         * Notify the mid-layer about the new controller
         */
@@ -8842,6 +8875,7 @@ static int __init megasas_init(void)
                msix_vectors = 1;
                rdpq_enable = 0;
                dual_qdepth_disable = 1;
+               host_tagset_disabled = 1;
        }

        /*
diff --git a/megaraid_sas_fusion.c b/megaraid_sas_fusion.c
index 319f241..14d4f35 100755
--- a/megaraid_sas_fusion.c
+++ b/megaraid_sas_fusion.c
@@ -373,24 +373,28 @@ megasas_get_msix_index(struct megasas_instance *instance,
 {
        int sdev_busy;

-       /* nr_hw_queue = 1 for MegaRAID */
-       struct blk_mq_hw_ctx *hctx =
-               scmd->device->request_queue->queue_hw_ctx[0];
-
-       sdev_busy = atomic_read(&hctx->nr_active);
+       /* TBD - if sml remove device_busy in future, driver
+        * should track counter in internal structure.
+        */
+       sdev_busy = atomic_read(&scmd->device->device_busy);

        if (instance->perf_mode == MR_BALANCED_PERF_MODE &&
-           sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH))
+           sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH)) {
                cmd->request_desc->SCSIIO.MSIxIndex =
                        mega_mod64((atomic64_add_return(1, &instance->high_iops_outstanding) /
                                        MR_HIGH_IOPS_BATCH_COUNT), instance->low_latency_index_start);
-       else if (instance->msix_load_balance)
+       } else if (instance->msix_load_balance) {
                cmd->request_desc->SCSIIO.MSIxIndex =
                        (mega_mod64(atomic64_add_return(1, &instance->total_io_count),
                                instance->msix_vectors));
-       else
+       } else if (instance->host->nr_hw_queues > 1) {
+               u32 tag = blk_mq_unique_tag(scmd->request);
+               cmd->request_desc->SCSIIO.MSIxIndex = blk_mq_unique_tag_to_hwq(tag) +
+                       instance->low_latency_index_start;
+       } else {
                cmd->request_desc->SCSIIO.MSIxIndex =
                        instance->reply_map[raw_smp_processor_id()];
+       }
 }

 /**
@@ -970,9 +974,6 @@ megasas_alloc_cmds_fusion(struct megasas_instance
*instance)
        if (megasas_alloc_cmdlist_fusion(instance))
                goto fail_exit;

-       dev_info(&instance->pdev->dev, "Configured max firmware commands: %d\n",
-                instance->max_fw_cmds);
-
        /* The first 256 bytes (SMID 0) is not used. Don't add to the cmd list */
        io_req_base = fusion->io_request_frames + MEGA_MPI2_RAID_DEFAULT_IO_FRAME_SIZE;
        io_req_base_phys = fusion->io_request_frames_phys + MEGA_MPI2_RAID_DEFAULT_IO_FRAME_SIZE;
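
With the above (assuming the host_tagset_disabled parameter goes in as
shown), the shared host tagset could be disabled at load time with
something like:

	modprobe megaraid_sas host_tagset_disabled=1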

Kashyap

>
> Signed-off-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  drivers/scsi/megaraid/megaraid_sas.h        |  1 -
>  drivers/scsi/megaraid/megaraid_sas_base.c   | 59 +++++++--------------
>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 24 +++++----
>  3 files changed, 32 insertions(+), 52 deletions(-)
>
> diff --git a/drivers/scsi/megaraid/megaraid_sas.h
> b/drivers/scsi/megaraid/megaraid_sas.h
> index af2c7a2a9565..b27a34a5f5de 100644
> --- a/drivers/scsi/megaraid/megaraid_sas.h
> +++ b/drivers/scsi/megaraid/megaraid_sas.h
> @@ -2261,7 +2261,6 @@ enum MR_PERF_MODE {
>
>  struct megasas_instance {
>
> -	unsigned int *reply_map;
>  	__le32 *producer;
>  	dma_addr_t producer_h;
>  	__le32 *consumer;
> diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c
> b/drivers/scsi/megaraid/megaraid_sas_base.c
> index 00668335c2af..e6bb2a64d51c 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_base.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_base.c
> @@ -37,6 +37,7 @@
>  #include <linux/poll.h>
>  #include <linux/vmalloc.h>
>  #include <linux/irq_poll.h>
> +#include <linux/blk-mq-pci.h>
>
>  #include <scsi/scsi.h>
>  #include <scsi/scsi_cmnd.h>
> @@ -3115,6 +3116,19 @@ megasas_bios_param(struct scsi_device *sdev,
> struct block_device *bdev,
>  	return 0;
>  }
>
> +static int megasas_map_queues(struct Scsi_Host *shost) {
> +	struct megasas_instance *instance;
> +
> +	instance = (struct megasas_instance *)shost->hostdata;
> +
> +	if (!instance->smp_affinity_enable)
> +		return 0;
> +
> +	return blk_mq_pci_map_queues(&shost-
> >tag_set.map[HCTX_TYPE_DEFAULT],
> +			instance->pdev,
instance->low_latency_index_start);
> +}
> +
>  static void megasas_aen_polling(struct work_struct *work);
>
>  /**
> @@ -3423,8 +3437,10 @@ static struct scsi_host_template
> megasas_template = {
>  	.eh_timed_out = megasas_reset_timer,
>  	.shost_attrs = megaraid_host_attrs,
>  	.bios_param = megasas_bios_param,
> +	.map_queues = megasas_map_queues,
>  	.change_queue_depth = scsi_change_queue_depth,
>  	.max_segment_size = 0xffffffff,
> +	.host_tagset = 1,
>  };
>
>  /**
> @@ -5708,34 +5724,6 @@ megasas_setup_jbod_map(struct
> megasas_instance *instance)
>  		instance->use_seqnum_jbod_fp = false;  }
>
> -static void megasas_setup_reply_map(struct megasas_instance *instance)
-{
> -	const struct cpumask *mask;
> -	unsigned int queue, cpu, low_latency_index_start;
> -
> -	low_latency_index_start = instance->low_latency_index_start;
> -
> -	for (queue = low_latency_index_start; queue < instance-
> >msix_vectors; queue++) {
> -		mask = pci_irq_get_affinity(instance->pdev, queue);
> -		if (!mask)
> -			goto fallback;
> -
> -		for_each_cpu(cpu, mask)
> -			instance->reply_map[cpu] = queue;
> -	}
> -	return;
> -
> -fallback:
> -	queue = low_latency_index_start;
> -	for_each_possible_cpu(cpu) {
> -		instance->reply_map[cpu] = queue;
> -		if (queue == (instance->msix_vectors - 1))
> -			queue = low_latency_index_start;
> -		else
> -			queue++;
> -	}
> -}
> -
>  /**
>   * megasas_get_device_list -	Get the PD and LD device list from FW.
>   * @instance:			Adapter soft state
> @@ -6158,8 +6146,6 @@ static int megasas_init_fw(struct megasas_instance
> *instance)
>  			goto fail_init_adapter;
>  	}
>
> -	megasas_setup_reply_map(instance);
> -
>  	dev_info(&instance->pdev->dev,
>  		"current msix/online cpus\t: (%d/%d)\n",
>  		instance->msix_vectors, (unsigned int)num_online_cpus());
> @@ -6793,6 +6779,9 @@ static int megasas_io_attach(struct
> megasas_instance *instance)
>  	host->max_id = MEGASAS_MAX_DEV_PER_CHANNEL;
>  	host->max_lun = MEGASAS_MAX_LUN;
>  	host->max_cmd_len = 16;
> +	if (instance->adapter_type != MFI_SERIES && instance->msix_vectors
> > 0)
> +		host->nr_hw_queues = instance->msix_vectors -
> +			instance->low_latency_index_start;
>
>  	/*
>  	 * Notify the mid-layer about the new controller @@ -6960,11
> +6949,6 @@ static inline int megasas_alloc_mfi_ctrl_mem(struct
> megasas_instance *instance)
>   */
>  static int megasas_alloc_ctrl_mem(struct megasas_instance *instance)  {
> -	instance->reply_map = kcalloc(nr_cpu_ids, sizeof(unsigned int),
> -				      GFP_KERNEL);
> -	if (!instance->reply_map)
> -		return -ENOMEM;
> -
>  	switch (instance->adapter_type) {
>  	case MFI_SERIES:
>  		if (megasas_alloc_mfi_ctrl_mem(instance))
> @@ -6981,8 +6965,6 @@ static int megasas_alloc_ctrl_mem(struct
> megasas_instance *instance)
>
>  	return 0;
>   fail:
> -	kfree(instance->reply_map);
> -	instance->reply_map = NULL;
>  	return -ENOMEM;
>  }
>
> @@ -6995,7 +6977,6 @@ static int megasas_alloc_ctrl_mem(struct
> megasas_instance *instance)
>   */
>  static inline void megasas_free_ctrl_mem(struct megasas_instance
> *instance)  {
> -	kfree(instance->reply_map);
>  	if (instance->adapter_type == MFI_SERIES) {
>  		if (instance->producer)
>  			dma_free_coherent(&instance->pdev->dev,
> sizeof(u32), @@ -7683,8 +7664,6 @@ megasas_resume(struct pci_dev
> *pdev)
>  			goto fail_reenable_msix;
>  	}
>
> -	megasas_setup_reply_map(instance);
> -
>  	if (instance->adapter_type != MFI_SERIES) {
>  		megasas_reset_reply_desc(instance);
>  		if (megasas_ioc_init_fusion(instance)) { diff --git
> a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> index 319f241da4b6..8e25b700988e 100644
> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> @@ -373,24 +373,24 @@ megasas_get_msix_index(struct megasas_instance
> *instance,  {
>  	int sdev_busy;
>
> -	/* nr_hw_queue = 1 for MegaRAID */
> -	struct blk_mq_hw_ctx *hctx =
> -		scmd->device->request_queue->queue_hw_ctx[0];
> +	struct blk_mq_hw_ctx *hctx = scmd->request->mq_hctx;
>
>  	sdev_busy = atomic_read(&hctx->nr_active);
>
>  	if (instance->perf_mode == MR_BALANCED_PERF_MODE &&
> -	    sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH))
> +	    sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH)) {
>  		cmd->request_desc->SCSIIO.MSIxIndex =
>  			mega_mod64((atomic64_add_return(1, &instance-
> >high_iops_outstanding) /
>  					MR_HIGH_IOPS_BATCH_COUNT),
> instance->low_latency_index_start);
> -	else if (instance->msix_load_balance)
> +	} else if (instance->msix_load_balance) {
>  		cmd->request_desc->SCSIIO.MSIxIndex =
>  			(mega_mod64(atomic64_add_return(1, &instance-
> >total_io_count),
>  				instance->msix_vectors));
> -	else
> -		cmd->request_desc->SCSIIO.MSIxIndex =
> -			instance->reply_map[raw_smp_processor_id()];
> +	} else {
> +		u32 tag = blk_mq_unique_tag(scmd->request);
> +
> +		cmd->request_desc->SCSIIO.MSIxIndex =
> blk_mq_unique_tag_to_hwq(tag) + instance->low_latency_index_start;
> +	}
>  }
>
>  /**
> @@ -3326,7 +3326,7 @@ megasas_build_and_issue_cmd_fusion(struct
> megasas_instance *instance,  {
>  	struct megasas_cmd_fusion *cmd, *r1_cmd = NULL;
>  	union MEGASAS_REQUEST_DESCRIPTOR_UNION *req_desc;
> -	u32 index;
> +	u32 index, blk_tag, unique_tag;
>
>  	if ((megasas_cmd_type(scmd) == READ_WRITE_LDIO) &&
>  		instance->ldio_threshold &&
> @@ -3342,7 +3342,9 @@ megasas_build_and_issue_cmd_fusion(struct
> megasas_instance *instance,
>  		return SCSI_MLQUEUE_HOST_BUSY;
>  	}
>
> -	cmd = megasas_get_cmd_fusion(instance, scmd->request->tag);
> +	unique_tag = blk_mq_unique_tag(scmd->request);
> +	blk_tag = blk_mq_unique_tag_to_tag(unique_tag);
> +	cmd = megasas_get_cmd_fusion(instance, blk_tag);
>
>  	if (!cmd) {
>  		atomic_dec(&instance->fw_outstanding);
> @@ -3383,7 +3385,7 @@ megasas_build_and_issue_cmd_fusion(struct
> megasas_instance *instance,
>  	 */
>  	if (cmd->r1_alt_dev_handle != MR_DEVHANDLE_INVALID) {
>  		r1_cmd = megasas_get_cmd_fusion(instance,
> -				(scmd->request->tag + instance-
> >max_fw_cmds));
> +				(blk_tag + instance->max_fw_cmds));
>  		megasas_prepare_secondRaid1_IO(instance, cmd, r1_cmd);
>  	}
>
> --
> 2.26.2

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, back to index

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-10 17:29 [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
2020-06-10 17:29 ` [PATCH RFC v7 01/12] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
2020-06-10 17:29 ` [PATCH RFC v7 02/12] blk-mq: rename blk_mq_update_tag_set_depth() John Garry
2020-06-11  2:57   ` Ming Lei
2020-06-11  8:26     ` John Garry
2020-06-23 11:25       ` John Garry
2020-06-23 14:23         ` Hannes Reinecke
2020-06-24  8:13           ` Kashyap Desai
2020-06-29 16:18             ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 03/12] blk-mq: Use pointers for blk_mq_tags bitmap tags John Garry
2020-06-10 17:29 ` [PATCH RFC v7 04/12] blk-mq: Facilitate a shared sbitmap per tagset John Garry
2020-06-11  3:37   ` Ming Lei
2020-06-11 10:09     ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 05/12] blk-mq: Record nr_active_requests per queue for when using shared sbitmap John Garry
2020-06-11  4:04   ` Ming Lei
2020-06-11 10:22     ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 06/12] blk-mq: Record active_queues_shared_sbitmap per tag_set " John Garry
2020-06-11 13:16   ` Hannes Reinecke
2020-06-11 14:22     ` John Garry
2020-06-10 17:29 ` [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a " John Garry
2020-06-11 13:19   ` Hannes Reinecke
2020-06-11 14:33     ` John Garry
2020-06-12  6:06       ` Hannes Reinecke
2020-06-29 15:32         ` About sbitmap_bitmap_show() and cleared bits (was Re: [PATCH RFC v7 07/12] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap) John Garry
2020-06-30  6:33           ` Hannes Reinecke
2020-06-30  7:30             ` John Garry
2020-06-30 11:36               ` John Garry
2020-06-30 14:55           ` Bart Van Assche
2020-06-10 17:29 ` [PATCH RFC v7 08/12] scsi: Add template flag 'host_tagset' John Garry
2020-06-10 17:29 ` [PATCH RFC v7 09/12] scsi: hisi_sas: Switch v3 hw to MQ John Garry
2020-06-10 17:29 ` [PATCH RFC v7 10/12] megaraid_sas: switch fusion adapters " John Garry
2020-07-02 10:23   ` Kashyap Desai
2020-06-10 17:29 ` [PATCH RFC v7 11/12] smartpqi: enable host tagset John Garry
2020-06-10 17:29 ` [PATCH RFC v7 12/12] hpsa: enable host_tagset and switch to MQ John Garry
2020-06-11  3:07 ` [PATCH RFC v7 00/12] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs Ming Lei
2020-06-11  9:35   ` John Garry
2020-06-12 18:47     ` Kashyap Desai
2020-06-15  2:13       ` Ming Lei
2020-06-15  6:57         ` Kashyap Desai
2020-06-16  1:00           ` Ming Lei
2020-06-17 11:26             ` Kashyap Desai
2020-06-22  6:24               ` Hannes Reinecke
2020-06-23  0:55                 ` Ming Lei
2020-06-23 11:50                   ` Kashyap Desai
2020-06-23 12:11                   ` Kashyap Desai

Linux-Block Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-block/0 linux-block/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-block linux-block/ https://lore.kernel.org/linux-block \
		linux-block@vger.kernel.org
	public-inbox-index linux-block

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-block


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git