* [PATCH RFC v6 00/10] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
@ 2020-03-05 11:54 John Garry
  2020-03-05 11:54 ` [PATCH RFC v6 01/10] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
                   ` (9 more replies)
  0 siblings, 10 replies; 35+ messages in thread
From: John Garry @ 2020-03-05 11:54 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, John Garry

Hi all,

Here is v6 of the patchset.

In this version of the series, we keep the shared sbitmap for driver tags,
but omit the shared scheduler tags for now - I did not think that this was
an essential feature and the premise was somewhat in doubt.

Comments welcome, thanks!

A copy of the patches can be found here:
https://github.com/hisilicon/kernel-dev/tree/private-topic-blk-mq-shared-tags-rfc-v6-upstream

Differences to v5:
- For now, drop the shared scheduler tags
- Fix megaraid SAS queue selection and rebase
- Omit minor unused arguments patch, which has now been merged
- Add separate patch to introduce sbitmap pointer
- Fixed hctx_tags_bitmap_show() for shared sbitmap
- Various tidying

Differences to v4:
- Rework scheduler tag allocations
- Revert back to the original implementation from John

Differences to v3:
- Include reviews from Ming Lei

Differences to v2:
- Drop embedded tag bitmaps
- Do not share scheduling tags
- Add patches for hpsa and smartpqi

Differences to v1:
- Use a shared sbitmap, and not a separate shared tags (a big change!)
	- Drop request.shared_tag
- Add RB tags

Hannes Reinecke (5):
  blk-mq: rename blk_mq_update_tag_set_depth()
  scsi: Add template flag 'host_tagset'
  megaraid_sas: switch fusion adapters to MQ
  smartpqi: enable host tagset
  hpsa: enable host_tagset and switch to MQ

John Garry (4):
  blk-mq: Use pointers for blk_mq_tags bitmap tags
  blk-mq: Facilitate a shared sbitmap per tagset
  blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  scsi: hisi_sas: Switch v3 hw to MQ

Ming Lei (1):
  blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED

 block/bfq-iosched.c                         |  4 +-
 block/blk-mq-debugfs.c                      | 65 ++++++++++++--
 block/blk-mq-tag.c                          | 97 ++++++++++++++-------
 block/blk-mq-tag.h                          | 21 +++--
 block/blk-mq.c                              | 60 +++++++++----
 block/blk-mq.h                              |  5 ++
 block/kyber-iosched.c                       |  4 +-
 drivers/scsi/hisi_sas/hisi_sas.h            |  3 +-
 drivers/scsi/hisi_sas/hisi_sas_main.c       | 36 ++++----
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c      | 87 ++++++++----------
 drivers/scsi/hpsa.c                         | 44 ++--------
 drivers/scsi/hpsa.h                         |  1 -
 drivers/scsi/megaraid/megaraid_sas.h        |  1 -
 drivers/scsi/megaraid/megaraid_sas_base.c   | 59 ++++---------
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 24 ++---
 drivers/scsi/scsi_lib.c                     |  2 +
 drivers/scsi/smartpqi/smartpqi_init.c       | 38 +++++---
 include/linux/blk-mq.h                      |  5 +-
 include/scsi/scsi_host.h                    |  3 +
 19 files changed, 325 insertions(+), 234 deletions(-)

-- 
2.17.1


* [PATCH RFC v6 01/10] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED
  2020-03-05 11:54 [PATCH RFC v6 00/10] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
@ 2020-03-05 11:54 ` John Garry
  2020-03-05 11:54 ` [PATCH RFC v6 02/10] blk-mq: rename blk_mq_update_tag_set_depth() John Garry
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: John Garry @ 2020-03-05 11:54 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, John Garry

From: Ming Lei <ming.lei@redhat.com>

BLK_MQ_F_TAG_SHARED actually means that the tag set is shared among request
queues, all of which should belong to LUNs attached to the same HBA.

So rename it to make the point explicit.

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-debugfs.c |  2 +-
 block/blk-mq-tag.c     |  2 +-
 block/blk-mq-tag.h     |  4 ++--
 block/blk-mq.c         | 20 ++++++++++----------
 include/linux/blk-mq.h |  2 +-
 5 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..33a40ae1d60f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -236,7 +236,7 @@ static const char *const alloc_policy_name[] = {
 #define HCTX_FLAG_NAME(name) [ilog2(BLK_MQ_F_##name)] = #name
 static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(SHOULD_MERGE),
-	HCTX_FLAG_NAME(TAG_SHARED),
+	HCTX_FLAG_NAME(TAG_QUEUE_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6e904a..42792942b428 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -65,7 +65,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 {
 	unsigned int depth, users;
 
-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
+	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return true;
 	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
 		return true;
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 2b8321efb682..52658dc5cea6 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -54,7 +54,7 @@ extern void __blk_mq_tag_idle(struct blk_mq_hw_ctx *);
 
 static inline bool blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return false;
 
 	return __blk_mq_tag_busy(hctx);
@@ -62,7 +62,7 @@ static inline bool blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 
 static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return;
 
 	__blk_mq_tag_idle(hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d92088dec6c3..80c81da62174 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -280,7 +280,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		rq->tag = -1;
 		rq->internal_tag = tag;
 	} else {
-		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
+		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
 			rq_flags = RQF_MQ_INFLIGHT;
 			atomic_inc(&data->hctx->nr_active);
 		}
@@ -1091,7 +1091,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 	wait_queue_entry_t *wait;
 	bool ret;
 
-	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED)) {
+	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
 
 		/*
@@ -1222,7 +1222,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 				 * For non-shared tags, the RESTART check
 				 * will suffice.
 				 */
-				if (hctx->flags & BLK_MQ_F_TAG_SHARED)
+				if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
 					no_tag = true;
 				break;
 			}
@@ -2387,7 +2387,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	spin_lock_init(&hctx->lock);
 	INIT_LIST_HEAD(&hctx->dispatch);
 	hctx->queue = q;
-	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
+	hctx->flags = set->flags & ~BLK_MQ_F_TAG_QUEUE_SHARED;
 
 	INIT_LIST_HEAD(&hctx->hctx_list);
 
@@ -2604,9 +2604,9 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (shared)
-			hctx->flags |= BLK_MQ_F_TAG_SHARED;
+			hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		else
-			hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
+			hctx->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 	}
 }
 
@@ -2632,7 +2632,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 	list_del_rcu(&q->tag_set_list);
 	if (list_is_singular(&set->tag_list)) {
 		/* just transitioned to unshared */
-		set->flags &= ~BLK_MQ_F_TAG_SHARED;
+		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
 		blk_mq_update_tag_set_depth(set, false);
 	}
@@ -2649,12 +2649,12 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
 	 */
 	if (!list_empty(&set->tag_list) &&
-	    !(set->flags & BLK_MQ_F_TAG_SHARED)) {
-		set->flags |= BLK_MQ_F_TAG_SHARED;
+	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
+		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
 		blk_mq_update_tag_set_depth(set, true);
 	}
-	if (set->flags & BLK_MQ_F_TAG_SHARED)
+	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
 		queue_set_hctx_shared(q, true);
 	list_add_tail_rcu(&q->tag_set_list, &set->tag_list);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 11cfd6470b1a..ab1618007600 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -387,7 +387,7 @@ struct blk_mq_ops {
 
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
-	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_TAG_QUEUE_SHARED	= 1 << 1,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
-- 
2.17.1


* [PATCH RFC v6 02/10] blk-mq: rename blk_mq_update_tag_set_depth()
  2020-03-05 11:54 [PATCH RFC v6 00/10] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
  2020-03-05 11:54 ` [PATCH RFC v6 01/10] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
@ 2020-03-05 11:54 ` John Garry
  2020-03-05 11:54 ` [PATCH RFC v6 03/10] blk-mq: Use pointers for blk_mq_tags bitmap tags John Garry
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 35+ messages in thread
From: John Garry @ 2020-03-05 11:54 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, John Garry

From: Hannes Reinecke <hare@suse.de>

The function does not set the depth, but rather transitions from
shared to non-shared queues and vice versa.
So rename it to blk_mq_update_tag_set_shared() to better reflect
its purpose.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-tag.c | 18 ++++++++++--------
 block/blk-mq.c     |  8 ++++----
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 42792942b428..4643ad86cf8b 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -432,24 +432,22 @@ static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
 				       node);
 }
 
-static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
-						   int node, int alloc_policy)
+static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
+				   int node, int alloc_policy)
 {
 	unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
 	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
 
 	if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node))
-		goto free_tags;
+		return -ENOMEM;
 	if (bt_alloc(&tags->breserved_tags, tags->nr_reserved_tags, round_robin,
 		     node))
 		goto free_bitmap_tags;
 
-	return tags;
+	return 0;
 free_bitmap_tags:
 	sbitmap_queue_free(&tags->bitmap_tags);
-free_tags:
-	kfree(tags);
-	return NULL;
+	return -ENOMEM;
 }
 
 struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
@@ -470,7 +468,11 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
 
-	return blk_mq_init_bitmap_tags(tags, node, alloc_policy);
+	if (blk_mq_init_bitmap_tags(tags, node, alloc_policy) < 0) {
+		kfree(tags);
+		tags = NULL;
+	}
+	return tags;
 }
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 80c81da62174..9e350c646141 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2610,8 +2610,8 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 	}
 }
 
-static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set,
-					bool shared)
+static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
+					 bool shared)
 {
 	struct request_queue *q;
 
@@ -2634,7 +2634,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 		/* just transitioned to unshared */
 		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
-		blk_mq_update_tag_set_depth(set, false);
+		blk_mq_update_tag_set_shared(set, false);
 	}
 	mutex_unlock(&set->tag_list_lock);
 	INIT_LIST_HEAD(&q->tag_set_list);
@@ -2652,7 +2652,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
 		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
-		blk_mq_update_tag_set_depth(set, true);
+		blk_mq_update_tag_set_shared(set, true);
 	}
 	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
 		queue_set_hctx_shared(q, true);
-- 
2.17.1


* [PATCH RFC v6 03/10] blk-mq: Use pointers for blk_mq_tags bitmap tags
  2020-03-05 11:54 [PATCH RFC v6 00/10] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
  2020-03-05 11:54 ` [PATCH RFC v6 01/10] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
  2020-03-05 11:54 ` [PATCH RFC v6 02/10] blk-mq: rename blk_mq_update_tag_set_depth() John Garry
@ 2020-03-05 11:54 ` John Garry
  2020-03-05 12:42   ` Hannes Reinecke
  2020-03-05 11:54 ` [PATCH RFC v6 04/10] blk-mq: Facilitate a shared sbitmap per tagset John Garry
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-03-05 11:54 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, John Garry

Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
with the goal of later being able to use a common shared tag bitmap across
all HW contexts in a set.
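
To illustrate the indirection (a sketch only, not part of the patch
itself): the pointers normally refer back to the tags' own embedded
sbitmaps, so nothing changes for existing users, while later code can
repoint them at a set-wide pair. The example_* helper below is made up:

/* Illustrative sketch, assuming the struct blk_mq_tags layout above */
static void example_assign_bitmaps(struct blk_mq_tags *tags,
                                   struct sbitmap_queue *shared_bt,
                                   struct sbitmap_queue *shared_brt)
{
        if (shared_bt) {
                /* hypothetical set-wide case, added by a later patch */
                tags->bitmap_tags = shared_bt;
                tags->breserved_tags = shared_brt;
        } else {
                /* default: keep using the embedded per-tags bitmaps */
                tags->bitmap_tags = &tags->__bitmap_tags;
                tags->breserved_tags = &tags->__breserved_tags;
        }
}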

Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/bfq-iosched.c    |  4 ++--
 block/blk-mq-debugfs.c |  8 ++++----
 block/blk-mq-tag.c     | 41 ++++++++++++++++++++++-------------------
 block/blk-mq-tag.h     |  7 +++++--
 block/blk-mq.c         |  4 ++--
 block/kyber-iosched.c  |  4 ++--
 6 files changed, 37 insertions(+), 31 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 8c436abfaf14..d1b88ebb58d6 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -6364,8 +6364,8 @@ static void bfq_depth_updated(struct blk_mq_hw_ctx *hctx)
 	struct blk_mq_tags *tags = hctx->sched_tags;
 	unsigned int min_shallow;
 
-	min_shallow = bfq_update_depths(bfqd, &tags->bitmap_tags);
-	sbitmap_queue_min_shallow_depth(&tags->bitmap_tags, min_shallow);
+	min_shallow = bfq_update_depths(bfqd, tags->bitmap_tags);
+	sbitmap_queue_min_shallow_depth(tags->bitmap_tags, min_shallow);
 }
 
 static int bfq_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int index)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 33a40ae1d60f..fca87470e267 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -449,11 +449,11 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
 		   atomic_read(&tags->active_queues));
 
 	seq_puts(m, "\nbitmap_tags:\n");
-	sbitmap_queue_show(&tags->bitmap_tags, m);
+	sbitmap_queue_show(tags->bitmap_tags, m);
 
 	if (tags->nr_reserved_tags) {
 		seq_puts(m, "\nbreserved_tags:\n");
-		sbitmap_queue_show(&tags->breserved_tags, m);
+		sbitmap_queue_show(tags->breserved_tags, m);
 	}
 }
 
@@ -484,7 +484,7 @@ static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
 	if (res)
 		goto out;
 	if (hctx->tags)
-		sbitmap_bitmap_show(&hctx->tags->bitmap_tags.sb, m);
+		sbitmap_bitmap_show(&hctx->tags->bitmap_tags->sb, m);
 	mutex_unlock(&q->sysfs_lock);
 
 out:
@@ -518,7 +518,7 @@ static int hctx_sched_tags_bitmap_show(void *data, struct seq_file *m)
 	if (res)
 		goto out;
 	if (hctx->sched_tags)
-		sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags.sb, m);
+		sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags->sb, m);
 	mutex_unlock(&q->sysfs_lock);
 
 out:
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 4643ad86cf8b..b153ecf8c5c6 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -35,9 +35,9 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
  */
 void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
 {
-	sbitmap_queue_wake_all(&tags->bitmap_tags);
+	sbitmap_queue_wake_all(tags->bitmap_tags);
 	if (include_reserve)
-		sbitmap_queue_wake_all(&tags->breserved_tags);
+		sbitmap_queue_wake_all(tags->breserved_tags);
 }
 
 /*
@@ -113,10 +113,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 			WARN_ON_ONCE(1);
 			return BLK_MQ_TAG_FAIL;
 		}
-		bt = &tags->breserved_tags;
+		bt = tags->breserved_tags;
 		tag_offset = 0;
 	} else {
-		bt = &tags->bitmap_tags;
+		bt = tags->bitmap_tags;
 		tag_offset = tags->nr_reserved_tags;
 	}
 
@@ -162,9 +162,9 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 						data->ctx);
 		tags = blk_mq_tags_from_data(data);
 		if (data->flags & BLK_MQ_REQ_RESERVED)
-			bt = &tags->breserved_tags;
+			bt = tags->breserved_tags;
 		else
-			bt = &tags->bitmap_tags;
+			bt = tags->bitmap_tags;
 
 		/*
 		 * If destination hw queue is changed, fake wake up on
@@ -190,10 +190,10 @@ void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
 		const int real_tag = tag - tags->nr_reserved_tags;
 
 		BUG_ON(real_tag >= tags->nr_tags);
-		sbitmap_queue_clear(&tags->bitmap_tags, real_tag, ctx->cpu);
+		sbitmap_queue_clear(tags->bitmap_tags, real_tag, ctx->cpu);
 	} else {
 		BUG_ON(tag >= tags->nr_reserved_tags);
-		sbitmap_queue_clear(&tags->breserved_tags, tag, ctx->cpu);
+		sbitmap_queue_clear(tags->breserved_tags, tag, ctx->cpu);
 	}
 }
 
@@ -321,8 +321,8 @@ static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
 		busy_tag_iter_fn *fn, void *priv)
 {
 	if (tags->nr_reserved_tags)
-		bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
-	bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
+		bt_tags_for_each(tags, tags->breserved_tags, fn, priv, true);
+	bt_tags_for_each(tags, tags->bitmap_tags, fn, priv, false);
 }
 
 /**
@@ -419,8 +419,8 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 			continue;
 
 		if (tags->nr_reserved_tags)
-			bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
-		bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
+			bt_for_each(hctx, tags->breserved_tags, fn, priv, true);
+		bt_for_each(hctx, tags->bitmap_tags, fn, priv, false);
 	}
 	blk_queue_exit(q);
 }
@@ -438,15 +438,18 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
 	unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
 	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
 
-	if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node))
+	if (bt_alloc(&tags->__bitmap_tags, depth, round_robin, node))
 		return -ENOMEM;
-	if (bt_alloc(&tags->breserved_tags, tags->nr_reserved_tags, round_robin,
-		     node))
+	if (bt_alloc(&tags->__breserved_tags, tags->nr_reserved_tags,
+		     round_robin, node))
 		goto free_bitmap_tags;
 
+	tags->bitmap_tags = &tags->__bitmap_tags;
+	tags->breserved_tags = &tags->__breserved_tags;
+
 	return 0;
 free_bitmap_tags:
-	sbitmap_queue_free(&tags->bitmap_tags);
+	sbitmap_queue_free(&tags->__bitmap_tags);
 	return -ENOMEM;
 }
 
@@ -477,8 +480,8 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
 {
-	sbitmap_queue_free(&tags->bitmap_tags);
-	sbitmap_queue_free(&tags->breserved_tags);
+	sbitmap_queue_free(&tags->__bitmap_tags);
+	sbitmap_queue_free(&tags->__breserved_tags);
 	kfree(tags);
 }
 
@@ -528,7 +531,7 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 		 * Don't need (or can't) update reserved tags here, they
 		 * remain static and should never need resizing.
 		 */
-		sbitmap_queue_resize(&tags->bitmap_tags,
+		sbitmap_queue_resize(tags->bitmap_tags,
 				tdepth - tags->nr_reserved_tags);
 	}
 
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 52658dc5cea6..4f866a39b926 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -13,8 +13,11 @@ struct blk_mq_tags {
 
 	atomic_t active_queues;
 
-	struct sbitmap_queue bitmap_tags;
-	struct sbitmap_queue breserved_tags;
+	struct sbitmap_queue *bitmap_tags;
+	struct sbitmap_queue *breserved_tags;
+
+	struct sbitmap_queue __bitmap_tags;
+	struct sbitmap_queue __breserved_tags;
 
 	struct request **rqs;
 	struct request **static_rqs;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9e350c646141..375f5064c183 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1068,7 +1068,7 @@ static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
 		struct sbitmap_queue *sbq;
 
 		list_del_init(&wait->entry);
-		sbq = &hctx->tags->bitmap_tags;
+		sbq = hctx->tags->bitmap_tags;
 		atomic_dec(&sbq->ws_active);
 	}
 	spin_unlock(&hctx->dispatch_wait_lock);
@@ -1086,7 +1086,7 @@ static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
 static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 				 struct request *rq)
 {
-	struct sbitmap_queue *sbq = &hctx->tags->bitmap_tags;
+	struct sbitmap_queue *sbq = hctx->tags->bitmap_tags;
 	struct wait_queue_head *wq;
 	wait_queue_entry_t *wait;
 	bool ret;
diff --git a/block/kyber-iosched.c b/block/kyber-iosched.c
index 34dcea0ef637..a7a537501d70 100644
--- a/block/kyber-iosched.c
+++ b/block/kyber-iosched.c
@@ -359,7 +359,7 @@ static unsigned int kyber_sched_tags_shift(struct request_queue *q)
 	 * All of the hardware queues have the same depth, so we can just grab
 	 * the shift of the first one.
 	 */
-	return q->queue_hw_ctx[0]->sched_tags->bitmap_tags.sb.shift;
+	return q->queue_hw_ctx[0]->sched_tags->bitmap_tags->sb.shift;
 }
 
 static struct kyber_queue_data *kyber_queue_data_alloc(struct request_queue *q)
@@ -502,7 +502,7 @@ static int kyber_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
 	khd->batching = 0;
 
 	hctx->sched_data = khd;
-	sbitmap_queue_min_shallow_depth(&hctx->sched_tags->bitmap_tags,
+	sbitmap_queue_min_shallow_depth(hctx->sched_tags->bitmap_tags,
 					kqd->async_depth);
 
 	return 0;
-- 
2.17.1


* [PATCH RFC v6 04/10] blk-mq: Facilitate a shared sbitmap per tagset
  2020-03-05 11:54 [PATCH RFC v6 00/10] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (2 preceding siblings ...)
  2020-03-05 11:54 ` [PATCH RFC v6 03/10] blk-mq: Use pointers for blk_mq_tags bitmap tags John Garry
@ 2020-03-05 11:54 ` John Garry
  2020-03-05 12:49   ` Hannes Reinecke
  2020-03-05 11:54 ` [PATCH RFC v6 05/10] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap John Garry
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-03-05 11:54 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, John Garry

Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with a single hostwide tag space.

In addition, these drivers want to use interrupt assignment in
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completions to not be serviced when an
interrupt is shut down.

To solve that problem, Ming's patchset to drain hctxs should ensure that
no in-flight IOs are missed [1].

However, to take advantage of that patchset, we need to map the HBA HW
queues to blk mq hctx's; to do that, we need to expose the HBA HW queues.

In making that transition, the per-SCSI command request tags are no
longer unique per Scsi host - they are just unique per hctx. As such, the
HBA LLDD would have to generate this tag internally, which has a certain
performance overhead.

However, another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queue) commands. In [2], we removed the Scsi
host busy counter, which had stopped the LLDD from being sent more than
.can_queue commands; however, we should still ensure that the block layer
does not issue more than .can_queue commands to the Scsi host.

To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
which may be requested at init time.

New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset to indicate whether the shared sbitmap should be used.

Even when BLK_MQ_F_TAG_HCTX_SHARED is set, we still allocate a full set of
tags and requests per hctx; the reason for this is that if we only allocate
tags and requests for a single hctx - like hctx0 - we may break block
drivers which expect a request to be associated with a specific hctx, i.e.
not hctx0.

This is based on work originally from Ming Lei in [3] and from Bart's
suggestion in [4].

[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20191014015043.25029-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-scsi/20191025065855.6309-1-ming.lei@redhat.com/
[3] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
[4] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
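
For illustration, a minimal sketch of how a tag set would opt in to the
shared sbitmap (the example_* name is made up; the flag is the one added
by this patch):

#include <linux/blk-mq.h>
#include <linux/numa.h>
#include <linux/string.h>

/* Sketch: every hctx of this set allocates driver tags from one
 * set-wide sbitmap, so the LLDD never sees more than queue_depth
 * commands in flight. */
static int example_setup_tag_set(struct blk_mq_tag_set *set,
                                 const struct blk_mq_ops *ops,
                                 unsigned int nr_hw_queues,
                                 unsigned int can_queue)
{
        memset(set, 0, sizeof(*set));
        set->ops = ops;
        set->nr_hw_queues = nr_hw_queues;
        set->queue_depth = can_queue;   /* total, not per-hctx */
        set->numa_node = NUMA_NO_NODE;
        set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_TAG_HCTX_SHARED;

        return blk_mq_alloc_tag_set(set);
}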

Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-tag.c     | 36 ++++++++++++++++++++++++++++++++++--
 block/blk-mq-tag.h     | 10 +++++++++-
 block/blk-mq.c         | 28 ++++++++++++++++++++++++++--
 block/blk-mq.h         |  5 +++++
 include/linux/blk-mq.h |  3 +++
 5 files changed, 77 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index b153ecf8c5c6..dd704118fe83 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -220,7 +220,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
-	if (rq && rq->q == hctx->queue)
+	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
 		return iter_data->fn(hctx, rq, iter_data->data, reserved);
 	return true;
 }
@@ -444,6 +444,7 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
 		     round_robin, node))
 		goto free_bitmap_tags;
 
+	/* We later overwrite these in case of per-set shared sbitmap */
 	tags->bitmap_tags = &tags->__bitmap_tags;
 	tags->breserved_tags = &tags->__breserved_tags;
 
@@ -453,7 +454,32 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
 	return -ENOMEM;
 }
 
-struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
+bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set)
+{
+	unsigned int depth = tag_set->queue_depth - tag_set->reserved_tags;
+	int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(tag_set->flags);
+	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
+	int node = tag_set->numa_node;
+
+	if (bt_alloc(&tag_set->__bitmap_tags, depth, round_robin, node))
+		return false;
+	if (bt_alloc(&tag_set->__breserved_tags, tag_set->reserved_tags,
+		     round_robin, node))
+		goto free_bitmap_tags;
+	return true;
+free_bitmap_tags:
+	sbitmap_queue_free(&tag_set->__bitmap_tags);
+	return false;
+}
+
+void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *tag_set)
+{
+	sbitmap_queue_free(&tag_set->__bitmap_tags);
+	sbitmap_queue_free(&tag_set->__breserved_tags);
+}
+
+struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
+				     unsigned int total_tags,
 				     unsigned int reserved_tags,
 				     int node, int alloc_policy)
 {
@@ -480,6 +506,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
 {
+	/* Do not use tags->{bitmap, breserved}_tags */
 	sbitmap_queue_free(&tags->__bitmap_tags);
 	sbitmap_queue_free(&tags->__breserved_tags);
 	kfree(tags);
@@ -538,6 +565,11 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 	return 0;
 }
 
+void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set, unsigned int size)
+{
+	sbitmap_queue_resize(&set->__bitmap_tags, size - set->reserved_tags);
+}
+
 /**
  * blk_mq_unique_tag() - return a tag that is unique queue-wide
  * @rq: request for which to compute a unique tag
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 4f866a39b926..0a57e4f041a9 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -25,7 +25,12 @@ struct blk_mq_tags {
 };
 
 
-extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
+extern bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set);
+extern void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *tag_set);
+extern struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *tag_set,
+					    unsigned int nr_tags,
+					    unsigned int reserved_tags,
+					    int node, int alloc_policy);
 extern void blk_mq_free_tags(struct blk_mq_tags *tags);
 
 extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
@@ -34,6 +39,9 @@ extern void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
 extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 					struct blk_mq_tags **tags,
 					unsigned int depth, bool can_grow);
+extern void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set,
+					     unsigned int size);
+
 extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
 void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
 		void *priv);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 375f5064c183..d10e24bee7d9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2126,7 +2126,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 
-	tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
+	tags = blk_mq_init_tags(set, nr_tags, reserved_tags, node,
 				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
 	if (!tags)
 		return NULL;
@@ -2980,8 +2980,10 @@ static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 	return 0;
 
 out_unwind:
-	while (--i >= 0)
+	while (--i >= 0) {
 		blk_mq_free_rq_map(set->tags[i]);
+		set->tags[i] = NULL;
+	}
 
 	return -ENOMEM;
 }
@@ -3147,11 +3149,28 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (ret)
 		goto out_free_mq_map;
 
+	if (blk_mq_is_sbitmap_shared(set)) {
+		if (!blk_mq_init_shared_sbitmap(set)) {
+			ret = -ENOMEM;
+			goto out_free_mq_rq_maps;
+		}
+
+		for (i = 0; i < set->nr_hw_queues; i++) {
+			struct blk_mq_tags *tags = set->tags[i];
+
+			tags->bitmap_tags = &set->__bitmap_tags;
+			tags->breserved_tags = &set->__breserved_tags;
+		}
+	}
+
 	mutex_init(&set->tag_list_lock);
 	INIT_LIST_HEAD(&set->tag_list);
 
 	return 0;
 
+out_free_mq_rq_maps:
+	for (i = 0; i < set->nr_hw_queues; i++)
+		blk_mq_free_rq_map(set->tags[i]);
 out_free_mq_map:
 	for (i = 0; i < set->nr_maps; i++) {
 		kfree(set->map[i].mq_map);
@@ -3170,6 +3189,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 	for (i = 0; i < set->nr_hw_queues; i++)
 		blk_mq_free_map_and_requests(set, i);
 
+	if (blk_mq_is_sbitmap_shared(set))
+		blk_mq_exit_shared_sbitmap(set);
+
 	for (j = 0; j < set->nr_maps; j++) {
 		kfree(set->map[j].mq_map);
 		set->map[j].mq_map = NULL;
@@ -3206,6 +3228,8 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 		if (!hctx->sched_tags) {
 			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
 							false);
+			if (!ret && blk_mq_is_sbitmap_shared(set))
+				blk_mq_tag_resize_shared_sbitmap(set, nr);
 		} else {
 			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
 							nr, true);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 10bfdfb494fa..dde2d29f0ce5 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -158,6 +158,11 @@ struct blk_mq_alloc_data {
 	struct blk_mq_hw_ctx *hctx;
 };
 
+static inline bool blk_mq_is_sbitmap_shared(struct blk_mq_tag_set *tag_set)
+{
+	return tag_set->flags & BLK_MQ_F_TAG_HCTX_SHARED;
+}
+
 static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
 {
 	if (data->flags & BLK_MQ_REQ_INTERNAL)
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index ab1618007600..f9db604c3b10 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -245,6 +245,8 @@ struct blk_mq_tag_set {
 	unsigned int		flags;
 	void			*driver_data;
 
+	struct sbitmap_queue	__bitmap_tags;
+	struct sbitmap_queue	__breserved_tags;
 	struct blk_mq_tags	**tags;
 
 	struct mutex		tag_list_lock;
@@ -388,6 +390,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_QUEUE_SHARED	= 1 << 1,
+	BLK_MQ_F_TAG_HCTX_SHARED	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
-- 
2.17.1


* [PATCH RFC v6 05/10] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  2020-03-05 11:54 [PATCH RFC v6 00/10] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (3 preceding siblings ...)
  2020-03-05 11:54 ` [PATCH RFC v6 04/10] blk-mq: Facilitate a shared sbitmap per tagset John Garry
@ 2020-03-05 11:54 ` John Garry
  2020-03-05 12:52   ` Hannes Reinecke
  2020-03-05 11:54 ` [PATCH RFC v6 06/10] scsi: Add template flag 'host_tagset' John Garry
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-03-05 11:54 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, John Garry

When a set-wide shared tag sbitmap is used, it is no longer valid to
examine the per-hctx tags to get the active requests for a given hctx.

As such, add support for the shared sbitmap by using an intermediate
per-hctx sbitmap, setting a bit for each active tag in the shared sbitmap
which belongs to the specific hctx.
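
As a usage note (assuming debugfs is mounted at /sys/kernel/debug), the
file produced by this handler can then be read per hctx, e.g.:

  cat /sys/kernel/debug/block/<disk>/hctx0/tags_bitmap

and with a shared sbitmap it shows only the tags whose in-flight
requests belong to that hctx, rather than the whole set-wide bitmap.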

Originally-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-debugfs.c | 57 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index fca87470e267..5c0db0309ebb 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -474,20 +474,73 @@ static int hctx_tags_show(void *data, struct seq_file *m)
 	return res;
 }
 
+struct hctx_sb_data {
+	struct sbitmap		*sb;	/* output bitmap */
+	struct blk_mq_hw_ctx	*hctx;	/* input hctx */
+};
+
+static bool hctx_filter_fn(struct blk_mq_hw_ctx *hctx, struct request *req,
+			   void *priv, bool reserved)
+{
+	struct hctx_sb_data *hctx_sb_data = priv;
+
+	if (hctx == hctx_sb_data->hctx)
+		sbitmap_set_bit(hctx_sb_data->sb, req->tag);
+
+	return true;
+}
+
+static void hctx_filter_sb(struct sbitmap *sb, struct blk_mq_hw_ctx *hctx)
+{
+	struct hctx_sb_data hctx_sb_data = { .sb = sb, .hctx = hctx };
+
+	blk_mq_queue_tag_busy_iter(hctx->queue, hctx_filter_fn, &hctx_sb_data);
+}
+
 static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
 {
 	struct blk_mq_hw_ctx *hctx = data;
 	struct request_queue *q = hctx->queue;
+	struct blk_mq_tag_set *set = q->tag_set;
+	struct sbitmap shared_sb, *sb;
+	bool shared_sbitmap;
 	int res;
 
+	if (!set)
+		return 0;
+
+	shared_sbitmap = blk_mq_is_sbitmap_shared(set);
+
+	if (shared_sbitmap) {
+		/*
+		 * We could use the allocated sbitmap for that hctx here, but
+		 * that would mean that we would need to clean it prior to use.
+		 */
+		res = sbitmap_init_node(&shared_sb,
+					set->__bitmap_tags.sb.depth,
+					set->__bitmap_tags.sb.shift,
+					GFP_KERNEL, NUMA_NO_NODE);
+		if (res)
+			return res;
+		sb = &shared_sb;
+	} else {
+		sb = &hctx->tags->bitmap_tags->sb;
+	}
+
 	res = mutex_lock_interruptible(&q->sysfs_lock);
 	if (res)
 		goto out;
-	if (hctx->tags)
-		sbitmap_bitmap_show(&hctx->tags->bitmap_tags->sb, m);
+	if (hctx->tags) {
+		if (shared_sbitmap)
+			hctx_filter_sb(sb, hctx);
+		sbitmap_bitmap_show(sb, m);
+	}
+
 	mutex_unlock(&q->sysfs_lock);
 
 out:
+	if (shared_sbitmap)
+		sbitmap_free(&shared_sb);
 	return res;
 }
 
-- 
2.17.1


* [PATCH RFC v6 06/10] scsi: Add template flag 'host_tagset'
  2020-03-05 11:54 [PATCH RFC v6 00/10] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (4 preceding siblings ...)
  2020-03-05 11:54 ` [PATCH RFC v6 05/10] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap John Garry
@ 2020-03-05 11:54 ` John Garry
  2020-03-06 11:12   ` John Garry
  2020-03-05 11:54 ` [PATCH RFC v6 07/10] scsi: hisi_sas: Switch v3 hw to MQ John Garry
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-03-05 11:54 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev,
	Hannes Reinecke, John Garry

From: Hannes Reinecke <hare@suse.com>

Add a host template flag 'host_tagset' so that a hostwide tagset can be
shared across multiple reply queues after the SCSI host's reply queues
are converted to blk-mq hw queues.
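
As a sketch of the driver-side usage (the example_* names are made up,
not from this patch): an LLDD sets the flag in its template and exposes
its HW queue count before adding the host:

#include <scsi/scsi_host.h>

static struct scsi_host_template example_sht = {
        .name           = "example_hba",
        .can_queue      = 1024, /* hostwide depth, shared by all hw queues */
        .host_tagset    = 1,    /* flag added by this patch */
        /* ... usual ops and limits ... */
};

static int example_add_host(struct Scsi_Host *shost, struct device *dev,
                            unsigned int nr_queues)
{
        shost->nr_hw_queues = nr_queues;   /* expose the real HW queues */
        return scsi_add_host(shost, dev);
}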

Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/scsi/scsi_lib.c  | 2 ++
 include/scsi/scsi_host.h | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 610ee41fa54c..84788ccc2672 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1901,6 +1901,8 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 	shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	shost->tag_set.flags |=
 		BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
+	if (shost->hostt->host_tagset)
+		shost->tag_set.flags |= BLK_MQ_F_TAG_HCTX_SHARED;
 	shost->tag_set.driver_data = shost;
 
 	return blk_mq_alloc_tag_set(&shost->tag_set);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index f577647bf5f2..4fd0af0883dd 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -429,6 +429,9 @@ struct scsi_host_template {
 	/* True if the low-level driver supports blk-mq only */
 	unsigned force_blk_mq:1;
 
+	/* True if the host uses host-wide tagspace */
+	unsigned host_tagset:1;
+
 	/*
 	 * Countdown for host blocking with no commands outstanding.
 	 */
-- 
2.17.1


* [PATCH RFC v6 07/10] scsi: hisi_sas: Switch v3 hw to MQ
  2020-03-05 11:54 [PATCH RFC v6 00/10] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (5 preceding siblings ...)
  2020-03-05 11:54 ` [PATCH RFC v6 06/10] scsi: Add template flag 'host_tagset' John Garry
@ 2020-03-05 11:54 ` John Garry
  2020-03-05 12:52   ` Hannes Reinecke
  2020-03-05 11:54 ` [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters " John Garry
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-03-05 11:54 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, John Garry

Now that the block layer provides a shared tag, we can switch the driver
to expose all HW queues.
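
A rough sketch of the idea (illustrative only, not the driver code
itself): with the HW queues exposed, the delivery queue for a command
comes straight from the block layer tag, replacing the per-CPU
reply_map:

#include <linux/blk-mq.h>
#include <scsi/scsi_cmnd.h>

/* Sketch: pick the delivery queue index for a SCSI command */
static u16 example_pick_dq(struct scsi_cmnd *scmd)
{
        u32 unique_tag = blk_mq_unique_tag(scmd->request);

        return blk_mq_unique_tag_to_hwq(unique_tag);
}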

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/scsi/hisi_sas/hisi_sas.h       |  3 +-
 drivers/scsi/hisi_sas/hisi_sas_main.c  | 36 ++++++-----
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 87 +++++++++++---------------
 3 files changed, 56 insertions(+), 70 deletions(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas.h b/drivers/scsi/hisi_sas/hisi_sas.h
index 2bdd64648ef0..e6acbf940712 100644
--- a/drivers/scsi/hisi_sas/hisi_sas.h
+++ b/drivers/scsi/hisi_sas/hisi_sas.h
@@ -8,6 +8,8 @@
 #define _HISI_SAS_H_
 
 #include <linux/acpi.h>
+#include <linux/blk-mq.h>
+#include <linux/blk-mq-pci.h>
 #include <linux/clk.h>
 #include <linux/debugfs.h>
 #include <linux/dmapool.h>
@@ -431,7 +433,6 @@ struct hisi_hba {
 	u32 intr_coal_count;	/* Interrupt count to coalesce */
 
 	int cq_nvecs;
-	unsigned int *reply_map;
 
 	/* bist */
 	enum sas_linkrate debugfs_bist_linkrate;
diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c
index 9a6deb21fe4d..3dbf6c27dd64 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_main.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_main.c
@@ -417,6 +417,7 @@ static int hisi_sas_task_prep(struct sas_task *task,
 	struct device *dev = hisi_hba->dev;
 	int dlvry_queue_slot, dlvry_queue, rc, slot_idx;
 	int n_elem = 0, n_elem_dif = 0, n_elem_req = 0;
+	struct scsi_cmnd *scmd = NULL;
 	struct hisi_sas_dq *dq;
 	unsigned long flags;
 	int wr_q_index;
@@ -432,10 +433,23 @@ static int hisi_sas_task_prep(struct sas_task *task,
 		return -ECOMM;
 	}
 
-	if (hisi_hba->reply_map) {
-		int cpu = raw_smp_processor_id();
-		unsigned int dq_index = hisi_hba->reply_map[cpu];
+	if (task->uldd_task) {
+		struct ata_queued_cmd *qc;
 
+		if (dev_is_sata(device)) {
+			qc = task->uldd_task;
+			scmd = qc->scsicmd;
+		} else {
+			scmd = task->uldd_task;
+		}
+	}
+
+	if (scmd) {
+		unsigned int dq_index;
+		u32 blk_tag;
+
+		blk_tag = blk_mq_unique_tag(scmd->request);
+		dq_index = blk_mq_unique_tag_to_hwq(blk_tag);
 		*dq_pointer = dq = &hisi_hba->dq[dq_index];
 	} else {
 		*dq_pointer = dq = sas_dev->dq;
@@ -464,21 +478,9 @@ static int hisi_sas_task_prep(struct sas_task *task,
 
 	if (hisi_hba->hw->slot_index_alloc)
 		rc = hisi_hba->hw->slot_index_alloc(hisi_hba, device);
-	else {
-		struct scsi_cmnd *scsi_cmnd = NULL;
-
-		if (task->uldd_task) {
-			struct ata_queued_cmd *qc;
+	else
+		rc = hisi_sas_slot_index_alloc(hisi_hba, scmd);
 
-			if (dev_is_sata(device)) {
-				qc = task->uldd_task;
-				scsi_cmnd = qc->scsicmd;
-			} else {
-				scsi_cmnd = task->uldd_task;
-			}
-		}
-		rc  = hisi_sas_slot_index_alloc(hisi_hba, scsi_cmnd);
-	}
 	if (rc < 0)
 		goto err_out_dif_dma_unmap;
 
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index a2debe0c8185..9d421c2ddce4 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -2360,68 +2360,36 @@ static irqreturn_t cq_interrupt_v3_hw(int irq_no, void *p)
 	return IRQ_WAKE_THREAD;
 }
 
-static void setup_reply_map_v3_hw(struct hisi_hba *hisi_hba, int nvecs)
+static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
 {
-	const struct cpumask *mask;
-	int queue, cpu;
+	int vectors;
+	int max_msi = HISI_SAS_MSI_COUNT_V3_HW, min_msi;
+	struct Scsi_Host *shost = hisi_hba->shost;
+	struct irq_affinity desc = {
+		.pre_vectors = BASE_VECTORS_V3_HW,
+	};
 
-	for (queue = 0; queue < nvecs; queue++) {
-		struct hisi_sas_cq *cq = &hisi_hba->cq[queue];
+	min_msi = MIN_AFFINE_VECTORS_V3_HW;
+	vectors = pci_alloc_irq_vectors_affinity(hisi_hba->pci_dev,
+						 min_msi, max_msi,
+						 PCI_IRQ_MSI |
+						 PCI_IRQ_AFFINITY,
+						 &desc);
+	if (vectors < 0)
+		return -ENOENT;
 
-		mask = pci_irq_get_affinity(hisi_hba->pci_dev, queue +
-					    BASE_VECTORS_V3_HW);
-		if (!mask)
-			goto fallback;
-		cq->irq_mask = mask;
-		for_each_cpu(cpu, mask)
-			hisi_hba->reply_map[cpu] = queue;
-	}
-	return;
 
-fallback:
-	for_each_possible_cpu(cpu)
-		hisi_hba->reply_map[cpu] = cpu % hisi_hba->queue_count;
-	/* Don't clean all CQ masks */
+	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW;
+	shost->nr_hw_queues = hisi_hba->cq_nvecs;
+
+	return 0;
 }
 
 static int interrupt_init_v3_hw(struct hisi_hba *hisi_hba)
 {
 	struct device *dev = hisi_hba->dev;
 	struct pci_dev *pdev = hisi_hba->pci_dev;
-	int vectors, rc, i;
-	int max_msi = HISI_SAS_MSI_COUNT_V3_HW, min_msi;
-
-	if (auto_affine_msi_experimental) {
-		struct irq_affinity desc = {
-			.pre_vectors = BASE_VECTORS_V3_HW,
-		};
-
-		dev_info(dev, "Enable MSI auto-affinity\n");
-
-		min_msi = MIN_AFFINE_VECTORS_V3_HW;
-
-		hisi_hba->reply_map = devm_kcalloc(dev, nr_cpu_ids,
-						   sizeof(unsigned int),
-						   GFP_KERNEL);
-		if (!hisi_hba->reply_map)
-			return -ENOMEM;
-		vectors = pci_alloc_irq_vectors_affinity(hisi_hba->pci_dev,
-							 min_msi, max_msi,
-							 PCI_IRQ_MSI |
-							 PCI_IRQ_AFFINITY,
-							 &desc);
-		if (vectors < 0)
-			return -ENOENT;
-		setup_reply_map_v3_hw(hisi_hba, vectors - BASE_VECTORS_V3_HW);
-	} else {
-		min_msi = max_msi;
-		vectors = pci_alloc_irq_vectors(hisi_hba->pci_dev, min_msi,
-						max_msi, PCI_IRQ_MSI);
-		if (vectors < 0)
-			return vectors;
-	}
-
-	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW;
+	int rc, i;
 
 	rc = devm_request_irq(dev, pci_irq_vector(pdev, 1),
 			      int_phy_up_down_bcast_v3_hw, 0,
@@ -3068,6 +3036,15 @@ static int debugfs_set_bist_v3_hw(struct hisi_hba *hisi_hba, bool enable)
 	return 0;
 }
 
+static int hisi_sas_map_queues(struct Scsi_Host *shost)
+{
+	struct hisi_hba *hisi_hba = shost_priv(shost);
+	struct blk_mq_queue_map *qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
+
+	return blk_mq_pci_map_queues(qmap, hisi_hba->pci_dev,
+				     BASE_VECTORS_V3_HW);
+}
+
 static struct scsi_host_template sht_v3_hw = {
 	.name			= DRV_NAME,
 	.module			= THIS_MODULE,
@@ -3076,6 +3053,7 @@ static struct scsi_host_template sht_v3_hw = {
 	.slave_configure	= hisi_sas_slave_configure,
 	.scan_finished		= hisi_sas_scan_finished,
 	.scan_start		= hisi_sas_scan_start,
+	.map_queues		= hisi_sas_map_queues,
 	.change_queue_depth	= sas_change_queue_depth,
 	.bios_param		= sas_bios_param,
 	.this_id		= -1,
@@ -3092,6 +3070,7 @@ static struct scsi_host_template sht_v3_hw = {
 	.shost_attrs		= host_attrs_v3_hw,
 	.tag_alloc_policy	= BLK_TAG_ALLOC_RR,
 	.host_reset             = hisi_sas_host_reset,
+	.host_tagset		= 1,
 };
 
 static const struct hisi_sas_hw hisi_sas_v3_hw = {
@@ -3263,6 +3242,10 @@ hisi_sas_v3_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (hisi_sas_debugfs_enable)
 		hisi_sas_debugfs_init(hisi_hba);
 
+	rc = interrupt_preinit_v3_hw(hisi_hba);
+	if (rc)
+		goto err_out_ha;
+	dev_err(dev, "%d hw queues\n", shost->nr_hw_queues);
 	rc = scsi_add_host(shost, dev);
 	if (rc)
 		goto err_out_ha;
-- 
2.17.1


* [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-03-05 11:54 [PATCH RFC v6 00/10] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (6 preceding siblings ...)
  2020-03-05 11:54 ` [PATCH RFC v6 07/10] scsi: hisi_sas: Switch v3 hw to MQ John Garry
@ 2020-03-05 11:54 ` John Garry
  2020-04-07 11:14   ` Kashyap Desai
  2020-03-05 11:54 ` [PATCH RFC v6 09/10] smartpqi: enable host tagset John Garry
  2020-03-05 11:54 ` [PATCH RFC v6 10/10] hpsa: enable host_tagset and switch to MQ John Garry
  9 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-03-05 11:54 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev,
	Hannes Reinecke, John Garry

From: Hannes Reinecke <hare@suse.com>

Fusion adapters can steer completions to individual queues, and
we now have support for shared host-wide tags.
So we can enable multiqueue support for fusion adapters and
drop the hand-crafted interrupt affinity settings.
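
As a sketch (the example_* name and first_io_vector parameter are
illustrative), the map_queues pattern used here maps each blk-mq hw
queue onto the PCI IRQ vectors, skipping the reserved low-latency
vectors at the front:

#include <linux/blk-mq.h>
#include <linux/blk-mq-pci.h>
#include <linux/pci.h>
#include <scsi/scsi_host.h>

static int example_map_queues(struct Scsi_Host *shost, struct pci_dev *pdev,
                              unsigned int first_io_vector)
{
        struct blk_mq_queue_map *qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];

        /* hw queue i follows the affinity of vector (first_io_vector + i) */
        return blk_mq_pci_map_queues(qmap, pdev, first_io_vector);
}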

Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/scsi/megaraid/megaraid_sas.h        |  1 -
 drivers/scsi/megaraid/megaraid_sas_base.c   | 59 +++++++--------------
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 24 +++++----
 3 files changed, 32 insertions(+), 52 deletions(-)

diff --git a/drivers/scsi/megaraid/megaraid_sas.h b/drivers/scsi/megaraid/megaraid_sas.h
index 83d8c4cb1ad5..04ef440073f1 100644
--- a/drivers/scsi/megaraid/megaraid_sas.h
+++ b/drivers/scsi/megaraid/megaraid_sas.h
@@ -2261,7 +2261,6 @@ enum MR_PERF_MODE {
 
 struct megasas_instance {
 
-	unsigned int *reply_map;
 	__le32 *producer;
 	dma_addr_t producer_h;
 	__le32 *consumer;
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index fd4b5ac6ac5b..cb788a14e35e 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -37,6 +37,7 @@
 #include <linux/poll.h>
 #include <linux/vmalloc.h>
 #include <linux/irq_poll.h>
+#include <linux/blk-mq-pci.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -3114,6 +3115,19 @@ megasas_bios_param(struct scsi_device *sdev, struct block_device *bdev,
 	return 0;
 }
 
+static int megasas_map_queues(struct Scsi_Host *shost)
+{
+	struct megasas_instance *instance;
+
+	instance = (struct megasas_instance *)shost->hostdata;
+
+	if (!instance->smp_affinity_enable)
+		return 0;
+
+	return blk_mq_pci_map_queues(&shost->tag_set.map[HCTX_TYPE_DEFAULT],
+			instance->pdev, instance->low_latency_index_start);
+}
+
 static void megasas_aen_polling(struct work_struct *work);
 
 /**
@@ -3422,8 +3436,10 @@ static struct scsi_host_template megasas_template = {
 	.eh_timed_out = megasas_reset_timer,
 	.shost_attrs = megaraid_host_attrs,
 	.bios_param = megasas_bios_param,
+	.map_queues = megasas_map_queues,
 	.change_queue_depth = scsi_change_queue_depth,
 	.max_segment_size = 0xffffffff,
+	.host_tagset = 1,
 };
 
 /**
@@ -5707,34 +5723,6 @@ megasas_setup_jbod_map(struct megasas_instance *instance)
 		instance->use_seqnum_jbod_fp = false;
 }
 
-static void megasas_setup_reply_map(struct megasas_instance *instance)
-{
-	const struct cpumask *mask;
-	unsigned int queue, cpu, low_latency_index_start;
-
-	low_latency_index_start = instance->low_latency_index_start;
-
-	for (queue = low_latency_index_start; queue < instance->msix_vectors; queue++) {
-		mask = pci_irq_get_affinity(instance->pdev, queue);
-		if (!mask)
-			goto fallback;
-
-		for_each_cpu(cpu, mask)
-			instance->reply_map[cpu] = queue;
-	}
-	return;
-
-fallback:
-	queue = low_latency_index_start;
-	for_each_possible_cpu(cpu) {
-		instance->reply_map[cpu] = queue;
-		if (queue == (instance->msix_vectors - 1))
-			queue = low_latency_index_start;
-		else
-			queue++;
-	}
-}
-
 /**
  * megasas_get_device_list -	Get the PD and LD device list from FW.
  * @instance:			Adapter soft state
@@ -6157,8 +6145,6 @@ static int megasas_init_fw(struct megasas_instance *instance)
 			goto fail_init_adapter;
 	}
 
-	megasas_setup_reply_map(instance);
-
 	dev_info(&instance->pdev->dev,
 		"current msix/online cpus\t: (%d/%d)\n",
 		instance->msix_vectors, (unsigned int)num_online_cpus());
@@ -6792,6 +6778,9 @@ static int megasas_io_attach(struct megasas_instance *instance)
 	host->max_id = MEGASAS_MAX_DEV_PER_CHANNEL;
 	host->max_lun = MEGASAS_MAX_LUN;
 	host->max_cmd_len = 16;
+	if (instance->adapter_type != MFI_SERIES && instance->msix_vectors > 0)
+		host->nr_hw_queues = instance->msix_vectors -
+			instance->low_latency_index_start;
 
 	/*
 	 * Notify the mid-layer about the new controller
@@ -6959,11 +6948,6 @@ static inline int megasas_alloc_mfi_ctrl_mem(struct megasas_instance *instance)
  */
 static int megasas_alloc_ctrl_mem(struct megasas_instance *instance)
 {
-	instance->reply_map = kcalloc(nr_cpu_ids, sizeof(unsigned int),
-				      GFP_KERNEL);
-	if (!instance->reply_map)
-		return -ENOMEM;
-
 	switch (instance->adapter_type) {
 	case MFI_SERIES:
 		if (megasas_alloc_mfi_ctrl_mem(instance))
@@ -6980,8 +6964,6 @@ static int megasas_alloc_ctrl_mem(struct megasas_instance *instance)
 
 	return 0;
  fail:
-	kfree(instance->reply_map);
-	instance->reply_map = NULL;
 	return -ENOMEM;
 }
 
@@ -6994,7 +6976,6 @@ static int megasas_alloc_ctrl_mem(struct megasas_instance *instance)
  */
 static inline void megasas_free_ctrl_mem(struct megasas_instance *instance)
 {
-	kfree(instance->reply_map);
 	if (instance->adapter_type == MFI_SERIES) {
 		if (instance->producer)
 			dma_free_coherent(&instance->pdev->dev, sizeof(u32),
@@ -7682,8 +7663,6 @@ megasas_resume(struct pci_dev *pdev)
 			goto fail_reenable_msix;
 	}
 
-	megasas_setup_reply_map(instance);
-
 	if (instance->adapter_type != MFI_SERIES) {
 		megasas_reset_reply_desc(instance);
 		if (megasas_ioc_init_fusion(instance)) {
diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index b2ad96564484..b7adf800a958 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -373,24 +373,24 @@ megasas_get_msix_index(struct megasas_instance *instance,
 {
 	int sdev_busy;
 
-	/* nr_hw_queue = 1 for MegaRAID */
-	struct blk_mq_hw_ctx *hctx =
-		scmd->device->request_queue->queue_hw_ctx[0];
+	struct blk_mq_hw_ctx *hctx = scmd->request->mq_hctx;
 
 	sdev_busy = atomic_read(&hctx->nr_active);
 
 	if (instance->perf_mode == MR_BALANCED_PERF_MODE &&
-	    sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH))
+	    sdev_busy > (data_arms * MR_DEVICE_HIGH_IOPS_DEPTH)) {
 		cmd->request_desc->SCSIIO.MSIxIndex =
 			mega_mod64((atomic64_add_return(1, &instance->high_iops_outstanding) /
 					MR_HIGH_IOPS_BATCH_COUNT), instance->low_latency_index_start);
-	else if (instance->msix_load_balance)
+	} else if (instance->msix_load_balance) {
 		cmd->request_desc->SCSIIO.MSIxIndex =
 			(mega_mod64(atomic64_add_return(1, &instance->total_io_count),
 				instance->msix_vectors));
-	else
-		cmd->request_desc->SCSIIO.MSIxIndex =
-			instance->reply_map[raw_smp_processor_id()];
+	} else {
+		u32 tag = blk_mq_unique_tag(scmd->request);
+
+		cmd->request_desc->SCSIIO.MSIxIndex = blk_mq_unique_tag_to_hwq(tag) + instance->low_latency_index_start;
+	}
 }
 
 /**
@@ -3384,7 +3384,7 @@ megasas_build_and_issue_cmd_fusion(struct megasas_instance *instance,
 {
 	struct megasas_cmd_fusion *cmd, *r1_cmd = NULL;
 	union MEGASAS_REQUEST_DESCRIPTOR_UNION *req_desc;
-	u32 index;
+	u32 index, blk_tag, unique_tag;
 
 	if ((megasas_cmd_type(scmd) == READ_WRITE_LDIO) &&
 		instance->ldio_threshold &&
@@ -3400,7 +3400,9 @@ megasas_build_and_issue_cmd_fusion(struct megasas_instance *instance,
 		return SCSI_MLQUEUE_HOST_BUSY;
 	}
 
-	cmd = megasas_get_cmd_fusion(instance, scmd->request->tag);
+	unique_tag = blk_mq_unique_tag(scmd->request);
+	blk_tag = blk_mq_unique_tag_to_tag(unique_tag);
+	cmd = megasas_get_cmd_fusion(instance, blk_tag);
 
 	if (!cmd) {
 		atomic_dec(&instance->fw_outstanding);
@@ -3441,7 +3443,7 @@ megasas_build_and_issue_cmd_fusion(struct megasas_instance *instance,
 	 */
 	if (cmd->r1_alt_dev_handle != MR_DEVHANDLE_INVALID) {
 		r1_cmd = megasas_get_cmd_fusion(instance,
-				(scmd->request->tag + instance->max_fw_cmds));
+				(blk_tag + instance->max_fw_cmds));
 		megasas_prepare_secondRaid1_IO(instance, cmd, r1_cmd);
 	}
 
-- 
2.17.1


* [PATCH RFC v6 09/10] smartpqi: enable host tagset
  2020-03-05 11:54 [PATCH RFC v6 00/10] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (7 preceding siblings ...)
  2020-03-05 11:54 ` [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters " John Garry
@ 2020-03-05 11:54 ` John Garry
  2020-03-05 11:54 ` [PATCH RFC v6 10/10] hpsa: enable host_tagset and switch to MQ John Garry
  9 siblings, 0 replies; 35+ messages in thread
From: John Garry @ 2020-03-05 11:54 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev

From: Hannes Reinecke <hare@suse.de>

Enable host tagset for smartpqi; with this we can use the request
tag to look up the command from the pool, avoiding the list iteration
in the hot path.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/scsi/smartpqi/smartpqi_init.c | 38 ++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index b7492568e02f..106335bf5d07 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -575,17 +575,29 @@ static inline void pqi_reinit_io_request(struct pqi_io_request *io_request)
 }
 
 static struct pqi_io_request *pqi_alloc_io_request(
-	struct pqi_ctrl_info *ctrl_info)
+	struct pqi_ctrl_info *ctrl_info, struct scsi_cmnd *scmd)
 {
 	struct pqi_io_request *io_request;
+	unsigned int limit = PQI_RESERVED_IO_SLOTS;
 	u16 i = ctrl_info->next_io_request_slot;	/* benignly racy */
 
-	while (1) {
+	if (scmd) {
+		u32 blk_tag = blk_mq_unique_tag(scmd->request);
+
+		i = blk_mq_unique_tag_to_tag(blk_tag) + limit;
 		io_request = &ctrl_info->io_request_pool[i];
-		if (atomic_inc_return(&io_request->refcount) == 1)
-			break;
-		atomic_dec(&io_request->refcount);
-		i = (i + 1) % ctrl_info->max_io_slots;
+		if (WARN_ON(atomic_inc_return(&io_request->refcount) > 1)) {
+			atomic_dec(&io_request->refcount);
+			return NULL;
+		}
+	} else {
+		while (1) {
+			io_request = &ctrl_info->io_request_pool[i];
+			if (atomic_inc_return(&io_request->refcount) == 1)
+				break;
+			atomic_dec(&io_request->refcount);
+			i = (i + 1) % limit;
+		}
 	}
 
 	/* benignly racy */
@@ -4075,7 +4087,7 @@ static int pqi_submit_raid_request_synchronous(struct pqi_ctrl_info *ctrl_info,
 
 	atomic_inc(&ctrl_info->sync_cmds_outstanding);
 
-	io_request = pqi_alloc_io_request(ctrl_info);
+	io_request = pqi_alloc_io_request(ctrl_info, NULL);
 
 	put_unaligned_le16(io_request->index,
 		&(((struct pqi_raid_path_request *)request)->request_id));
@@ -5032,7 +5044,9 @@ static inline int pqi_raid_submit_scsi_cmd(struct pqi_ctrl_info *ctrl_info,
 {
 	struct pqi_io_request *io_request;
 
-	io_request = pqi_alloc_io_request(ctrl_info);
+	io_request = pqi_alloc_io_request(ctrl_info, scmd);
+	if (!io_request)
+		return SCSI_MLQUEUE_HOST_BUSY;
 
 	return pqi_raid_submit_scsi_cmd_with_io_request(ctrl_info, io_request,
 		device, scmd, queue_group);
@@ -5230,7 +5244,10 @@ static int pqi_aio_submit_io(struct pqi_ctrl_info *ctrl_info,
 	struct pqi_io_request *io_request;
 	struct pqi_aio_path_request *request;
 
-	io_request = pqi_alloc_io_request(ctrl_info);
+	io_request = pqi_alloc_io_request(ctrl_info, scmd);
+	if (!io_request)
+		return SCSI_MLQUEUE_HOST_BUSY;
+
 	io_request->io_complete_callback = pqi_aio_io_complete;
 	io_request->scmd = scmd;
 	io_request->raid_bypass = raid_bypass;
@@ -5657,7 +5674,7 @@ static int pqi_lun_reset(struct pqi_ctrl_info *ctrl_info,
 	DECLARE_COMPLETION_ONSTACK(wait);
 	struct pqi_task_management_request *request;
 
-	io_request = pqi_alloc_io_request(ctrl_info);
+	io_request = pqi_alloc_io_request(ctrl_info, NULL);
 	io_request->io_complete_callback = pqi_lun_reset_complete;
 	io_request->context = &wait;
 
@@ -6504,6 +6521,7 @@ static struct scsi_host_template pqi_driver_template = {
 	.map_queues = pqi_map_queues,
 	.sdev_attrs = pqi_sdev_attrs,
 	.shost_attrs = pqi_shost_attrs,
+	.host_tagset = 1,
 };
 
 static int pqi_register_scsi(struct pqi_ctrl_info *ctrl_info)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH RFC v6 10/10] hpsa: enable host_tagset and switch to MQ
  2020-03-05 11:54 [PATCH RFC v6 00/10] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (8 preceding siblings ...)
  2020-03-05 11:54 ` [PATCH RFC v6 09/10] smartpqi: enable host tagset John Garry
@ 2020-03-05 11:54 ` John Garry
  9 siblings, 0 replies; 35+ messages in thread
From: John Garry @ 2020-03-05 11:54 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev

From: Hannes Reinecke <hare@suse.de>

The Smart Array HBAs can steer interrupt completion, so this
patch switches the implementation to use multiqueue and enables
'host_tagset', as the HBA has a shared host-wide tagset.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/scsi/hpsa.c | 44 +++++++-------------------------------------
 drivers/scsi/hpsa.h |  1 -
 2 files changed, 7 insertions(+), 38 deletions(-)

diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index 1a4ddfacb458..a1d0e41a240a 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -974,6 +974,7 @@ static struct scsi_host_template hpsa_driver_template = {
 	.shost_attrs = hpsa_shost_attrs,
 	.max_sectors = 2048,
 	.no_write_same = 1,
+	.host_tagset = 1,
 };
 
 static inline u32 next_command(struct ctlr_info *h, u8 q)
@@ -1138,12 +1139,14 @@ static void dial_up_lockup_detection_on_fw_flash_complete(struct ctlr_info *h,
 static void __enqueue_cmd_and_start_io(struct ctlr_info *h,
 	struct CommandList *c, int reply_queue)
 {
+	u32 blk_tag = blk_mq_unique_tag(c->scsi_cmd->request);
+
 	dial_down_lockup_detection_during_fw_flash(h, c);
 	atomic_inc(&h->commands_outstanding);
 	if (c->device)
 		atomic_inc(&c->device->commands_outstanding);
 
-	reply_queue = h->reply_map[raw_smp_processor_id()];
+	reply_queue = blk_mq_unique_tag_to_hwq(blk_tag);
 	switch (c->cmd_type) {
 	case CMD_IOACCEL1:
 		set_ioaccel1_performant_mode(h, c, reply_queue);
@@ -5632,8 +5635,6 @@ static int hpsa_scsi_queue_command(struct Scsi_Host *sh, struct scsi_cmnd *cmd)
 	/* Get the ptr to our adapter structure out of cmd->host. */
 	h = sdev_to_hba(cmd->device);
 
-	BUG_ON(cmd->request->tag < 0);
-
 	dev = cmd->device->hostdata;
 	if (!dev) {
 		cmd->result = DID_NO_CONNECT << 16;
@@ -5809,7 +5810,7 @@ static int hpsa_scsi_host_alloc(struct ctlr_info *h)
 	sh->hostdata[0] = (unsigned long) h;
 	sh->irq = pci_irq_vector(h->pdev, 0);
 	sh->unique_id = sh->irq;
-
+	sh->nr_hw_queues = h->msix_vectors > 0 ? h->msix_vectors : 1;
 	h->scsi_host = sh;
 	return 0;
 }
@@ -5835,7 +5836,8 @@ static int hpsa_scsi_add_host(struct ctlr_info *h)
  */
 static int hpsa_get_cmd_index(struct scsi_cmnd *scmd)
 {
-	int idx = scmd->request->tag;
+	u32 blk_tag = blk_mq_unique_tag(scmd->request);
+	int idx = blk_mq_unique_tag_to_tag(blk_tag);
 
 	if (idx < 0)
 		return idx;
@@ -7435,26 +7437,6 @@ static void hpsa_disable_interrupt_mode(struct ctlr_info *h)
 	h->msix_vectors = 0;
 }
 
-static void hpsa_setup_reply_map(struct ctlr_info *h)
-{
-	const struct cpumask *mask;
-	unsigned int queue, cpu;
-
-	for (queue = 0; queue < h->msix_vectors; queue++) {
-		mask = pci_irq_get_affinity(h->pdev, queue);
-		if (!mask)
-			goto fallback;
-
-		for_each_cpu(cpu, mask)
-			h->reply_map[cpu] = queue;
-	}
-	return;
-
-fallback:
-	for_each_possible_cpu(cpu)
-		h->reply_map[cpu] = 0;
-}
-
 /* If MSI/MSI-X is supported by the kernel we will try to enable it on
  * controllers that are capable. If not, we use legacy INTx mode.
  */
@@ -7851,9 +7833,6 @@ static int hpsa_pci_init(struct ctlr_info *h)
 	if (err)
 		goto clean1;
 
-	/* setup mapping between CPU and reply queue */
-	hpsa_setup_reply_map(h);
-
 	err = hpsa_pci_find_memory_BAR(h->pdev, &h->paddr);
 	if (err)
 		goto clean2;	/* intmode+region, pci */
@@ -8579,7 +8558,6 @@ static struct workqueue_struct *hpsa_create_controller_wq(struct ctlr_info *h,
 
 static void hpda_free_ctlr_info(struct ctlr_info *h)
 {
-	kfree(h->reply_map);
 	kfree(h);
 }
 
@@ -8588,14 +8566,6 @@ static struct ctlr_info *hpda_alloc_ctlr_info(void)
 	struct ctlr_info *h;
 
 	h = kzalloc(sizeof(*h), GFP_KERNEL);
-	if (!h)
-		return NULL;
-
-	h->reply_map = kcalloc(nr_cpu_ids, sizeof(*h->reply_map), GFP_KERNEL);
-	if (!h->reply_map) {
-		kfree(h);
-		return NULL;
-	}
 	return h;
 }
 
diff --git a/drivers/scsi/hpsa.h b/drivers/scsi/hpsa.h
index f8c88fc7b80a..ea4a609e3eb7 100644
--- a/drivers/scsi/hpsa.h
+++ b/drivers/scsi/hpsa.h
@@ -161,7 +161,6 @@ struct bmic_controller_parameters {
 #pragma pack()
 
 struct ctlr_info {
-	unsigned int *reply_map;
 	int	ctlr;
 	char	devname[8];
 	char    *product_name;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 03/10] blk-mq: Use pointers for blk_mq_tags bitmap tags
  2020-03-05 11:54 ` [PATCH RFC v6 03/10] blk-mq: Use pointers for blk_mq_tags bitmap tags John Garry
@ 2020-03-05 12:42   ` Hannes Reinecke
  0 siblings, 0 replies; 35+ messages in thread
From: Hannes Reinecke @ 2020-03-05 12:42 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, ming.lei, bvanassche,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev

On 3/5/20 12:54 PM, John Garry wrote:
> Introduce pointers for the blk_mq_tags regular and reserved bitmap tags,
> with the goal of later being able to use a common shared tag bitmap across
> all HW contexts in a set.
> 
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>   block/bfq-iosched.c    |  4 ++--
>   block/blk-mq-debugfs.c |  8 ++++----
>   block/blk-mq-tag.c     | 41 ++++++++++++++++++++++-------------------
>   block/blk-mq-tag.h     |  7 +++++--
>   block/blk-mq.c         |  4 ++--
>   block/kyber-iosched.c  |  4 ++--
>   6 files changed, 37 insertions(+), 31 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 04/10] blk-mq: Facilitate a shared sbitmap per tagset
  2020-03-05 11:54 ` [PATCH RFC v6 04/10] blk-mq: Facilitate a shared sbitmap per tagset John Garry
@ 2020-03-05 12:49   ` Hannes Reinecke
  2020-03-05 13:52     ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Hannes Reinecke @ 2020-03-05 12:49 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, ming.lei, bvanassche,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev

On 3/5/20 12:54 PM, John Garry wrote:
> Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
> multiple reply queues with single hostwide tags.
> 
> In addition, these drivers want to use interrupt assignment in
> pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
> CPU hotplug may cause in-flight IO completion to not be serviced when an
> interrupt is shutdown.
> 
> To solve that problem, Ming's patchset to drain hctx's should ensure no
> IOs are missed in-flight [1].
> 
> However, to take advantage of that patchset, we need to map the HBA HW
> queues to blk mq hctx's; to do that, we need to expose the HBA HW queues.
> 
> In making that transition, the per-SCSI command request tags are no
> longer unique per Scsi host - they are just unique per hctx. As such, the
> HBA LLDD would have to generate this tag internally, which has a certain
> performance overhead.
> 
> However another problem is that blk mq assumes the host may accept
> (Scsi_host.can_queue * #hw queue) commands. In [2], we removed the Scsi
> host busy counter, which would stop the LLDD being sent more than
> .can_queue commands; however, we should still ensure that the block layer
> does not issue more than .can_queue commands to the Scsi host.
> 
> To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
> which may be requested at init time.
> 
> New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
> tagset to indicate whether the shared sbitmap should be used.
> 
> Even when BLK_MQ_F_TAG_HCTX_SHARED is set, we still allocate a full set of
> tags and requests per hctx; the reason for this is that if we only allocate
> tags and requests for a single hctx - like hctx0 - we may break block
> drivers which expect a request be associated with a specific hctx, i.e.
> not hctx0.
> 
> This is based on work originally from Ming Lei in [3] and from Bart's
> suggestion in [4].
> 
> [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> [1] https://lore.kernel.org/linux-block/20191014015043.25029-1-ming.lei@redhat.com/
> [2] https://lore.kernel.org/linux-scsi/20191025065855.6309-1-ming.lei@redhat.com/
> [3] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
> [4] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
> 
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>   block/blk-mq-tag.c     | 36 ++++++++++++++++++++++++++++++++++--
>   block/blk-mq-tag.h     | 10 +++++++++-
>   block/blk-mq.c         | 28 ++++++++++++++++++++++++++--
>   block/blk-mq.h         |  5 +++++
>   include/linux/blk-mq.h |  3 +++
>   5 files changed, 77 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index b153ecf8c5c6..dd704118fe83 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -220,7 +220,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>   	 * We can hit rq == NULL here, because the tagging functions
>   	 * test and set the bit before assigning ->rqs[].
>   	 */
> -	if (rq && rq->q == hctx->queue)
> +	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
>   		return iter_data->fn(hctx, rq, iter_data->data, reserved);
>   	return true;
>   }
> @@ -444,6 +444,7 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
>   		     round_robin, node))
>   		goto free_bitmap_tags;
>   
> +	/* We later overwrite these in case of per-set shared sbitmap */
>   	tags->bitmap_tags = &tags->__bitmap_tags;
>   	tags->breserved_tags = &tags->__breserved_tags;
>   
> @@ -453,7 +454,32 @@ static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
>   	return -ENOMEM;
>   }
>   
> -struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
> +bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set)
> +{
> +	unsigned int depth = tag_set->queue_depth - tag_set->reserved_tags;
> +	int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(tag_set->flags);
> +	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
> +	int node = tag_set->numa_node;
> +
> +	if (bt_alloc(&tag_set->__bitmap_tags, depth, round_robin, node))
> +		return false;
> +	if (bt_alloc(&tag_set->__breserved_tags, tag_set->reserved_tags,
> +		     round_robin, node))
> +		goto free_bitmap_tags;
> +	return true;
> +free_bitmap_tags:
> +	sbitmap_queue_free(&tag_set->__bitmap_tags);
> +	return false;
> +}
> +
> +void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *tag_set)
> +{
> +	sbitmap_queue_free(&tag_set->__bitmap_tags);
> +	sbitmap_queue_free(&tag_set->__breserved_tags);
> +}
> +
> +struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
> +				     unsigned int total_tags,
>   				     unsigned int reserved_tags,
>   				     int node, int alloc_policy)
>   {
> @@ -480,6 +506,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
>   
>   void blk_mq_free_tags(struct blk_mq_tags *tags)
>   {
> +	/* Do not use tags->{bitmap, breserved}_tags */
>   	sbitmap_queue_free(&tags->__bitmap_tags);
>   	sbitmap_queue_free(&tags->__breserved_tags);
>   	kfree(tags);
Hmm. This is essentially the same as blk_mq_exit_shared_sbitmap().
Please call into that function or clarify the relationship between the two.
And modify the comment; it's not about the 'use' but rather the freeing 
of the structures, and ->bitmap/breserved just point to the __ variants.
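
Just to illustrate, a minimal sketch of what I have in mind (the common
helper below is made up, not part of this patch):

static void blk_mq_free_bitmaps(struct sbitmap_queue *bitmap_tags,
				struct sbitmap_queue *breserved_tags)
{
	sbitmap_queue_free(bitmap_tags);
	sbitmap_queue_free(breserved_tags);
}

void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *tag_set)
{
	blk_mq_free_bitmaps(&tag_set->__bitmap_tags,
			    &tag_set->__breserved_tags);
}

void blk_mq_free_tags(struct blk_mq_tags *tags)
{
	/* Free the embedded bitmaps; for the shared sbitmap case the
	 * per-set bitmaps are freed in blk_mq_exit_shared_sbitmap().
	 */
	blk_mq_free_bitmaps(&tags->__bitmap_tags, &tags->__breserved_tags);
	kfree(tags);
}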

> @@ -538,6 +565,11 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>   	return 0;
>   }
>   
> +void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set, unsigned int size)
> +{
> +	sbitmap_queue_resize(&set->__bitmap_tags, size - set->reserved_tags);
> +}
> +
>   /**
>    * blk_mq_unique_tag() - return a tag that is unique queue-wide
>    * @rq: request for which to compute a unique tag
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index 4f866a39b926..0a57e4f041a9 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -25,7 +25,12 @@ struct blk_mq_tags {
>   };
>   
>   
> -extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int reserved_tags, int node, int alloc_policy);
> +extern bool blk_mq_init_shared_sbitmap(struct blk_mq_tag_set *tag_set);
> +extern void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *tag_set);
> +extern struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *tag_set,
> +					    unsigned int nr_tags,
> +					    unsigned int reserved_tags,
> +					    int node, int alloc_policy);
>   extern void blk_mq_free_tags(struct blk_mq_tags *tags);
>   
>   extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
> @@ -34,6 +39,9 @@ extern void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
>   extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
>   					struct blk_mq_tags **tags,
>   					unsigned int depth, bool can_grow);
> +extern void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set,
> +					     unsigned int size);
> +
>   extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
>   void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
>   		void *priv);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 375f5064c183..d10e24bee7d9 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2126,7 +2126,7 @@ struct blk_mq_tags *blk_mq_alloc_rq_map(struct blk_mq_tag_set *set,
>   	if (node == NUMA_NO_NODE)
>   		node = set->numa_node;
>   
> -	tags = blk_mq_init_tags(nr_tags, reserved_tags, node,
> +	tags = blk_mq_init_tags(set, nr_tags, reserved_tags, node,
>   				BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags));
>   	if (!tags)
>   		return NULL;
> @@ -2980,8 +2980,10 @@ static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
>   	return 0;
>   
>   out_unwind:
> -	while (--i >= 0)
> +	while (--i >= 0) {
>   		blk_mq_free_rq_map(set->tags[i]);
> +		set->tags[i] = NULL;
> +	}
>   
>   	return -ENOMEM;
>   }
> @@ -3147,11 +3149,28 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>   	if (ret)
>   		goto out_free_mq_map;
>   
> +	if (blk_mq_is_sbitmap_shared(set)) {
> +		if (!blk_mq_init_shared_sbitmap(set)) {
> +			ret = -ENOMEM;
> +			goto out_free_mq_rq_maps;
> +		}
> +
> +		for (i = 0; i < set->nr_hw_queues; i++) {
> +			struct blk_mq_tags *tags = set->tags[i];
> +
> +			tags->bitmap_tags = &set->__bitmap_tags;
> +			tags->breserved_tags = &set->__breserved_tags;
> +		}
> +	}
> +
>   	mutex_init(&set->tag_list_lock);
>   	INIT_LIST_HEAD(&set->tag_list);
>   
>   	return 0;
>   
> +out_free_mq_rq_maps:
> +	for (i = 0; i < set->nr_hw_queues; i++)
> +		blk_mq_free_rq_map(set->tags[i]);
>   out_free_mq_map:
>   	for (i = 0; i < set->nr_maps; i++) {
>   		kfree(set->map[i].mq_map);
> @@ -3170,6 +3189,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
>   	for (i = 0; i < set->nr_hw_queues; i++)
>   		blk_mq_free_map_and_requests(set, i);
>   
> +	if (blk_mq_is_sbitmap_shared(set))
> +		blk_mq_exit_shared_sbitmap(set);
> +
>   	for (j = 0; j < set->nr_maps; j++) {
>   		kfree(set->map[j].mq_map);
>   		set->map[j].mq_map = NULL;
> @@ -3206,6 +3228,8 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>   		if (!hctx->sched_tags) {
>   			ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
>   							false);
> +			if (!ret && blk_mq_is_sbitmap_shared(set))
> +				blk_mq_tag_resize_shared_sbitmap(set, nr);
>   		} else {
>   			ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
>   							nr, true);
> diff --git a/block/blk-mq.h b/block/blk-mq.h
> index 10bfdfb494fa..dde2d29f0ce5 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -158,6 +158,11 @@ struct blk_mq_alloc_data {
>   	struct blk_mq_hw_ctx *hctx;
>   };
>   
> +static inline bool blk_mq_is_sbitmap_shared(struct blk_mq_tag_set *tag_set)
> +{
> +	return tag_set->flags & BLK_MQ_F_TAG_HCTX_SHARED;
> +}
> +
When is that set?
And to my understanding it is contingent on the ->bitmap pointer _not_
pointing to ->__bitmap.
Can we use this as a test and drop this flag?
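
For illustration, the kind of test I mean would be roughly the below
(a sketch only; it assumes set->tags[] has already been populated, which
may not hold everywhere this is checked):

static inline bool blk_mq_is_sbitmap_shared(struct blk_mq_tag_set *tag_set)
{
	struct blk_mq_tags *tags = tag_set->tags[0];

	/* shared if the pointers were redirected away from the
	 * embedded per-tags bitmaps
	 */
	return tags && tags->bitmap_tags != &tags->__bitmap_tags;
}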

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 05/10] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap
  2020-03-05 11:54 ` [PATCH RFC v6 05/10] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap John Garry
@ 2020-03-05 12:52   ` Hannes Reinecke
  0 siblings, 0 replies; 35+ messages in thread
From: Hannes Reinecke @ 2020-03-05 12:52 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, ming.lei, bvanassche,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev

On 3/5/20 12:54 PM, John Garry wrote:
> Since a set-wide shared tag sbitmap may be used, it is no longer valid to
> examine the per-hctx tagset for getting the active requests for a hctx
> (when a shared sbitmap is used).
> 
> As such, add support for the shared sbitmap by using an intermediate
> sbitmap per hctx, iterating all active tags for the specific hctx in the
> shared sbitmap.
> 
> Originally-by: Bart Van Assche <bvanassche@acm.org>
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>   block/blk-mq-debugfs.c | 57 ++++++++++++++++++++++++++++++++++++++++--
>   1 file changed, 55 insertions(+), 2 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 07/10] scsi: hisi_sas: Switch v3 hw to MQ
  2020-03-05 11:54 ` [PATCH RFC v6 07/10] scsi: hisi_sas: Switch v3 hw to MQ John Garry
@ 2020-03-05 12:52   ` Hannes Reinecke
  0 siblings, 0 replies; 35+ messages in thread
From: Hannes Reinecke @ 2020-03-05 12:52 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, ming.lei, bvanassche,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev

On 3/5/20 12:54 PM, John Garry wrote:
> Now that the block layer provides a shared tag, we can switch the driver
> to expose all HW queues.
> 
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>   drivers/scsi/hisi_sas/hisi_sas.h       |  3 +-
>   drivers/scsi/hisi_sas/hisi_sas_main.c  | 36 ++++++-----
>   drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 87 +++++++++++---------------
>   3 files changed, 56 insertions(+), 70 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                               +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 04/10] blk-mq: Facilitate a shared sbitmap per tagset
  2020-03-05 12:49   ` Hannes Reinecke
@ 2020-03-05 13:52     ` John Garry
  0 siblings, 0 replies; 35+ messages in thread
From: John Garry @ 2020-03-05 13:52 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, jejb, martin.petersen, ming.lei,
	bvanassche, don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev

Hi Hannes,

>> +
>> +void blk_mq_exit_shared_sbitmap(struct blk_mq_tag_set *tag_set)
>> +{
>> +    sbitmap_queue_free(&tag_set->__bitmap_tags);
>> +    sbitmap_queue_free(&tag_set->__breserved_tags);
>> +}
>> +
>> +struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
>> +                     unsigned int total_tags,
>>                        unsigned int reserved_tags,
>>                        int node, int alloc_policy)
>>   {
>> @@ -480,6 +506,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int 
>> total_tags,
>>   void blk_mq_free_tags(struct blk_mq_tags *tags)
>>   {
>> +    /* Do not use tags->{bitmap, breserved}_tags */
>>       sbitmap_queue_free(&tags->__bitmap_tags);
>>       sbitmap_queue_free(&tags->__breserved_tags);
>>       kfree(tags);
> Hmm. This is essentially the same as blk_mq_exit_shared_sbitmap().

It does the same thing, but takes different argument types: struct 
blk_mq_tag_set * vs struct blk_mq_tags *

> Please call into that function or clarify the relationship between both.
> And modify the comment; it's not about the 'use' but rather the freeing 

ok

> of the structures, and ->bitmap/breserved just point to the __ variants.

Please note that, compared to v5, I had to change blk_mq_init_tags() so
that it always initialises blk_mq_tags.__{bitmap, breserved}_tags. The
reason is that the following would otherwise be broken for scheduler tags
when BLK_MQ_F_TAG_HCTX_SHARED is set, ***:

struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set, ...)
{
	struct blk_mq_tags *tags;
	...

	tags->nr_tags = total_tags;
	tags->nr_reserved_tags = reserved_tags;

	if (blk_mq_is_sbitmap_shared(set)) ***
		return tags;

	if (blk_mq_init_bitmap_tags(tags, node, alloc_policy) < 0) {
		...
	}
	return tags;
}

Keeping the blk_mq_is_sbitmap_shared() check would mean that we never
call blk_mq_init_bitmap_tags() for scheduler tags when
BLK_MQ_F_TAG_HCTX_SHARED is set.

As such, I removed the blk_mq_is_sbitmap_shared() check, and the
blk_mq_tags bitmaps are still always initialised. Hence we still need to
free them.

And as I noted in the hctx_tags_bitmap_show() change (patch #5), we 
could actually reuse blk_mq_tags.__bitmap_tags.sb in that function 
instead of using a local temp sbitmap.

> 
>> @@ -538,6 +565,11 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx 
>> *hctx,
>>       return 0;
>>   }
>> +void blk_mq_tag_resize_shared_sbitmap(struct blk_mq_tag_set *set, 
>> unsigned int size)
>> +{
>> +    sbitmap_queue_resize(&set->__bitmap_tags, size - 
>> set->reserved_tags);
>> +}
>> +

[...]

>>       return 0;
>> +out_free_mq_rq_maps:
>> +    for (i = 0; i < set->nr_hw_queues; i++)
>> +        blk_mq_free_rq_map(set->tags[i]);
>>   out_free_mq_map:
>>       for (i = 0; i < set->nr_maps; i++) {
>>           kfree(set->map[i].mq_map);
>> @@ -3170,6 +3189,9 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set 
>> *set)
>>       for (i = 0; i < set->nr_hw_queues; i++)
>>           blk_mq_free_map_and_requests(set, i);
>> +    if (blk_mq_is_sbitmap_shared(set))
>> +        blk_mq_exit_shared_sbitmap(set);
>> +
>>       for (j = 0; j < set->nr_maps; j++) {
>>           kfree(set->map[j].mq_map);
>>           set->map[j].mq_map = NULL;
>> @@ -3206,6 +3228,8 @@ int blk_mq_update_nr_requests(struct 
>> request_queue *q, unsigned int nr)
>>           if (!hctx->sched_tags) {
>>               ret = blk_mq_tag_update_depth(hctx, &hctx->tags, nr,
>>                               false);
>> +            if (!ret && blk_mq_is_sbitmap_shared(set))
>> +                blk_mq_tag_resize_shared_sbitmap(set, nr);
>>           } else {
>>               ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
>>                               nr, true);
>> diff --git a/block/blk-mq.h b/block/blk-mq.h
>> index 10bfdfb494fa..dde2d29f0ce5 100644
>> --- a/block/blk-mq.h
>> +++ b/block/blk-mq.h
>> @@ -158,6 +158,11 @@ struct blk_mq_alloc_data {
>>       struct blk_mq_hw_ctx *hctx;
>>   };
>> +static inline bool blk_mq_is_sbitmap_shared(struct blk_mq_tag_set 
>> *tag_set)
>> +{
>> +    return tag_set->flags & BLK_MQ_F_TAG_HCTX_SHARED;
>> +}
>> +
> When is that set?

It is set later for SCSI hosts in scsi_mq_setup_tags(), when
.host_tagset is set.

> And to my understanding it contingent on the ->bitmap pointer _not_ 
> pointing to ->__bitmap.
> Can we use this as a test and drop this flag?

Maybe, but we need some flag passed to blk_mq_alloc_tag_set() to request
use of the shared sbitmap.
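
Something like the below is what I mean - just a sketch of a driver
requesting the shared sbitmap at tag set allocation time, using the
BLK_MQ_F_TAG_HCTX_SHARED flag from patch 4 (variable names illustrative;
ops, cmd_size, etc. omitted for brevity):

	set->nr_hw_queues = nr_reply_queues;	/* expose all HW queues */
	set->queue_depth = can_queue;		/* hostwide queue depth */
	set->numa_node = NUMA_NO_NODE;
	set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_TAG_HCTX_SHARED;
	ret = blk_mq_alloc_tag_set(set);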

Thanks,
John

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 06/10] scsi: Add template flag 'host_tagset'
  2020-03-05 11:54 ` [PATCH RFC v6 06/10] scsi: Add template flag 'host_tagset' John Garry
@ 2020-03-06 11:12   ` John Garry
  0 siblings, 0 replies; 35+ messages in thread
From: John Garry @ 2020-03-06 11:12 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, sumit.saxena, hch, kashyap.desai,
	shivasharan.srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

On 05/03/2020 11:54, John Garry wrote:
> From: Hannes Reinecke <hare@suse.com>
> 
> Add a host template flag 'host_tagset' so that a hostwide tagset can be
> shared across multiple reply queues after the SCSI device's reply queues
> are converted to blk-mq hw queues.

We should also change the comment about Scsi_host.nr_hw_queues in
include/scsi/scsi_host.h, like this:

* Note: it is assumed that each hardware queue has a queue depth of
* can_queue. In other words, the total queue depth per host
-* is nr_hw_queues * can_queue.
+* is nr_hw_queues * can_queue. However, in the case of .host_tagset
+* being set, the total queue depth per host is can_queue.

> 
> Signed-off-by: Hannes Reinecke <hare@suse.com>
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>   drivers/scsi/scsi_lib.c  | 2 ++
>   include/scsi/scsi_host.h | 3 +++
>   2 files changed, 5 insertions(+)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 610ee41fa54c..84788ccc2672 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1901,6 +1901,8 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
>   	shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
>   	shost->tag_set.flags |=
>   		BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
> +	if (shost->hostt->host_tagset)
> +		shost->tag_set.flags |= BLK_MQ_F_TAG_HCTX_SHARED;
>   	shost->tag_set.driver_data = shost;
>   
>   	return blk_mq_alloc_tag_set(&shost->tag_set);
> diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
> index f577647bf5f2..4fd0af0883dd 100644
> --- a/include/scsi/scsi_host.h
> +++ b/include/scsi/scsi_host.h
> @@ -429,6 +429,9 @@ struct scsi_host_template {
>   	/* True if the low-level driver supports blk-mq only */
>   	unsigned force_blk_mq:1;
>   
> +	/* True if the host uses host-wide tagspace */
> +	unsigned host_tagset:1;
> +
>   	/*
>   	 * Countdown for host blocking with no commands outstanding.
>   	 */
> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-03-05 11:54 ` [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters " John Garry
@ 2020-04-07 11:14   ` Kashyap Desai
  2020-04-08  9:33     ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Kashyap Desai @ 2020-04-07 11:14 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, ming.lei, bvanassche,
	hare, don.brace, Sumit Saxena, hch, Shivasharan Srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
> @@ -373,24 +373,24 @@ megasas_get_msix_index(struct megasas_instance
> *instance,  {
>  	int sdev_busy;
>
> -	/* nr_hw_queue = 1 for MegaRAID */
> -	struct blk_mq_hw_ctx *hctx =
> -		scmd->device->request_queue->queue_hw_ctx[0];
> +	struct blk_mq_hw_ctx *hctx = scmd->request->mq_hctx;

Hi John,

There is one outstanding patch which will eventually remove device_busy
from sdev. To fix this interface, we may have to track the per-scsi-device
outstanding command count within the driver.
For my testing I used the below, since the interface is still available:

        sdev_busy = atomic_read(&scmd->device->device_busy);

We have done some level of testing to understand the performance impact on
SAS SSD and HDD setups. Here are my findings.
My test system used a two-socket Intel Skylake/Lewisburg/Purley platform.
Output of numactl --hardware:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41
42 43 44 45 46 47 48 49 50 51 52 53
node 0 size: 31820 MB
node 0 free: 21958 MB
node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 1 size: 32247 MB
node 1 free: 21068 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10


64 HDD setup -

With higher QD and io scheduler = mq-deadline, shared host tag is not
scaling well. If I use ioscheduler = none, I can see consistent 2.0M IOPS.
This issue is seen only with the RFC. Without the RFC, mq-deadline scales
up to 2.0M IOPS.

Perf Top result of RFC - (IOPS = 1.4M IOPS)

   78.20%  [kernel]        [k] native_queued_spin_lock_slowpath
     1.46%  [kernel]        [k] sbitmap_any_bit_set
     1.14%  [kernel]        [k] blk_mq_run_hw_queue
     0.90%  [kernel]        [k] _mix_pool_bytes
     0.63%  [kernel]        [k] _raw_spin_lock
     0.57%  [kernel]        [k] blk_mq_run_hw_queues
     0.56%  [megaraid_sas]  [k] complete_cmd_fusion
     0.54%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
     0.50%  [kernel]        [k] dd_has_work
     0.38%  [kernel]        [k] _raw_spin_lock_irqsave
     0.36%  [kernel]        [k] gup_pgd_range
     0.35%  [megaraid_sas]  [k] megasas_build_ldio_fusion
     0.31%  [kernel]        [k] io_submit_one
     0.29%  [kernel]        [k] hctx_lock
     0.26%  [kernel]        [k] try_to_grab_pending
     0.24%  [kernel]        [k] scsi_queue_rq
     0.22%  fio             [.] __fio_gettime
     0.22%  [kernel]        [k] insert_work
     0.20%  [kernel]        [k] native_irq_return_iret

Perf top without RFC driver - (IOPS = 2.0 M IOPS)

    58.40%  [kernel]          [k] native_queued_spin_lock_slowpath
     2.06%  [kernel]          [k] _mix_pool_bytes
     1.38%  [kernel]          [k] _raw_spin_lock_irqsave
     0.97%  [kernel]          [k] _raw_spin_lock
     0.91%  [kernel]          [k] scsi_queue_rq
     0.82%  [kernel]          [k] __sbq_wake_up
     0.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
     0.74%  [kernel]          [k] scsi_mq_get_budget
     0.61%  [kernel]          [k] gup_pgd_range
     0.58%  [kernel]          [k] aio_complete_rw
     0.52%  [kernel]          [k] elv_rb_add
     0.50%  [kernel]          [k] llist_add_batch
     0.50%  [kernel]          [k] native_irq_return_iret
     0.48%  [kernel]          [k] blk_rq_map_sg
     0.48%  fio               [.] __fio_gettime
     0.47%  [kernel]          [k] blk_mq_get_tag
     0.44%  [kernel]          [k] blk_mq_dispatch_rq_list
     0.40%  fio               [.] io_u_queued_complete
     0.39%  fio               [.] get_io_u


If you want me to test any follow-up patch, please let me know.  BTW, we
also want to provide a module parameter for users to switch back to the
older nr_hw_queue = 1 mode. I will work on that part.

24 SSD setup -

I am able to see that performance with and without the RFC is almost the
same. There is one specific drop, but that is a generic kernel issue, not
related to the RFC.
We can discuss this issue separately. -

The 5.6 kernel is not able to scale very well if there is a heavy
outstanding IO load from the application.
Example -
A 24 SSD setup with BS = 8K and QD = 128 gives 1.73M IOPS, which is the
h/w max, but at QD = 256 it gives 1.4M IOPS. It looks like there is some
overhead in finding free tags at the sdev or shost level which leads to
drops in IOPS.

Kashyap

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-07 11:14   ` Kashyap Desai
@ 2020-04-08  9:33     ` John Garry
  2020-04-08  9:59       ` Kashyap Desai
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-04-08  9:33 UTC (permalink / raw)
  To: Kashyap Desai, axboe, jejb, martin.petersen, ming.lei,
	bvanassche, hare, don.brace, Sumit Saxena, hch,
	Shivasharan Srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

On 07/04/2020 12:14, Kashyap Desai wrote:
>> --- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
>> +++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
>> @@ -373,24 +373,24 @@ megasas_get_msix_index(struct megasas_instance
>> *instance,  {
>>   	int sdev_busy;
>>
>> -	/* nr_hw_queue = 1 for MegaRAID */
>> -	struct blk_mq_hw_ctx *hctx =
>> -		scmd->device->request_queue->queue_hw_ctx[0];
>> +	struct blk_mq_hw_ctx *hctx = scmd->request->mq_hctx;
> 

Hi Kashyap,

> 
> There is one outstanding patch which will eventually remove device_busy
> from sdev. To fix this interface, we may have to track per scsi device
> outstanding within a driver.
> For my testing I used below since we still have below interface available.
> 
>          sdev_busy = atomic_read(&scmd->device->device_busy);

So please confirm that this is your change in megasas_get_msix_index():

- sdev_busy = atomic_read(&hctx->nr_active);
+ sdev_busy = atomic_read(&scmd->device->device_busy);

> 
> We have done some level of testing to know performance impact on SAS SSDs
> and HDD setup. Here is my finding -
> My testing used - Two socket Intel Skylake/Lewisburg/Purley
> Output of numactl --hardware
> 
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39 40 41
> 42 43 44 45 46 47 48 49 50 51 52 53
> node 0 size: 31820 MB
> node 0 free: 21958 MB
> node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54 55
> 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> node 1 size: 32247 MB
> node 1 free: 21068 MB
> node distances:
> node   0   1
>    0:  10  21
>    1:  21  10
> 
> 
> 64 HDD setup -
> 
> With higher QD 

what's QD?

> and io scheduler = mq-deadline, shared host tag is not
> scaling well. If I use ioscheduler = none, I can see consistent 2.0M IOPs.
> This issue is seen only with RFC. Without RFC mq-deadline scales up to
> 2.0M IOPS.

I didn't try any scheduler. I can have a look at that.

> 
> Perf Top result of RFC - (IOPS = 1.4M IOPS)
> 
>     78.20%  [kernel]        [k] native_queued_spin_lock_slowpath
>       1.46%  [kernel]        [k] sbitmap_any_bit_set
>       1.14%  [kernel]        [k] blk_mq_run_hw_queue
>       0.90%  [kernel]        [k] _mix_pool_bytes
>       0.63%  [kernel]        [k] _raw_spin_lock
>       0.57%  [kernel]        [k] blk_mq_run_hw_queues
>       0.56%  [megaraid_sas]  [k] complete_cmd_fusion
>       0.54%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
>       0.50%  [kernel]        [k] dd_has_work
>       0.38%  [kernel]        [k] _raw_spin_lock_irqsave
>       0.36%  [kernel]        [k] gup_pgd_range
>       0.35%  [megaraid_sas]  [k] megasas_build_ldio_fusion
>       0.31%  [kernel]        [k] io_submit_one
>       0.29%  [kernel]        [k] hctx_lock
>       0.26%  [kernel]        [k] try_to_grab_pending
>       0.24%  [kernel]        [k] scsi_queue_rq
>       0.22%  fio             [.] __fio_gettime
>       0.22%  [kernel]        [k] insert_work
>       0.20%  [kernel]        [k] native_irq_return_iret
> 
> Perf top without RFC driver - (IOPS = 2.0 M IOPS)
> 
>      58.40%  [kernel]          [k] native_queued_spin_lock_slowpath
>       2.06%  [kernel]          [k] _mix_pool_bytes
>       1.38%  [kernel]          [k] _raw_spin_lock_irqsave
>       0.97%  [kernel]          [k] _raw_spin_lock
>       0.91%  [kernel]          [k] scsi_queue_rq
>       0.82%  [kernel]          [k] __sbq_wake_up
>       0.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
>       0.74%  [kernel]          [k] scsi_mq_get_budget
>       0.61%  [kernel]          [k] gup_pgd_range
>       0.58%  [kernel]          [k] aio_complete_rw
>       0.52%  [kernel]          [k] elv_rb_add
>       0.50%  [kernel]          [k] llist_add_batch
>       0.50%  [kernel]          [k] native_irq_return_iret
>       0.48%  [kernel]          [k] blk_rq_map_sg
>       0.48%  fio               [.] __fio_gettime
>       0.47%  [kernel]          [k] blk_mq_get_tag
>       0.44%  [kernel]          [k] blk_mq_dispatch_rq_list
>       0.40%  fio               [.] io_u_queued_complete
>       0.39%  fio               [.] get_io_u
> 
> 
> If you want me to test any top up patch, please let me know.  BTW, we also
> wants to provide module parameter for user to switch back to older
> nr_hw_queue = 1 mode. I will work on that part.

ok, but I would just like to reiterate the point that you will not see 
the full benefit of blk-mq draining hw queues for cpu hotplug since you 
hide hw queues from blk-mq.

> 
> 24 SSD setup -
> 
> I am able to see performance using RFC and without RFC is almost same.
> There is one specific drop, but that is generic kernel issue. Not related
> to RFC.
> We can discuss this issue separately. -
> 
> 5.6 kernel is not able to scale very well if there is heavy outstanding
> from application.
> Example -
> 24 SSD setup and BS = 8K QD = 128 gives 1.73M IOPs which is h/w max, but
> at QD = 256 it gives 1.4M IOPs. It looks like there are some overhead  of
> finding free tags at sdev or shost level which leads drops in IOPs.
> 

Thanks for testing,
John

> 


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-08  9:33     ` John Garry
@ 2020-04-08  9:59       ` Kashyap Desai
  2020-04-17 16:46         ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Kashyap Desai @ 2020-04-08  9:59 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, ming.lei, bvanassche,
	hare, don.brace, Sumit Saxena, hch, Shivasharan Srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

> Hi Kashyap,
>
> >
> > There is one outstanding patch which will eventually remove
> > device_busy from sdev. To fix this interface, we may have to track per
> > scsi device outstanding within a driver.
> > For my testing I used below since we still have below interface
> > available.
> >
> >          sdev_busy = atomic_read(&scmd->device->device_busy);
>
> So please confirm that this is your change in megasas_get_msix_index():
>
> - sdev_busy = atomic_read(&hctx->nr_active);
> + sdev_busy = atomic_read(&scmd->device->device_busy);

That  is correct.

>
> >
> > We have done some level of testing to know performance impact on SAS
> > SSDs and HDD setup. Here is my finding - My testing used - Two socket
> > Intel Skylake/Lewisburg/Purley Output of numactl --hardware
> >
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39
> > 40 41
> > 42 43 44 45 46 47 48 49 50 51 52 53
> > node 0 size: 31820 MB
> > node 0 free: 21958 MB
> > node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54
> > 55
> > 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 1 size: 32247 MB
> > node 1 free: 21068 MB node distances:
> > node   0   1
> >    0:  10  21
> >    1:  21  10
> >
> >
> > 64 HDD setup -
> >
> > With higher QD
>
> what's QD?
>
> > and io scheduler = mq-deadline, shared host tag is not scaling well.
> > If I use ioscheduler = none, I can see consistent 2.0M IOPs.
> > This issue is seen only with RFC. Without RFC mq-deadline scales up to
> > 2.0M IOPS.
>
> I didn't try any scheduler. I can have a look at that.
>
> >
> > Perf Top result of RFC - (IOPS = 1.4M IOPS)
> >
> >     78.20%  [kernel]        [k] native_queued_spin_lock_slowpath
> >       1.46%  [kernel]        [k] sbitmap_any_bit_set
> >       1.14%  [kernel]        [k] blk_mq_run_hw_queue
> >       0.90%  [kernel]        [k] _mix_pool_bytes
> >       0.63%  [kernel]        [k] _raw_spin_lock
> >       0.57%  [kernel]        [k] blk_mq_run_hw_queues
> >       0.56%  [megaraid_sas]  [k] complete_cmd_fusion
> >       0.54%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
> >       0.50%  [kernel]        [k] dd_has_work
> >       0.38%  [kernel]        [k] _raw_spin_lock_irqsave
> >       0.36%  [kernel]        [k] gup_pgd_range
> >       0.35%  [megaraid_sas]  [k] megasas_build_ldio_fusion
> >       0.31%  [kernel]        [k] io_submit_one
> >       0.29%  [kernel]        [k] hctx_lock
> >       0.26%  [kernel]        [k] try_to_grab_pending
> >       0.24%  [kernel]        [k] scsi_queue_rq
> >       0.22%  fio             [.] __fio_gettime
> >       0.22%  [kernel]        [k] insert_work
> >       0.20%  [kernel]        [k] native_irq_return_iret
> >
> > Perf top without RFC driver - (IOPS = 2.0 M IOPS)
> >
> >      58.40%  [kernel]          [k] native_queued_spin_lock_slowpath
> >       2.06%  [kernel]          [k] _mix_pool_bytes
> >       1.38%  [kernel]          [k] _raw_spin_lock_irqsave
> >       0.97%  [kernel]          [k] _raw_spin_lock
> >       0.91%  [kernel]          [k] scsi_queue_rq
> >       0.82%  [kernel]          [k] __sbq_wake_up
> >       0.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
> >       0.74%  [kernel]          [k] scsi_mq_get_budget
> >       0.61%  [kernel]          [k] gup_pgd_range
> >       0.58%  [kernel]          [k] aio_complete_rw
> >       0.52%  [kernel]          [k] elv_rb_add
> >       0.50%  [kernel]          [k] llist_add_batch
> >       0.50%  [kernel]          [k] native_irq_return_iret
> >       0.48%  [kernel]          [k] blk_rq_map_sg
> >       0.48%  fio               [.] __fio_gettime
> >       0.47%  [kernel]          [k] blk_mq_get_tag
> >       0.44%  [kernel]          [k] blk_mq_dispatch_rq_list
> >       0.40%  fio               [.] io_u_queued_complete
> >       0.39%  fio               [.] get_io_u
> >
> >
> > If you want me to test any top up patch, please let me know.  BTW, we
> > also wants to provide module parameter for user to switch back to
> > older nr_hw_queue = 1 mode. I will work on that part.
>
> ok, but I would just like to reiterate the point that you will not see the
> full
> benefit of blk-mq draining hw queues for cpu hotplug since you hide hw
> queues from blk-mq.

Agree. We have done minimal testing using this RFC. We want to ACK this RFC
as long as the primary performance goal is achieved.

We have done full testing on nr_hw_queue = 1 (and that is what customers are
using), so we at least want to keep that interface available for customers
for some time (assuming they may find some performance gap between the two
interfaces which we may not have encountered during smoke testing).
Over a period of time, if nr_hw_queue = N works for all the IO profiles
(Broadcom will conduct full performance testing once the RFC is committed
upstream), we will share with customers the benefit of using
nr_hw_queues = N.

Kashyap

>
> >
> > 24 SSD setup -
> >
> > I am able to see performance using RFC and without RFC is almost same.
> > There is one specific drop, but that is generic kernel issue. Not
> > related to RFC.
> > We can discuss this issue separately. -
> >
> > 5.6 kernel is not able to scale very well if there is heavy
> > outstanding from application.
> > Example -
> > 24 SSD setup and BS = 8K QD = 128 gives 1.73M IOPs which is h/w max,
> > but at QD = 256 it gives 1.4M IOPs. It looks like there are some
> > overhead  of finding free tags at sdev or shost level which leads drops
> > in
> IOPs.
> >
>
> Thanks for testing,
> John
>
> >

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-08  9:59       ` Kashyap Desai
@ 2020-04-17 16:46         ` John Garry
  2020-04-20 17:47           ` Kashyap Desai
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-04-17 16:46 UTC (permalink / raw)
  To: Kashyap Desai, axboe, jejb, martin.petersen, ming.lei,
	bvanassche, hare, don.brace, Sumit Saxena, hch,
	Shivasharan Srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

On 08/04/2020 10:59, Kashyap Desai wrote:

Hi Kashyap,

> 
>>> We have done some level of testing to know performance impact on SAS
>>> SSDs and HDD setup. Here is my finding - My testing used - Two socket
>>> Intel Skylake/Lewisburg/Purley Output of numactl --hardware
>>>
>>> available: 2 nodes (0-1)
>>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39
>>> 40 41
>>> 42 43 44 45 46 47 48 49 50 51 52 53
>>> node 0 size: 31820 MB
>>> node 0 free: 21958 MB
>>> node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 54
>>> 55
>>> 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 1 size: 32247 MB
>>> node 1 free: 21068 MB node distances:
>>> node   0   1
>>>     0:  10  21
>>>     1:  21  10

Do you have other info, like an IRQ-CPU affinity dump and the controller
PCI vendor+device ID? Also, /proc/interrupts info after a run would be
good, like that supplied by Sumit here:

https://lore.kernel.org/linux-scsi/CAL2rwxotoWakFS4DPe85hZ4VAgd_zw8pL+B5ckHR9NwEf+-L=g@mail.gmail.com/

Are you enabling some special driver perf mode?

>>>
>>>
>>> 64 HDD setup -
>>>
>>> With higher QD
>> what's QD?
>>
>>> and io scheduler = mq-deadline, shared host tag is not scaling well.
>>> If I use ioscheduler = none, I can see consistent 2.0M IOPs.
>>> This issue is seen only with RFC. Without RFC mq-deadline scales up to
>>> 2.0M IOPS.

In theory, from this driver's perspective, we should not be making a
difference. That's after your change to use the sdev->device_busy count,
rather than the hctx nr_active count. As I understand it, that's the only
difference you made.

But I will try an IO scheduler on hisi_sas with SSDs to see if it makes any
difference.

>> I didn't try any scheduler. I can have a look at that.
>>
>>> Perf Top result of RFC - (IOPS = 1.4M IOPS)
>>>
>>>      78.20%  [kernel]        [k] native_queued_spin_lock_slowpath
>>>        1.46%  [kernel]        [k] sbitmap_any_bit_set
>>>        1.14%  [kernel]        [k] blk_mq_run_hw_queue
>>>        0.90%  [kernel]        [k] _mix_pool_bytes
>>>        0.63%  [kernel]        [k] _raw_spin_lock
>>>        0.57%  [kernel]        [k] blk_mq_run_hw_queues
>>>        0.56%  [megaraid_sas]  [k] complete_cmd_fusion
>>>        0.54%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
>>>        0.50%  [kernel]        [k] dd_has_work
>>>        0.38%  [kernel]        [k] _raw_spin_lock_irqsave
>>>        0.36%  [kernel]        [k] gup_pgd_range
>>>        0.35%  [megaraid_sas]  [k] megasas_build_ldio_fusion
>>>        0.31%  [kernel]        [k] io_submit_one
>>>        0.29%  [kernel]        [k] hctx_lock
>>>        0.26%  [kernel]        [k] try_to_grab_pending
>>>        0.24%  [kernel]        [k] scsi_queue_rq
>>>        0.22%  fio             [.] __fio_gettime
>>>        0.22%  [kernel]        [k] insert_work
>>>        0.20%  [kernel]        [k] native_irq_return_iret
>>>
>>> Perf top without RFC driver - (IOPS = 2.0 M IOPS)
>>>
>>>       58.40%  [kernel]          [k] native_queued_spin_lock_slowpath
>>>        2.06%  [kernel]          [k] _mix_pool_bytes
>>>        1.38%  [kernel]          [k] _raw_spin_lock_irqsave
>>>        0.97%  [kernel]          [k] _raw_spin_lock
>>>        0.91%  [kernel]          [k] scsi_queue_rq
>>>        0.82%  [kernel]          [k] __sbq_wake_up
>>>        0.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
>>>        0.74%  [kernel]          [k] scsi_mq_get_budget
>>>        0.61%  [kernel]          [k] gup_pgd_range
>>>        0.58%  [kernel]          [k] aio_complete_rw
>>>        0.52%  [kernel]          [k] elv_rb_add
>>>        0.50%  [kernel]          [k] llist_add_batch
>>>        0.50%  [kernel]          [k] native_irq_return_iret
>>>        0.48%  [kernel]          [k] blk_rq_map_sg
>>>        0.48%  fio               [.] __fio_gettime
>>>        0.47%  [kernel]          [k] blk_mq_get_tag
>>>        0.44%  [kernel]          [k] blk_mq_dispatch_rq_list
>>>        0.40%  fio               [.] io_u_queued_complete
>>>        0.39%  fio               [.] get_io_u
>>>
>>>
>>> If you want me to test any top up patch, please let me know.  BTW, we
>>> also wants to provide module parameter for user to switch back to
>>> older nr_hw_queue = 1 mode. I will work on that part.
>> ok, but I would just like to reiterate the point that you will not see the
>> full
>> benefit of blk-mq draining hw queues for cpu hotplug since you hide hw
>> queues from blk-mq.
> Agree. We have done  minimal testing using this RFC. We want to ACK this RFC
> as long as primary performance goal is achieved.
> 
> We have done full testing on nr_hw_queue =1 (and that is what customer is
> using) so we at least want to give that interface available for customer for
> some time (assuming they may find some performance gap between two interface
> which we may not have encountered during smoke testing.).
> Over a period of time, if nr_hw_queue = N works for (Broadcom will conduct
> full performance once RFC is committed in upstream) all the IO profiles, we
> will share the information with customer about benefit of using nr_hw_queues
> =  N.

Hopefully you can use nr_hw_queues = N always.

> 

Thanks,
john

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-17 16:46         ` John Garry
@ 2020-04-20 17:47           ` Kashyap Desai
  2020-04-21 12:35             ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Kashyap Desai @ 2020-04-20 17:47 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, ming.lei, bvanassche,
	hare, don.brace, Sumit Saxena, hch, Shivasharan Srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

> On 08/04/2020 10:59, Kashyap Desai wrote:
>
> Hi Kashyap,
>
> >
> >>> We have done some level of testing to know performance impact on SAS
> >>> SSDs and HDD setup. Here is my finding - My testing used - Two
> >>> socket Intel Skylake/Lewisburg/Purley Output of numactl --hardware
> >>>
> >>> available: 2 nodes (0-1)
> >>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 36 37 38 39
> >>> 40 41
> >>> 42 43 44 45 46 47 48 49 50 51 52 53
> >>> node 0 size: 31820 MB
> >>> node 0 free: 21958 MB
> >>> node 1 cpus: 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
> >>> 54
> >>> 55
> >>> 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 node 1 size: 32247
> >>> MB node 1 free: 21068 MB node distances:
> >>> node   0   1
> >>>     0:  10  21
> >>>     1:  21  10
>
> Do you have other info, like IRQ-CPU affinity dump and controller PCI
> vendor+device ID? Also /proc/interrupts info would be good after a run,
> like supplied by Sumit here:
>
> https://lore.kernel.org/linux-
> scsi/CAL2rwxotoWakFS4DPe85hZ4VAgd_zw8pL+B5ckHR9NwEf+-
> L=g@mail.gmail.com/

The controller performance mode is set to IOPS, which will create 8 extra
reply queues.
In this case the driver will create 72 (online CPUs) + 8 = 80 reply queues
(MSI-x vectors).

The first 8 vectors are non-managed and they are mapped to the local NUMA
node (node 1 here).

Here is IRQ-CPU affinity -

irq 256, cpu list 18-35,54-71
irq 257, cpu list 18-35,54-71
irq 258, cpu list 18-35,54-71
irq 259, cpu list 18-35,54-71
irq 260, cpu list 18-35,54-71
irq 261, cpu list 18-35,54-71
irq 262, cpu list 18-35,54-71
irq 263, cpu list 18-35,54-71
irq 264, cpu list 18
irq 265, cpu list 19
irq 266, cpu list 20
irq 267, cpu list 21
irq 268, cpu list 22
irq 269, cpu list 23
irq 270, cpu list 24
irq 271, cpu list 25
irq 272, cpu list 26
irq 273, cpu list 27
irq 274, cpu list 28
irq 275, cpu list 29
irq 276, cpu list 30
irq 277, cpu list 31
irq 278, cpu list 32
irq 279, cpu list 33
irq 280, cpu list 34
irq 281, cpu list 35
irq 282, cpu list 54
irq 283, cpu list 55
irq 284, cpu list 56
irq 285, cpu list 57
irq 286, cpu list 58
irq 287, cpu list 59
irq 288, cpu list 60
irq 289, cpu list 61
irq 290, cpu list 62
irq 291, cpu list 63
irq 292, cpu list 64
irq 293, cpu list 65
irq 294, cpu list 66
irq 295, cpu list 67
irq 296, cpu list 68
irq 297, cpu list 69
irq 298, cpu list 70
irq 299, cpu list 71
irq 300, cpu list 0
irq 301, cpu list 1
irq 302, cpu list 2
irq 303, cpu list 3
irq 304, cpu list 4
irq 305, cpu list 5
irq 306, cpu list 6
irq 307, cpu list 7
irq 308, cpu list 8
irq 309, cpu list 9
irq 310, cpu list 10
irq 311, cpu list 11
irq 312, cpu list 12
irq 313, cpu list 13
irq 314, cpu list 14
irq 315, cpu list 15
irq 316, cpu list 16
irq 317, cpu list 17
irq 318, cpu list 36
irq 319, cpu list 37
irq 320, cpu list 38
irq 321, cpu list 39
irq 322, cpu list 40
irq 323, cpu list 41
irq 324, cpu list 42
irq 325, cpu list 43
irq 326, cpu list 44
irq 327, cpu list 45
irq 328, cpu list 46
irq 329, cpu list 47
irq 330, cpu list 48
irq 331, cpu list 49
irq 332, cpu list 50
irq 333, cpu list 51
irq 334, cpu list 52
irq 335, cpu list 53

>
> Are you enabling some special driver perf mode?
>
> >>>
> >>>
> >>> 64 HDD setup -
> >>>
> >>> With higher QD
> >> what's OD?

I mean higher Queue Depth (QD). A higher queue depth is required because
congestion happens at the sdev->queue_depth and shost->can_queue levels.
If the outstanding IO does not hit the per-sdev queue depth or the shost
can_queue depth, we will not see the performance issue.
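
As a rough, purely illustrative sketch of the two throttling points described
above (this is not the actual SCSI mid-layer code; the function and parameter
names are made up for the example), an IO only dispatches while both the
per-device and per-host outstanding counts stay below their limits, so neither
limit is stressed unless enough IO is kept outstanding:

#include <stdbool.h>

/*
 * Illustrative only: a request is throttled once either the per-device
 * outstanding count reaches the sdev queue depth or the per-host
 * outstanding count reaches the host can_queue value.
 */
static bool io_may_dispatch(int sdev_outstanding, int sdev_queue_depth,
			    int host_outstanding, int host_can_queue)
{
	if (sdev_outstanding >= sdev_queue_depth)
		return false;	/* congestion at the sdev->queue_depth level */
	if (host_outstanding >= host_can_queue)
		return false;	/* congestion at the shost->can_queue level */
	return true;
}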

> >>
> >>> and io scheduler = mq-deadline, shared host tag is not scaling well.
> >>> If I use ioscheduler = none, I can see consistent 2.0M IOPs.
> >>> This issue is seen only with RFC. Without RFC mq-deadline scales up to
> >>> 2.0M IOPS.
>
> In theory, from this driver's perspective, we should not be making a
> difference. That's after your change to use the sdev->device_busy count,
> rather than the hctx nr_active count. As I understand, that's the only
> difference you made.
>
> But I will try an IO scheduler on hisi sas for ssd to see if any
> difference.
>
> >> I didn't try any scheduler. I can have a look at that.
> >>
> >>> Perf Top result of RFC - (IOPS = 1.4M IOPS)
> >>>
> >>>      78.20%  [kernel]        [k] native_queued_spin_lock_slowpath
> >>>        1.46%  [kernel]        [k] sbitmap_any_bit_set
> >>>        1.14%  [kernel]        [k] blk_mq_run_hw_queue
> >>>        0.90%  [kernel]        [k] _mix_pool_bytes
> >>>        0.63%  [kernel]        [k] _raw_spin_lock
> >>>        0.57%  [kernel]        [k] blk_mq_run_hw_queues
> >>>        0.56%  [megaraid_sas]  [k] complete_cmd_fusion
> >>>        0.54%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
> >>>        0.50%  [kernel]        [k] dd_has_work
> >>>        0.38%  [kernel]        [k] _raw_spin_lock_irqsave
> >>>        0.36%  [kernel]        [k] gup_pgd_range
> >>>        0.35%  [megaraid_sas]  [k] megasas_build_ldio_fusion
> >>>        0.31%  [kernel]        [k] io_submit_one
> >>>        0.29%  [kernel]        [k] hctx_lock
> >>>        0.26%  [kernel]        [k] try_to_grab_pending
> >>>        0.24%  [kernel]        [k] scsi_queue_rq
> >>>        0.22%  fio             [.] __fio_gettime
> >>>        0.22%  [kernel]        [k] insert_work
> >>>        0.20%  [kernel]        [k] native_irq_return_iret
> >>>
> >>> Perf top without RFC driver - (IOPS = 2.0 M IOPS)
> >>>
> >>>       58.40%  [kernel]          [k] native_queued_spin_lock_slowpath
> >>>        2.06%  [kernel]          [k] _mix_pool_bytes
> >>>        1.38%  [kernel]          [k] _raw_spin_lock_irqsave
> >>>        0.97%  [kernel]          [k] _raw_spin_lock
> >>>        0.91%  [kernel]          [k] scsi_queue_rq
> >>>        0.82%  [kernel]          [k] __sbq_wake_up
> >>>        0.77%  [kernel]          [k] _raw_spin_unlock_irqrestore
> >>>        0.74%  [kernel]          [k] scsi_mq_get_budget
> >>>        0.61%  [kernel]          [k] gup_pgd_range
> >>>        0.58%  [kernel]          [k] aio_complete_rw
> >>>        0.52%  [kernel]          [k] elv_rb_add
> >>>        0.50%  [kernel]          [k] llist_add_batch
> >>>        0.50%  [kernel]          [k] native_irq_return_iret
> >>>        0.48%  [kernel]          [k] blk_rq_map_sg
> >>>        0.48%  fio               [.] __fio_gettime
> >>>        0.47%  [kernel]          [k] blk_mq_get_tag
> >>>        0.44%  [kernel]          [k] blk_mq_dispatch_rq_list
> >>>        0.40%  fio               [.] io_u_queued_complete
> >>>        0.39%  fio               [.] get_io_u
> >>>
> >>>
> >>> If you want me to test any top up patch, please let me know.  BTW, we
> >>> also wants to provide module parameter for user to switch back to
> >>> older nr_hw_queue = 1 mode. I will work on that part.
> >> ok, but I would just like to reiterate the point that you will not see
> >> the
> >> full
> >> benefit of blk-mq draining hw queues for cpu hotplug since you hide hw
> >> queues from blk-mq.
> > Agree. We have done  minimal testing using this RFC. We want to ACK this
> RFC
> > as long as primary performance goal is achieved.
> >
> > We have done full testing on nr_hw_queue =1 (and that is what customer
> > is
> > using) so we at least want to give that interface available for customer
> > for
> > some time (assuming they may find some performance gap between two
> interface
> > which we may not have encountered during smoke testing.).
> > Over a period of time, if nr_hw_queue = N works for (Broadcom will
> > conduct
> > full performance once RFC is committed in upstream) all the IO profiles,
> > we
> > will share the information with customer about benefit of using
> nr_hw_queues
> > =  N.
>
> Hopefully you can use nr_hw_queues = N always.
>
> >
>
> Thanks,
> john

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-20 17:47           ` Kashyap Desai
@ 2020-04-21 12:35             ` John Garry
  2020-04-22 18:59               ` Kashyap Desai
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-04-21 12:35 UTC (permalink / raw)
  To: Kashyap Desai, axboe, jejb, martin.petersen, ming.lei,
	bvanassche, hare, don.brace, Sumit Saxena, hch,
	Shivasharan Srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

On 20/04/2020 18:47, Kashyap Desai wrote:
>> ther info, like IRQ-CPU affinity dump and controller PCI
>> vendor+device ID? Also /proc/interrupts info would be good after a run,
>> like supplied by Sumit here:
>>
>> https://lore.kernel.org/linux-
>> scsi/CAL2rwxotoWakFS4DPe85hZ4VAgd_zw8pL+B5ckHR9NwEf+-
>> L=g@mail.gmail.com/
> Controller performance mode is = IOPs which will create 8 extra reply
> queues.
> In this case it is 72 online CPU + 8 = 80 reply queue (MSIx) driver will
> create.
> 
> First 8 vectors are non-managed and they are mapped to local numa node -1.
> 
> Here is IRQ-CPU affinity -
> 
> irq 256, cpu list 18-35,54-71
> irq 257, cpu list 18-35,54-71
> irq 258, cpu list 18-35,54-71
> irq 259, cpu list 18-35,54-71
> irq 260, cpu list 18-35,54-71
> irq 261, cpu list 18-35,54-71
> irq 262, cpu list 18-35,54-71
> irq 263, cpu list 18-35,54-71
> irq 264, cpu list 18
> irq 265, cpu list 19
> irq 266, cpu list 20
> irq 267, cpu list 21
> irq 268, cpu list 22
> irq 269, cpu list 23
> irq 270, cpu list 24
> irq 271, cpu list 25
> irq 272, cpu list 26
> irq 273, cpu list 27
> irq 274, cpu list 28
> irq 275, cpu list 29
> irq 276, cpu list 30
> irq 277, cpu list 31
> irq 278, cpu list 32
> irq 279, cpu list 33
> irq 280, cpu list 34
> irq 281, cpu list 35
> irq 282, cpu list 54
> irq 283, cpu list 55
> irq 284, cpu list 56
> irq 285, cpu list 57
> irq 286, cpu list 58
> irq 287, cpu list 59
> irq 288, cpu list 60
> irq 289, cpu list 61
> irq 290, cpu list 62
> irq 291, cpu list 63
> irq 292, cpu list 64
> irq 293, cpu list 65
> irq 294, cpu list 66
> irq 295, cpu list 67
> irq 296, cpu list 68
> irq 297, cpu list 69
> irq 298, cpu list 70
> irq 299, cpu list 71
> irq 300, cpu list 0
> irq 301, cpu list 1
> irq 302, cpu list 2
> irq 303, cpu list 3
> irq 304, cpu list 4
> irq 305, cpu list 5
> irq 306, cpu list 6
> irq 307, cpu list 7
> irq 308, cpu list 8
> irq 309, cpu list 9
> irq 310, cpu list 10
> irq 311, cpu list 11
> irq 312, cpu list 12
> irq 313, cpu list 13
> irq 314, cpu list 14
> irq 315, cpu list 15
> irq 316, cpu list 16
> irq 317, cpu list 17
> irq 318, cpu list 36
> irq 319, cpu list 37
> irq 320, cpu list 38
> irq 321, cpu list 39
> irq 322, cpu list 40
> irq 323, cpu list 41
> irq 324, cpu list 42
> irq 325, cpu list 43
> irq 326, cpu list 44
> irq 327, cpu list 45
> irq 328, cpu list 46
> irq 329, cpu list 47
> irq 330, cpu list 48
> irq 331, cpu list 49
> irq 332, cpu list 50
> irq 333, cpu list 51
> irq 334, cpu list 52
> irq 335, cpu list 53
> 
>> Are you enabling some special driver perf mode?
>>

ok, thanks.

So I tested this on hisi_sas with x12 SAS SSDs, and performance with 
"mq-deadline" is comparable with "none" @ ~ 2M IOPS. But after a while 
performance drops a lot, to maybe 700K IOPS. Do you have a similar 
experience?

Anyway, I'll have a look.

Thanks,
John

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-21 12:35             ` John Garry
@ 2020-04-22 18:59               ` Kashyap Desai
  2020-04-22 21:28                 ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Kashyap Desai @ 2020-04-22 18:59 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, ming.lei, bvanassche,
	hare, don.brace, Sumit Saxena, hch, Shivasharan Srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

>
> So I tested this on hisi_sas with x12 SAS SSDs, and performance with "mq-
> deadline" is comparable with "none" @ ~ 2M IOPs. But after a while
> performance drops alot, to maybe 700K IOPS. Do you have a similar
> experience?

I am using mq-deadline only for HDDs. I have not tried it on SSDs since it is
not a useful scheduler for SSDs.

I noticed that when I used mq-deadline, the performance drop starts if I have
a larger number of drives.
I am running a <fio> script which has 64 drives, 64 threads, and all threads
are bound to the local numa node, which has 36 logical cores.
I noticed that the lock contention is in "dd_dispatch_request". I am not sure
why there is no penalty from the same lock in nr_hw_queue = 1 mode.

static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
        struct deadline_data *dd = hctx->queue->elevator->elevator_data;
        struct request *rq;

        spin_lock(&dd->lock);
        rq = __dd_dispatch_request(dd);
        spin_unlock(&dd->lock);

        return rq;
}

Here is perf report -

-    1.04%     0.99%  kworker/18:1H+k  [kernel.vmlinux]  [k]
native_queued_spin_lock_slowpath
     0.99% ret_from_fork
    -   kthread
      - worker_thread
         - 0.98% process_one_work
            - 0.98% __blk_mq_run_hw_queue
               - blk_mq_sched_dispatch_requests
                  - 0.98% blk_mq_do_dispatch_sched
                     - 0.97% dd_dispatch_request
                        + 0.97% queued_spin_lock_slowpath
+    1.04%     0.00%  kworker/18:1H+k  [kernel.vmlinux]  [k]
queued_spin_lock_slowpath
+    1.03%     0.95%  kworker/19:1H-k  [kernel.vmlinux]  [k]
native_queued_spin_lock_slowpath
+    1.03%     0.00%  kworker/19:1H-k  [kernel.vmlinux]  [k]
queued_spin_lock_slowpath
+    1.02%     0.97%  kworker/20:1H+k  [kernel.vmlinux]  [k]
native_queued_spin_lock_slowpath
+    1.02%     0.00%  kworker/20:1H+k  [kernel.vmlinux]  [k]
queued_spin_lock_slowpath
+    1.01%     0.96%  kworker/21:1H+k  [kernel.vmlinux]  [k]
native_queued_spin_lock_slowpath


>
> Anyway, I'll have a look.
>
> Thanks,
> John

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-22 18:59               ` Kashyap Desai
@ 2020-04-22 21:28                 ` John Garry
  2020-04-23 16:31                   ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-04-22 21:28 UTC (permalink / raw)
  To: Kashyap Desai, axboe, jejb, martin.petersen, ming.lei,
	bvanassche, hare, don.brace, Sumit Saxena, hch,
	Shivasharan Srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

On 22/04/2020 19:59, Kashyap Desai wrote:
>>

Hi Kashyap,

>> So I tested this on hisi_sas with x12 SAS SSDs, and performance with "mq-
>> deadline" is comparable with "none" @ ~ 2M IOPs. But after a while
>> performance drops alot, to maybe 700K IOPS. Do you have a similar
>> experience?
> 
> I am using mq-deadline only for HDD. I have not tried on SSD since it is not
> useful scheduler for SSDs.
> 

I ask as I only have SAS SSDs to test.

> I noticed that when I used mq-deadline, performance drop starts if I have
> more number of drives.
> I am running <fio> script which has 64 Drives, 64 thread and all treads are
> bound to local numa node which has 36 logical cores.
> I noticed that lock contention is in " dd_dispatch_request". I am not sure
> why there is a no penalty of same lock in nr_hw_queue  = 1 mode.

So this could be just a pre-existing issue of exposing multiple queues for 
SCSI HBAs combined with the mq-deadline iosched. I mean, that's really the 
only significant change in this series, apart from the shared sbitmap, 
and, at this point, I don't think that is the issue.

> 
> static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> {
>          struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>          struct request *rq;
> 
>          spin_lock(&dd->lock);

So if multiple hctx's are accessing this lock, then much contention is 
possible.

>          rq = __dd_dispatch_request(dd);
>          spin_unlock(&dd->lock);
> 
>          return rq;
> }
> 
> Here is perf report -
> 
> -    1.04%     0.99%  kworker/18:1H+k  [kernel.vmlinux]  [k]
> native_queued_spin_lock_slowpath
>       0.99% ret_from_fork
>      -   kthread
>        - worker_thread
>           - 0.98% process_one_work
>              - 0.98% __blk_mq_run_hw_queue
>                 - blk_mq_sched_dispatch_requests
>                    - 0.98% blk_mq_do_dispatch_sched
>                       - 0.97% dd_dispatch_request
>                          + 0.97% queued_spin_lock_slowpath
> +    1.04%     0.00%  kworker/18:1H+k  [kernel.vmlinux]  [k]
> queued_spin_lock_slowpath
> +    1.03%     0.95%  kworker/19:1H-k  [kernel.vmlinux]  [k]
> native_queued_spin_lock_slowpath
> +    1.03%     0.00%  kworker/19:1H-k  [kernel.vmlinux]  [k]
> queued_spin_lock_slowpath
> +    1.02%     0.97%  kworker/20:1H+k  [kernel.vmlinux]  [k]
> native_queued_spin_lock_slowpath
> +    1.02%     0.00%  kworker/20:1H+k  [kernel.vmlinux]  [k]
> queued_spin_lock_slowpath
> +    1.01%     0.96%  kworker/21:1H+k  [kernel.vmlinux]  [k]
> native_queued_spin_lock_slowpath
> 

I'll try to capture a perf report and compare to mine.

Thanks very much,
john

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-22 21:28                 ` John Garry
@ 2020-04-23 16:31                   ` John Garry
  2020-04-24 16:31                     ` Kashyap Desai
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-04-23 16:31 UTC (permalink / raw)
  To: Kashyap Desai, axboe, jejb, martin.petersen, ming.lei,
	bvanassche, hare, don.brace, Sumit Saxena, hch,
	Shivasharan Srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

> 
>>> So I tested this on hisi_sas with x12 SAS SSDs, and performance with 
>>> "mq-
>>> deadline" is comparable with "none" @ ~ 2M IOPs. But after a while
>>> performance drops alot, to maybe 700K IOPS. Do you have a similar
>>> experience?
>>
>> I am using mq-deadline only for HDD. I have not tried on SSD since it 
>> is not
>> useful scheduler for SSDs.
>>
> 
> I ask as I only have SAS SSDs to test.
> 
>> I noticed that when I used mq-deadline, performance drop starts if I have
>> more number of drives.
>> I am running <fio> script which has 64 Drives, 64 thread and all 
>> treads are
>> bound to local numa node which has 36 logical cores.
>> I noticed that lock contention is in " dd_dispatch_request". I am not 
>> sure
>> why there is a no penalty of same lock in nr_hw_queue  = 1 mode.
> 
> So this could be just pre-existing issue of exposing multiple queues for 
> SCSI HBAs combined with mq-deadline iosched. I mean, that's really the 
> only significant change in this series, apart from the shared sbitmap, 
> and, at this point, I don't think that is the issue.

As an experiment, I modified the mainline hisi_sas driver to expose hw 
queues and manage tags itself, and I see the same issue I mentioned:

Jobs: 12 (f=12): [R(12)] [14.8% done] [7592MB/0KB/0KB /s] [1943K/0/0 
iops] [eta
Jobs: 12 (f=12): [R(12)] [16.4% done] [7949MB/0KB/0KB /s] [2035K/0/0 
iops] [eta
Jobs: 12 (f=12): [R(12)] [18.0% done] [7940MB/0KB/0KB /s] [2033K/0/0 
iops] [eta
Jobs: 12 (f=12): [R(12)] [19.7% done] [7984MB/0KB/0KB /s] [2044K/0/0 
iops] [eta
Jobs: 12 (f=12): [R(12)] [21.3% done] [7984MB/0KB/0KB /s] [2044K/0/0 
iops] [eta
Jobs: 12 (f=12): [R(12)] [23.0% done] [2964MB/0KB/0KB /s] [759K/0/0 
iops] [eta 0
Jobs: 12 (f=12): [R(12)] [24.6% done] [2417MB/0KB/0KB /s] [619K/0/0 
iops] [eta 0
Jobs: 12 (f=12): [R(12)] [26.2% done] [2909MB/0KB/0KB /s] [745K/0/0 
iops] [eta 0
Jobs: 12 (f=12): [R(12)] [27.9% done] [2366MB/0KB/0KB /s] [606K/0/0 
iops] [eta 0

Occasionally I see "sched: RT throttling activated" around the time the 
throughput falls. I think the issue is the per-queue threaded irq handlers 
consuming too many cycles. With the "none" io scheduler, IOPS is 
flat at around 2M.

> 
>>
>> static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>> {
>>          struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>>          struct request *rq;
>>
>>          spin_lock(&dd->lock);
> 
> So if multiple hctx's are accessing this lock, then much contention 
> possible.
> 
>>          rq = __dd_dispatch_request(dd);
>>          spin_unlock(&dd->lock);
>>
>>          return rq;
>> }
>>
>> Here is perf report -
>>
>> -    1.04%     0.99%  kworker/18:1H+k  [kernel.vmlinux]  [k]
>> native_queued_spin_lock_slowpath
>>       0.99% ret_from_fork
>>      -   kthread
>>        - worker_thread
>>           - 0.98% process_one_work
>>              - 0.98% __blk_mq_run_hw_queue
>>                 - blk_mq_sched_dispatch_requests
>>                    - 0.98% blk_mq_do_dispatch_sched
>>                       - 0.97% dd_dispatch_request
>>                          + 0.97% queued_spin_lock_slowpath
>> +    1.04%     0.00%  kworker/18:1H+k  [kernel.vmlinux]  [k]
>> queued_spin_lock_slowpath
>> +    1.03%     0.95%  kworker/19:1H-k  [kernel.vmlinux]  [k]
>> native_queued_spin_lock_slowpath
>> +    1.03%     0.00%  kworker/19:1H-k  [kernel.vmlinux]  [k]
>> queued_spin_lock_slowpath
>> +    1.02%     0.97%  kworker/20:1H+k  [kernel.vmlinux]  [k]
>> native_queued_spin_lock_slowpath
>> +    1.02%     0.00%  kworker/20:1H+k  [kernel.vmlinux]  [k]
>> queued_spin_lock_slowpath
>> +    1.01%     0.96%  kworker/21:1H+k  [kernel.vmlinux]  [k]
>> native_queued_spin_lock_slowpath
>>
> 
> I'll try to capture a perf report and compare to mine.

Mine is spending a huge amount of time (circa 33% on a cpu servicing 
completion irqs) in mod_delayed_work_on():

--79.89%--sas_scsi_task_done |
    |--76.72%--scsi_mq_done
    |    |
    |     --76.53%--blk_mq_complete_request
    |    |
    |    |--74.81%--scsi_softirq_done
    |    |    |
    |    |     --73.91%--scsi_finish_command
    |    |    |
    |    |    |--72.11%--scsi_io_completion
    |    |    |    |
    |    |    |     --71.89%--scsi_end_request
    |    |    |    |
    |    |    |    |--40.82%--blk_mq_run_hw_queues
    |    |    |    |    |
    |    |    |    |    |--35.86%--blk_mq_run_hw_queue
    |    |    |    |    |    |
    |    |    |    |    |     --33.59%--__blk_mq_delay_run_hw_queue
    |    |    |    |    |    |
    |    |    |    |    |     --33.38%--kblockd_mod_delayed_work_on
    |    |    |    |    |          |
    |    |    |    |    |                --33.31%--mod_delayed_work_on

hmmmm...

Thanks,
John

^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-23 16:31                   ` John Garry
@ 2020-04-24 16:31                     ` Kashyap Desai
  2020-04-27 17:06                       ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Kashyap Desai @ 2020-04-24 16:31 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, ming.lei, bvanassche,
	hare, don.brace, Sumit Saxena, hch, Shivasharan Srikanteshwara
  Cc: chenxiang66, linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

> >
> >>> So I tested this on hisi_sas with x12 SAS SSDs, and performance with
> >>> "mq-
> >>> deadline" is comparable with "none" @ ~ 2M IOPs. But after a while
> >>> performance drops alot, to maybe 700K IOPS. Do you have a similar
> >>> experience?
> >>
> >> I am using mq-deadline only for HDD. I have not tried on SSD since it
> >> is not useful scheduler for SSDs.
> >>
> >
> > I ask as I only have SAS SSDs to test.
> >
> >> I noticed that when I used mq-deadline, performance drop starts if I
> >> have
> >> more number of drives.
> >> I am running <fio> script which has 64 Drives, 64 thread and all
> >> treads are
> >> bound to local numa node which has 36 logical cores.
> >> I noticed that lock contention is in " dd_dispatch_request". I am not
> >> sure
> >> why there is a no penalty of same lock in nr_hw_queue  = 1 mode.
> >
> > So this could be just pre-existing issue of exposing multiple queues for
> > SCSI HBAs combined with mq-deadline iosched. I mean, that's really the
> > only significant change in this series, apart from the shared sbitmap,
> > and, at this point, I don't think that is the issue.
>
> As an experiment, I modified hisi_sas mainline driver to expose hw
> queues and manage tags itself, and I see the same issue I mentioned:
>
> Jobs: 12 (f=12): [R(12)] [14.8% done] [7592MB/0KB/0KB /s] [1943K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [16.4% done] [7949MB/0KB/0KB /s] [2035K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [18.0% done] [7940MB/0KB/0KB /s] [2033K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [19.7% done] [7984MB/0KB/0KB /s] [2044K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [21.3% done] [7984MB/0KB/0KB /s] [2044K/0/0
> iops] [eta
> Jobs: 12 (f=12): [R(12)] [23.0% done] [2964MB/0KB/0KB /s] [759K/0/0
> iops] [eta 0
> Jobs: 12 (f=12): [R(12)] [24.6% done] [2417MB/0KB/0KB /s] [619K/0/0
> iops] [eta 0
> Jobs: 12 (f=12): [R(12)] [26.2% done] [2909MB/0KB/0KB /s] [745K/0/0
> iops] [eta 0
> Jobs: 12 (f=12): [R(12)] [27.9% done] [2366MB/0KB/0KB /s] [606K/0/0
> iops] [eta 0
>
> The odd time I see "sched: RT throttling activated" around the time the
> throughput falls. I think issue is the per-queue threaded irq threaded
> handlers consuming too many cycles. With "none" io scheduler, IOPS is
> flat at around 2M.
>
> >
> >>
> >> static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
> >> {
> >>          struct deadline_data *dd =
> >> hctx->queue->elevator->elevator_data;
> >>          struct request *rq;
> >>
> >>          spin_lock(&dd->lock);
> >
> > So if multiple hctx's are accessing this lock, then much contention
> > possible.
> >
> >>          rq = __dd_dispatch_request(dd);
> >>          spin_unlock(&dd->lock);
> >>
> >>          return rq;
> >> }
> >>
> >> Here is perf report -
> >>
> >> -    1.04%     0.99%  kworker/18:1H+k  [kernel.vmlinux]  [k]
> >> native_queued_spin_lock_slowpath
> >>       0.99% ret_from_fork
> >>      -   kthread
> >>        - worker_thread
> >>           - 0.98% process_one_work
> >>              - 0.98% __blk_mq_run_hw_queue
> >>                 - blk_mq_sched_dispatch_requests
> >>                    - 0.98% blk_mq_do_dispatch_sched
> >>                       - 0.97% dd_dispatch_request
> >>                          + 0.97% queued_spin_lock_slowpath
> >> +    1.04%     0.00%  kworker/18:1H+k  [kernel.vmlinux]  [k]
> >> queued_spin_lock_slowpath
> >> +    1.03%     0.95%  kworker/19:1H-k  [kernel.vmlinux]  [k]
> >> native_queued_spin_lock_slowpath
> >> +    1.03%     0.00%  kworker/19:1H-k  [kernel.vmlinux]  [k]
> >> queued_spin_lock_slowpath
> >> +    1.02%     0.97%  kworker/20:1H+k  [kernel.vmlinux]  [k]
> >> native_queued_spin_lock_slowpath
> >> +    1.02%     0.00%  kworker/20:1H+k  [kernel.vmlinux]  [k]
> >> queued_spin_lock_slowpath
> >> +    1.01%     0.96%  kworker/21:1H+k  [kernel.vmlinux]  [k]
> >> native_queued_spin_lock_slowpath
> >>
> >
> > I'll try to capture a perf report and compare to mine.
>
> Mine is spending a huge amount of time (circa 33% on a cpu servicing
> completion irqs) in mod_delayed_work_on():
>
> --79.89%--sas_scsi_task_done |
>     |--76.72%--scsi_mq_done
>     |    |
>     |     --76.53%--blk_mq_complete_request
>     |    |
>     |    |--74.81%--scsi_softirq_done
>     |    |    |
>     |    |     --73.91%--scsi_finish_command
>     |    |    |
>     |    |    |--72.11%--scsi_io_completion
>     |    |    |    |
>     |    |    |     --71.89%--scsi_end_request
>     |    |    |    |
>     |    |    |    |--40.82%--blk_mq_run_hw_queues
>     |    |    |    |    |
>     |    |    |    |    |--35.86%--blk_mq_run_hw_queue
>     |    |    |    |    |    |
>     |    |    |    |    |     --33.59%--__blk_mq_delay_run_hw_queue
>     |    |    |    |    |    |
>     |    |    |    |    |     --33.38%--kblockd_mod_delayed_work_on
>     |    |    |    |    |          |
>     |    |    |    |    |                --33.31%--mod_delayed_work_on
>
> hmmmm...

I did some more experiments. It looks like the issue is there with both the
<none> and <mq-deadline> schedulers.  Let me simplify what happens with
ioscheduler = <none>.

Old driver, which has nr_hw_queue = 1, and I issue IOs from <fio> at queue
depth = 128. We get 3.1M IOPS in this config. This eventually exhausts the
host can_queue.
Note - Very low contention in sbitmap_get()

-   23.58%     0.25%  fio              [kernel.vmlinux]            [k]
blk_mq_make_request
   - 23.33% blk_mq_make_request
      - 21.68% blk_mq_get_request
         - 20.19% blk_mq_get_tag
            + 10.08% prepare_to_wait_exclusive
            + 4.51% io_schedule
            - 3.59% __sbitmap_queue_get
               - 2.82% sbitmap_get
                    0.86% __sbitmap_get_word
                    0.75% _raw_spin_lock_irqsave
                    0.55% _raw_spin_unlock_irqrestore

Driver with the RFC, which has nr_hw_queue = N, and I issue IOs from <fio> at
queue depth = 128. We get 2.3M IOPS in this config. This eventually exhausts
the host can_queue.
Note - Very high contention in sbitmap_get()

-   42.39%     0.12%  fio              [kernel.vmlinux]            [k]
generic_make_request
   - 42.27% generic_make_request
      - 41.00% blk_mq_make_request
         - 38.28% blk_mq_get_request
            - 33.76% blk_mq_get_tag
               - 30.25% __sbitmap_queue_get
                  - 29.90% sbitmap_get
                     + 9.06% _raw_spin_lock_irqsave
                     + 7.94% _raw_spin_unlock_irqrestore
                     + 3.86% __sbitmap_get_word
                     + 1.78% call_function_single_interrupt
                     + 0.67% ret_from_intr
               + 1.69% io_schedule
                 0.59% prepare_to_wait_exclusive
                 0.55% __blk_mq_get_tag

In this particular case, I observed alloc_hint = zeros, which means
sbitmap_get is not able to find free tags from the hint. That may lead to
contention (see the sketch after the dump below).
This condition does not happen with the nr_hw_queue=1 (without RFC) driver.

alloc_hint=
{663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779, 377,
***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902, 2224, 3212, 836,
1892, 1669, 2420,
3415, 1904, 512, 3027, 4810, 2845, 4690, 712, 3105, 0, 0, 0, 3268, 4915,
3897, 1349, 547, 4, 733, 1765, 2068, 979, 51, 880, 0, 370, 3520, 2877, 4097,
418, 4501, 3717,
2893, 604, 508, 759, 3329, 4038, 4829, 715, 842, 1443, 556}
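
To illustrate why a zeroed hint hurts, here is a small self-contained sketch -
not the kernel's sbitmap code, just a simplified assumption of the idea: the
hint selects the starting word of the search, so CPUs with different hints
scan different parts of the bitmap, while CPUs whose hint is 0 all start from
word 0 and contend on the same words.

#include <limits.h>

#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

/* Illustrative only: scan for a free (zero) bit, starting at the hinted word. */
static int find_free_bit(const unsigned long *words, unsigned int nr_words,
			 unsigned int hint_word)
{
	unsigned int i, b;

	for (i = 0; i < nr_words; i++) {
		unsigned int w = (hint_word + i) % nr_words;

		for (b = 0; b < BITS_PER_WORD; b++) {
			if (!(words[w] & (1UL << b)))
				return w * BITS_PER_WORD + b;
		}
	}
	return -1;	/* all bits (tags) are in use */
}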

Driver with the RFC, which has nr_hw_queue = N, and I issue IOs from <fio> at
queue depth = 32. We get 3.1M IOPS in this config. This workload does *not*
exhaust the host can_queue.

-    5.07%     0.14%  fio              [kernel.vmlinux]  [k]
generic_make_request
   - 4.93% generic_make_request
      - 3.61% blk_mq_make_request
         - 2.04% blk_mq_get_request
            - 1.08% blk_mq_get_tag
               - 0.70% __sbitmap_queue_get
                    0.67% sbitmap_get

In summary, the RFC has a performance bottleneck in sbitmap_get() when the
outstanding IO per shost is about to exhaust the tags.  Without this RFC the
driver also works with nr_hw_queue = 1, but that case is managed very well.
I am not sure why it happens only with the shared host tags. Theoretically all
the hctx are sharing the same bitmap tags, which is the same as nr_hw_queue=1,
so why is contention only visible in the shared host tag case?

If you want to reproduce this issue, maybe you have to reduce can_queue in the
hisi_sas driver.

Kashyap

>
> Thanks,
> John

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-24 16:31                     ` Kashyap Desai
@ 2020-04-27 17:06                       ` John Garry
  2020-04-27 18:58                         ` Kashyap Desai
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-04-27 17:06 UTC (permalink / raw)
  To: Kashyap Desai, axboe, jejb, martin.petersen, ming.lei,
	bvanassche, hare, don.brace, Sumit Saxena, hch,
	Shivasharan Srikanteshwara
  Cc: chenxiang (M), linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

Hi Kashyap,

>> hmmmm...
> 
> I did some more experiments. It looks like issue is with both <none> and
> <mq-deadline> scheduler.  Let me simplify what happens with ioscheduler =
> <none>.

I know it's good to compare like-for-like, but, as I understand, "none" 
is more suited to an MQ host, while deadline is more suited to an SQ host.

> 
> Old Driver which has nr_hw_queue = 1 and I issue IOs from <fio>  queue depth
> = 128. We get 3.1M IOPS in this config. This eventually exhaust host
> can_queue.

So I think I need to find a point where we start to get throttled.

> Note - Very low contention in sbitmap_get()
> 
> -   23.58%     0.25%  fio              [kernel.vmlinux]            [k]
> blk_mq_make_request
>     - 23.33% blk_mq_make_request
>        - 21.68% blk_mq_get_request
>           - 20.19% blk_mq_get_tag
>              + 10.08% prepare_to_wait_exclusive
>              + 4.51% io_schedule
>              - 3.59% __sbitmap_queue_get
>                 - 2.82% sbitmap_get
>                      0.86% __sbitmap_get_word
>                      0.75% _raw_spin_lock_irqsave
>                      0.55% _raw_spin_unlock_irqrestore
> 
> Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>  queue
> depth = 128. We get 2.3 M IOPS in this config. This eventually exhaust host
> can_queue.
> Note - Very high contention in sbitmap_get()
> 
> -   42.39%     0.12%  fio              [kernel.vmlinux]            [k]
> generic_make_request
>     - 42.27% generic_make_request
>        - 41.00% blk_mq_make_request
>           - 38.28% blk_mq_get_request
>              - 33.76% blk_mq_get_tag
>                 - 30.25% __sbitmap_queue_get
>                    - 29.90% sbitmap_get
>                       + 9.06% _raw_spin_lock_irqsave
>                       + 7.94% _raw_spin_unlock_irqrestore
>                       + 3.86% __sbitmap_get_word
>                       + 1.78% call_function_single_interrupt
>                       + 0.67% ret_from_intr
>                 + 1.69% io_schedule
>                   0.59% prepare_to_wait_exclusive
>                   0.55% __blk_mq_get_tag
> 
> In this particular case, I observed alloc_hint = zeros which means,
> sbitmap_get is not able to find free tags from hint. That may lead to
> contention.
> This condition is not happening with nr_hw_queue=1 (without RFC) driver.
> 
> alloc_hint=
> {663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779, 377,
> ***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902, 2224, 3212, 836,
> 1892, 1669, 2420,
> 3415, 1904, 512, 3027, 4810, 2845, 4690, 712, 3105, 0, 0, 0, 3268, 4915,
> 3897, 1349, 547, 4, 733, 1765, 2068, 979, 51, 880, 0, 370, 3520, 2877, 4097,
> 418, 4501, 3717,
> 2893, 604, 508, 759, 3329, 4038, 4829, 715, 842, 1443, 556}
> 
> Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>  queue
> depth = 32. We get 3.1M IOPS in this config. This workload does *not*
> exhaust host can_queue.

Please ensure .host_tagset is set whenever nr_hw_queue = N. This is 
as per the RFC, and I don't think you modified that from the RFC for your 
test. But I just wanted to mention it to be crystal clear.

> 
> -    5.07%     0.14%  fio              [kernel.vmlinux]  [k]
> generic_make_request
>     - 4.93% generic_make_request
>        - 3.61% blk_mq_make_request
>           - 2.04% blk_mq_get_request
>              - 1.08% blk_mq_get_tag
>                 - 0.70% __sbitmap_queue_get
>                      0.67% sbitmap_get
> 
> In summary, RFC has some performance bottleneck in sbitmap_get () if
> outstanding per shost is about to exhaust.  Without this RFC also driver
> works in nr_hw_queue = 1, but that case is managed very well.
> I am not sure why it happens only with shared host tag ? Theoretically all
> the hctx is sharing the same bitmaptag which is same as nr_hw_queue=1, so
> why contention is only visible in shared host tag case.

Let me check this.

> 
> If you want to reproduce this issue, may be you have to reduce the can_queue
> in hisi_sas driver.
> 

Thanks,
John


^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-27 17:06                       ` John Garry
@ 2020-04-27 18:58                         ` Kashyap Desai
  2020-04-28 15:55                           ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Kashyap Desai @ 2020-04-27 18:58 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen, ming.lei, bvanassche,
	hare, don.brace, Sumit Saxena, hch, Shivasharan Srikanteshwara
  Cc: chenxiang (M), linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

> Hi Kashyap,
>
> >> hmmmm...
> >
> > I did some more experiments. It looks like issue is with both <none>
> > and <mq-deadline> scheduler.  Let me simplify what happens with
> > ioscheduler = <none>.
>
> I know it's good to compare like-for-like, but, as I understand, "none"
> is more suited for MQ host, while deadline is more suited for SQ host.

I am using <mq-deadline>, which is the MQ version of SQ deadline.
For now we can just consider the case without a scheduler.

I have some more findings. Maybe someone from the upstream community can
connect the dots.

#1. hctx_may_queue() throttles the IO at the hctx level. This is eventually
per-sdev throttling for the megaraid_sas driver because it creates only one
context - hctx0 - for each scsi device.

If the driver is using only one h/w queue, active_queues will always be steady.
In my test it was 64 threads, so active_queues=64.
Even though the <fio> threads are shared among the allowed cpumask
(cpus_allowed_policy=shared option in fio), active_queues will always be 64
because we have only one h/w queue.
All the logical cpus are mapped to one h/w queue. It means a thread moving
from one cpu to another will not change active_queues per hctx.

In the case of this RFC, active_queues are now per hctx and there are multiple
hctx, but the tags are shared. This can create unwanted throttling and
eventually more lock contention in sbitmap.
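
To put rough numbers on the fair-share formula used by hctx_may_queue()
(depth = max((bt->sb.depth + users - 1) / users, 4U)), here is a tiny
stand-alone example - the tag depth and user counts below are hypothetical,
only meant to show how the per-user allowance shrinks once many hctx are
counted as active users:

#include <stdio.h>

/* Same clamp as hctx_may_queue(): max((depth + users - 1) / users, 4U). */
static unsigned int fair_share(unsigned int depth, unsigned int users)
{
	unsigned int d = (depth + users - 1) / users;

	return d > 4 ? d : 4;
}

int main(void)
{
	/* 4096 shared tags, 64 active users (one per disk): 64 tags each. */
	printf("%u\n", fair_share(4096, 64));
	/* If every hctx of every request queue counts as a user: only 4. */
	printf("%u\n", fair_share(4096, 64 * 16));
	return 0;
}
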
I added the below patch and things improved a bit, but it is not a full fix.

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6..c708fbc 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -60,10 +60,12 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
  * For shared tag users, we track the number of currently active users
  * and attempt to provide a fair share of the tag depth for each of them.
  */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
+static inline bool hctx_may_queue(struct request_queue *q,
+                                 struct blk_mq_hw_ctx *hctx,
                                  struct sbitmap_queue *bt)
 {
-       unsigned int depth, users;
+       unsigned int depth, users, i, outstanding = 0;
+       struct blk_mq_hw_ctx *hctx_iter;

        if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
                return true;
@@ -84,14 +86,18 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
         * Allow at least some tags
         */
        depth = max((bt->sb.depth + users - 1) / users, 4U);
-       return atomic_read(&hctx->nr_active) < depth;
+
+       queue_for_each_hw_ctx(q, hctx_iter, i)
+               outstanding += atomic_read(&hctx_iter->nr_active);
+
+       return outstanding < depth;
 }

 static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
                            struct sbitmap_queue *bt)
 {
        if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
-           !hctx_may_queue(data->hctx, bt))


#2 - In the completion path, the scsi module calls blk_mq_run_hw_queues() upon
IO completion.  I am not sure if it is good to run all the h/w queues, or
whether running just the h/w queue of the current reference is good enough.
The below patch helped to reduce contention in hctx_lock().

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 610ee41..f72de2a 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -572,6 +572,7 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
        struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
        struct scsi_device *sdev = cmd->device;
        struct request_queue *q = sdev->request_queue;
+       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;

        if (blk_update_request(req, error, bytes))
                return true;
@@ -613,7 +614,7 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
            !list_empty(&sdev->host->starved_list))
                kblockd_schedule_work(&sdev->requeue_work);
        else
-               blk_mq_run_hw_queues(q, true);
+               blk_mq_run_hw_queue(mq_hctx, true);

        percpu_ref_put(&q->q_usage_counter);
        return false;

#3 -  __blk_mq_tag_idle() calls blk_mq_tag_wakeup_all(), which may not be
optimal for a shared queue.
There is a penalty if we are calling __sbq_wake_up() frequently. In the case
of nr_hw_queue = 1, things are better because there is one hctx and
hctx->state avoids multiple calls.
If blk_mq_tag_wakeup_all() is called from the hctx0 context, there is no need
to call it from hctx1, hctx2, etc.

I have added the below patch in my testing.

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6..5b331e5 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -53,7 +53,9 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)

        atomic_dec(&tags->active_queues);

-       blk_mq_tag_wakeup_all(tags, false);
+       /* TBD - Do this only for hctx->flags & BLK_MQ_F_TAG_HCTX_SHARED */
+       if (hctx->queue_num == 0)
+               blk_mq_tag_wakeup_all(tags, false);
 }

 /*


With all the above-mentioned changes, I see performance improve from 2.2M IOPS
to 2.7M on the same workload and profile.


Kashyap

>
> >
> > Old Driver which has nr_hw_queue = 1 and I issue IOs from <fio>  queue
> > depth = 128. We get 3.1M IOPS in this config. This eventually exhaust
> > host can_queue.
>
> So I think I need to find a point where we start to get throttled.
>
> > Note - Very low contention in sbitmap_get()
> >
> > -   23.58%     0.25%  fio              [kernel.vmlinux]            [k]
> > blk_mq_make_request
> >     - 23.33% blk_mq_make_request
> >        - 21.68% blk_mq_get_request
> >           - 20.19% blk_mq_get_tag
> >              + 10.08% prepare_to_wait_exclusive
> >              + 4.51% io_schedule
> >              - 3.59% __sbitmap_queue_get
> >                 - 2.82% sbitmap_get
> >                      0.86% __sbitmap_get_word
> >                      0.75% _raw_spin_lock_irqsave
> >                      0.55% _raw_spin_unlock_irqrestore
> >
> > Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>
> > queue depth = 128. We get 2.3 M IOPS in this config. This eventually
> > exhaust host can_queue.
> > Note - Very high contention in sbitmap_get()
> >
> > -   42.39%     0.12%  fio              [kernel.vmlinux]            [k]
> > generic_make_request
> >     - 42.27% generic_make_request
> >        - 41.00% blk_mq_make_request
> >           - 38.28% blk_mq_get_request
> >              - 33.76% blk_mq_get_tag
> >                 - 30.25% __sbitmap_queue_get
> >                    - 29.90% sbitmap_get
> >                       + 9.06% _raw_spin_lock_irqsave
> >                       + 7.94% _raw_spin_unlock_irqrestore
> >                       + 3.86% __sbitmap_get_word
> >                       + 1.78% call_function_single_interrupt
> >                       + 0.67% ret_from_intr
> >                 + 1.69% io_schedule
> >                   0.59% prepare_to_wait_exclusive
> >                   0.55% __blk_mq_get_tag
> >
> > In this particular case, I observed alloc_hint = zeros which means,
> > sbitmap_get is not able to find free tags from hint. That may lead to
> > contention.
> > This condition is not happening with nr_hw_queue=1 (without RFC) driver.
> >
> > alloc_hint=
> > {663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779, 377,
> > ***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902, 2224, 3212,
> > 836, 1892, 1669, 2420, 3415, 1904, 512, 3027, 4810, 2845, 4690, 712,
> > 3105, 0, 0, 0, 3268, 4915, 3897, 1349, 547, 4, 733, 1765, 2068, 979,
> > 51, 880, 0, 370, 3520, 2877, 4097, 418, 4501, 3717, 2893, 604, 508,
> > 759, 3329, 4038, 4829, 715, 842, 1443, 556}
> >
> > Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>
> > queue depth = 32. We get 3.1M IOPS in this config. This workload does
> > *not* exhaust host can_queue.
>
> Please ensure .host_tagset is set for whenever nr_hw_queue = N. This is as
> per
> RFC, and I don't think you modified from the RFC for your test.
> But I just wanted to mention that to be crystal clear.

Yes, I have two separate driver copies: one with the RFC change and another
without the RFC.
>
> >
> > -    5.07%     0.14%  fio              [kernel.vmlinux]  [k]
> > generic_make_request
> >     - 4.93% generic_make_request
> >        - 3.61% blk_mq_make_request
> >           - 2.04% blk_mq_get_request
> >              - 1.08% blk_mq_get_tag
> >                 - 0.70% __sbitmap_queue_get
> >                      0.67% sbitmap_get
> >
> > In summary, RFC has some performance bottleneck in sbitmap_get () if
> > outstanding per shost is about to exhaust.  Without this RFC also
> > driver works in nr_hw_queue = 1, but that case is managed very well.
> > I am not sure why it happens only with shared host tag ? Theoretically
> > all the hctx is sharing the same bitmaptag which is same as
> > nr_hw_queue=1, so why contention is only visible in shared host tag
> > case.
>
> Let me check this.
>
> >
> > If you want to reproduce this issue, may be you have to reduce the
> > can_queue in hisi_sas driver.
> >
>
> Thanks,
> John

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-27 18:58                         ` Kashyap Desai
@ 2020-04-28 15:55                           ` John Garry
  2020-04-29 11:29                             ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-04-28 15:55 UTC (permalink / raw)
  To: Kashyap Desai, axboe, jejb, martin.petersen, ming.lei,
	bvanassche, hare, don.brace, Sumit Saxena, hch,
	Shivasharan Srikanteshwara
  Cc: chenxiang (M), linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

Hi Kashyap,

> I am using <mq-deadline> which is MQ version of SQ deadline.
> For now we can just consider case without scheduler.
> 
> I have some more findings. May be someone from upstream community can
> connect the dots.
> 
> #1. hctx_may_queue() throttle the IO at hctx level. This is eventually per
> sdev throttling for megaraid_sas driver because It creates only one
> context - hctx0 for each scsi device.
> 
> If driver is using only one h/w queue,  active_queues will be always steady.
> In my test it was 64 thread, so active_queues=64.

So I figure that the 64 threads come from having 64 disks.

> Even though <fio> thread is shared among allowed cpumask
> (cpus_allowed_policy=shared option in fio),  active_queues will be always 64
> because we have only one h/w queue.
> All the logical cpus are mapped to one h/w queue. It means, thread moving
> from one cpu to another cpu will not change active_queues per hctx.
> 
> In case of this RFC, active_queues are now per hctx and there are multiple
> hctx, but tags are shared. 

Right, so we need a policy to divide up the shared tags across request 
queues, based on a principle of fairness.

> This can create unwanted throttling and
> eventually more lock contention in sbitmap.
> I added below patch and things improved a bit, but not a full proof.
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 586c9d6..c708fbc 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -60,10 +60,12 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>    * For shared tag users, we track the number of currently active users
>    * and attempt to provide a fair share of the tag depth for each of them.
>    */
> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> +static inline bool hctx_may_queue(struct request_queue *q,
> +                                 struct blk_mq_hw_ctx *hctx,
>                                    struct sbitmap_queue *bt)
>   {
> -       unsigned int depth, users;
> +       unsigned int depth, users, i, outstanding = 0;
> +       struct blk_mq_hw_ctx *hctx_iter;
> 
>          if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
>                  return true;
> @@ -84,14 +86,18 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx
> *hctx,
>           * Allow at least some tags
>           */
>          depth = max((bt->sb.depth + users - 1) / users, 4U);
> -       return atomic_read(&hctx->nr_active) < depth;
> +
> +       queue_for_each_hw_ctx(q, hctx_iter, i)
> +               outstanding += atomic_read(&hctx_iter->nr_active);
> +

OK,  I think that we need to find a cleaner way to do this.

> +       return outstanding < depth;
>   }
> 
>   static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
>                              struct sbitmap_queue *bt)
>   {
>          if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
> -           !hctx_may_queue(data->hctx, bt))
> 
> 
> #2 - In completion path, scsi module call blk_mq_run_hw_queues() upon IO
> completion.  I am not sure if it is good to run all the h/w queue or just
> h/w queue of current reference is good enough.
> Below patch helped to reduce contention in hcxt_lock().
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 610ee41..f72de2a 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -572,6 +572,7 @@ static bool scsi_end_request(struct request *req,
> blk_status_t error,
>          struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
>          struct scsi_device *sdev = cmd->device;
>          struct request_queue *q = sdev->request_queue;
> +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> 
>          if (blk_update_request(req, error, bytes))
>                  return true;
> @@ -613,7 +614,7 @@ static bool scsi_end_request(struct request *req,
> blk_status_t error,
>              !list_empty(&sdev->host->starved_list))
>                  kblockd_schedule_work(&sdev->requeue_work);
>          else
> -               blk_mq_run_hw_queues(q, true);
> +               blk_mq_run_hw_queue(mq_hctx, true);

Not sure on this. But, indeed, I found that running all the queues did add 
lots of extra load when enabling the deadline scheduler.

> 
>          percpu_ref_put(&q->q_usage_counter);
>          return false;
> 
> #3 -  __blk_mq_tag_idle() calls blk_mq_tag_wakeup_all which may not be
> optimal for shared queue.
> There is a penalty if we are calling __sbq_wake_up() frequently. In case of
> nr_hw_queue = 1, things are better because one hctx and hctx->state will
> avoid multiple calls.
> If blk_mq_tag_wakeup_all is called from hctx0 context, there is no need to
> call from hctx1, hctx2 etc.
> 
> I have added below patch in my testing.
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 586c9d6..5b331e5 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -53,7 +53,9 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
> 
>          atomic_dec(&tags->active_queues);
> 
> -       blk_mq_tag_wakeup_all(tags, false);
> +       /* TBD - Do this only for hctx->flags & BLK_MQ_F_TAG_HCTX_SHARED */
> +       if (hctx->queue_num == 0)
> +               blk_mq_tag_wakeup_all(tags, false);

ok, I see. But, again, I think we need to find a cleaner way to do this.

>   }
> 
>   /*
> 
> 
> With all above mentioned changes, I see performance improved from 2.2M IOPS
> to 2.7M on same workload and profile.

But still short of nr_hw_queue = 1, which got 3.1M IOPS, as below, right?

Thanks,
John

> 
>>
>>>
>>> Old Driver which has nr_hw_queue = 1 and I issue IOs from <fio>  queue
>>> depth = 128. We get 3.1M IOPS in this config. This eventually exhaust
>>> host can_queue.
>>
>> So I think I need to find a point where we start to get throttled.
>>
>>> Note - Very low contention in sbitmap_get()
>>>
>>> -   23.58%     0.25%  fio              [kernel.vmlinux]            [k]
>>> blk_mq_make_request
>>>      - 23.33% blk_mq_make_request
>>>         - 21.68% blk_mq_get_request
>>>            - 20.19% blk_mq_get_tag
>>>               + 10.08% prepare_to_wait_exclusive
>>>               + 4.51% io_schedule
>>>               - 3.59% __sbitmap_queue_get
>>>                  - 2.82% sbitmap_get
>>>                       0.86% __sbitmap_get_word
>>>                       0.75% _raw_spin_lock_irqsave
>>>                       0.55% _raw_spin_unlock_irqrestore
>>>
>>> Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>
>>> queue depth = 128. We get 2.3 M IOPS in this config. This eventually
>>> exhaust host can_queue.
>>> Note - Very high contention in sbitmap_get()
>>>
>>> -   42.39%     0.12%  fio              [kernel.vmlinux]            [k]
>>> generic_make_request
>>>      - 42.27% generic_make_request
>>>         - 41.00% blk_mq_make_request
>>>            - 38.28% blk_mq_get_request
>>>               - 33.76% blk_mq_get_tag
>>>                  - 30.25% __sbitmap_queue_get
>>>                     - 29.90% sbitmap_get
>>>                        + 9.06% _raw_spin_lock_irqsave
>>>                        + 7.94% _raw_spin_unlock_irqrestore
>>>                        + 3.86% __sbitmap_get_word
>>>                        + 1.78% call_function_single_interrupt
>>>                        + 0.67% ret_from_intr
>>>                  + 1.69% io_schedule
>>>                    0.59% prepare_to_wait_exclusive
>>>                    0.55% __blk_mq_get_tag
>>>
>>> In this particular case, I observed alloc_hint = zeros which means,
>>> sbitmap_get is not able to find free tags from hint. That may lead to
>>> contention.
>>> This condition is not happening with nr_hw_queue=1 (without RFC) driver.
>>>
>>> alloc_hint=
>>> {663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779, 377,
>>> ***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902, 2224, 3212,
>>> 836, 1892, 1669, 2420, 3415, 1904, 512, 3027, 4810, 2845, 4690, 712,
>>> 3105, 0, 0, 0, 3268, 4915, 3897, 1349, 547, 4, 733, 1765, 2068, 979,
>>> 51, 880, 0, 370, 3520, 2877, 4097, 418, 4501, 3717, 2893, 604, 508,
>>> 759, 3329, 4038, 4829, 715, 842, 1443, 556}
>>>
>>> Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>
>>> queue depth = 32. We get 3.1M IOPS in this config. This workload does
>>> *not* exhaust host can_queue.
>>
>> Please ensure .host_tagset is set for whenever nr_hw_queue = N. This is as
>> per
>> RFC, and I don't think you modified from the RFC for your test.
>> But I just wanted to mention that to be crystal clear.
> 
> Yes I have two separate driver copy. One with RFC change and another without
> RFC.
>>
>>>
>>> -    5.07%     0.14%  fio              [kernel.vmlinux]  [k]
>>> generic_make_request
>>>      - 4.93% generic_make_request
>>>         - 3.61% blk_mq_make_request
>>>            - 2.04% blk_mq_get_request
>>>               - 1.08% blk_mq_get_tag
>>>                  - 0.70% __sbitmap_queue_get
>>>                       0.67% sbitmap_get
>>>
>>> In summary, RFC has some performance bottleneck in sbitmap_get () if
>>> outstanding per shost is about to exhaust.  Without this RFC also
>>> driver works in nr_hw_queue = 1, but that case is managed very well.
>>> I am not sure why it happens only with shared host tag ? Theoretically
>>> all the hctx is sharing the same bitmaptag which is same as
>>> nr_hw_queue=1, so why contention is only visible in shared host tag
>>> case.
>>
>> Let me check this.
>>
>>>
>>> If you want to reproduce this issue, may be you have to reduce the
>>> can_queue in hisi_sas driver.
>>>
>>
>> Thanks,
>> John


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-28 15:55                           ` John Garry
@ 2020-04-29 11:29                             ` John Garry
  2020-04-29 15:50                               ` Kashyap Desai
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-04-29 11:29 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, Sumit Saxena, hch, Shivasharan Srikanteshwara,
	chenxiang (M),
	linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

On 28/04/2020 16:55, John Garry wrote:
> Hi Kashyap,
> 
>> I am using <mq-deadline> which is MQ version of SQ deadline.
>> For now we can just consider case without scheduler.
>>
>> I have some more findings. May be someone from upstream community can
>> connect the dots.
>>
>> #1. hctx_may_queue() throttle the IO at hctx level. This is eventually 
>> per
>> sdev throttling for megaraid_sas driver because It creates only one
>> context - hctx0 for each scsi device.
>>
>> If driver is using only one h/w queue,  active_queues will be always 
>> steady.
>> In my test it was 64 thread, so active_queues=64.
> 
> So I figure that 64 threads comes from 64 having disks.
> 
>> Even though <fio> thread is shared among allowed cpumask
>> (cpus_allowed_policy=shared option in fio),  active_queues will be 
>> always 64
>> because we have only one h/w queue.
>> All the logical cpus are mapped to one h/w queue. It means, thread moving
>> from one cpu to another cpu will not change active_queues per hctx.
>>
>> In case of this RFC, active_queues are now per hctx and there are 
>> multiple
>> hctx, but tags are shared. 
> 
> Right, so we need a policy to divide up the shared tags across request 
> queues, based on principle of fairness.
> 

Hi Kashyap,

How about this, which counts requests per request_queue for the shared 
sbitmap instead of per hctx (and, as such, does not need to aggregate 
them in hctx_may_queue()):

---->8

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index ba68d934e3ca..0c8adecba219 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -60,9 +60,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
   * For shared tag users, we track the number of currently active users
   * and attempt to provide a fair share of the tag depth for each of them.
   */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
+static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
  				  struct sbitmap_queue *bt)
  {
+	struct blk_mq_hw_ctx *hctx = data->hctx;
+	struct request_queue *q = data->q;
  	unsigned int depth, users;

  	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
@@ -84,14 +86,14 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
  	 * Allow at least some tags
  	 */
  	depth = max((bt->sb.depth + users - 1) / users, 4U);
-	return atomic_read(&hctx->nr_active) < depth;
+	return __blk_mq_active_requests(hctx, q) < depth;
  }

  static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
  			    struct sbitmap_queue *bt)
  {
  	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
-	    !hctx_may_queue(data->hctx, bt))
+	    !hctx_may_queue(data, bt))
  		return -1;
  	if (data->shallow_depth)
  		return __sbitmap_queue_get_shallow(bt, data->shallow_depth);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7e446f946e73..5bbd95e01f08 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -282,7 +282,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
  	} else {
  		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
  			rq_flags = RQF_MQ_INFLIGHT;
-			atomic_inc(&data->hctx->nr_active);
+			__blk_mq_inc_active_requests(data->hctx, data->q);
  		}
  		rq->tag = tag;
  		rq->internal_tag = -1;
@@ -502,7 +502,7 @@ void blk_mq_free_request(struct request *rq)

  	ctx->rq_completed[rq_is_sync(rq)]++;
  	if (rq->rq_flags & RQF_MQ_INFLIGHT)
-		atomic_dec(&hctx->nr_active);
+		__blk_mq_dec_active_requests(hctx, q);

  	if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
  		laptop_io_completion(q->backing_dev_info);
@@ -1048,7 +1048,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
  	if (rq->tag >= 0) {
  		if (shared) {
  			rq->rq_flags |= RQF_MQ_INFLIGHT;
-			atomic_inc(&data.hctx->nr_active);
+			__blk_mq_inc_active_requests(rq->mq_hctx, rq->q);
  		}
  		data.hctx->tags->rqs[rq->tag] = rq;
  	}
diff --git a/block/blk-mq.h b/block/blk-mq.h
index dde2d29f0ce5..5ab1566c4b7d 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -202,6 +202,29 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
  	return true;
  }

+static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx, struct request_queue *q)
+{
+	if (blk_mq_is_sbitmap_shared(q->tag_set))
+		atomic_inc(&q->nr_active_requests);
+	else
+		atomic_inc(&hctx->nr_active);
+}
+
+static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx, struct request_queue *q)
+{
+	if (blk_mq_is_sbitmap_shared(q->tag_set))
+		atomic_dec(&q->nr_active_requests);
+	else
+		atomic_dec(&hctx->nr_active);
+}
+
+static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx, struct request_queue *q)
+{
+	if (blk_mq_is_sbitmap_shared(q->tag_set))
+		return atomic_read(&q->nr_active_requests);
+	return atomic_read(&hctx->nr_active);
+}
+
  static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
  					   struct request *rq)
  {
@@ -210,7 +233,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,

  	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
  		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
-		atomic_dec(&hctx->nr_active);
+		__blk_mq_dec_active_requests(hctx, rq->q);
  	}
  }

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 32868fbedc9e..5f2955872234 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -579,6 +579,8 @@ struct request_queue {

  	size_t			cmd_size;

+	atomic_t nr_active_requests;
+
  	struct work_struct	release_work;

  #define BLK_MAX_WRITE_HINTS	5
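
To put rough numbers on the fair-share formula above (values are purely
illustrative, not from Kashyap's setup): with a shared depth of 4096 tags
and 64 disks,

	/* nr_hw_queues = 1: one hctx per request queue, so users = 64 */
	depth = max((4096 + 64 - 1) / 64, 4U);		/* = 64 tags each */

	/* shared sbitmap with per-hctx accounting: the same 64 request queues
	 * can mark many hctx active, say 16 each, so users = 1024 */
	depth = max((4096 + 1024 - 1) / 1024, 4U);	/* = 4 tags each */

which is the kind of unwanted throttling being described below.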

>> This can create unwanted throttling and
>> eventually more lock contention in sbitmap.
>> I added the below patch and things improved a bit, but it is not a
>> complete fix.
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 586c9d6..c708fbc 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -60,10 +60,12 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>>    * For shared tag users, we track the number of currently active users
>>    * and attempt to provide a fair share of the tag depth for each of 
>> them.
>>    */
>> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
>> +static inline bool hctx_may_queue(struct request_queue *q,
>> +                                 struct blk_mq_hw_ctx *hctx,
>>                                    struct sbitmap_queue *bt)
>>   {
>> -       unsigned int depth, users;
>> +       unsigned int depth, users, i, outstanding = 0;
>> +       struct blk_mq_hw_ctx *hctx_iter;
>>
>>          if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
>>                  return true;
>> @@ -84,14 +86,18 @@ static inline bool hctx_may_queue(struct 
>> blk_mq_hw_ctx
>> *hctx,
>>           * Allow at least some tags
>>           */
>>          depth = max((bt->sb.depth + users - 1) / users, 4U);
>> -       return atomic_read(&hctx->nr_active) < depth;
>> +
>> +       queue_for_each_hw_ctx(q, hctx_iter, i)
>> +               outstanding += atomic_read(&hctx_iter->nr_active);
>> +
> 
> OK,  I think that we need to find a cleaner way to do this.
> 
>> +       return outstanding < depth;
>>   }
>>
>>   static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
>>                              struct sbitmap_queue *bt)
>>   {
>>          if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
>> -           !hctx_may_queue(data->hctx, bt))
>>
>>
>> #2 - In the completion path, the scsi module calls blk_mq_run_hw_queues()
>> upon IO completion. I am not sure if it is good to run all the h/w queues,
>> or if running just the h/w queue of the current request is good enough.
>> The below patch helped to reduce contention in hctx_lock().
>>
>> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
>> index 610ee41..f72de2a 100644
>> --- a/drivers/scsi/scsi_lib.c
>> +++ b/drivers/scsi/scsi_lib.c
>> @@ -572,6 +572,7 @@ static bool scsi_end_request(struct request *req,
>> blk_status_t error,
>>          struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
>>          struct scsi_device *sdev = cmd->device;
>>          struct request_queue *q = sdev->request_queue;
>> +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
>>
>>          if (blk_update_request(req, error, bytes))
>>                  return true;
>> @@ -613,7 +614,7 @@ static bool scsi_end_request(struct request *req,
>> blk_status_t error,
>>              !list_empty(&sdev->host->starved_list))
>>                  kblockd_schedule_work(&sdev->requeue_work);
>>          else
>> -               blk_mq_run_hw_queues(q, true);
>> +               blk_mq_run_hw_queue(mq_hctx, true);
> 
> Not sure on this. But, indeed, I found running all queues did add lots 
> of extra load for when enabling the deadline scheduler.
> 
>>
>>          percpu_ref_put(&q->q_usage_counter);
>>          return false;
>>
>> #3 - __blk_mq_tag_idle() calls blk_mq_tag_wakeup_all(), which may not be
>> optimal for a shared queue.
>> There is a penalty if we are calling __sbq_wake_up() frequently. In the
>> case of nr_hw_queue = 1, things are better because there is one hctx and
>> hctx->state will avoid multiple calls.
>> If blk_mq_tag_wakeup_all() is called from the hctx0 context, there is no
>> need to call it from hctx1, hctx2, etc.
>>
>> I have added below patch in my testing.
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index 586c9d6..5b331e5 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -53,7 +53,9 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>>
>>          atomic_dec(&tags->active_queues);
>>
>> -       blk_mq_tag_wakeup_all(tags, false);
>> +       /* TBD - Do this only for hctx->flags & 
>> BLK_MQ_F_TAG_HCTX_SHARED */
>> +       if (hctx->queue_num == 0)
>> +               blk_mq_tag_wakeup_all(tags, false);
> 
> ok, I see. But, again, I think we need to find a cleaner way to do this.
> 
>>   }
>>
>>   /*
>>
>>
>> With all the above-mentioned changes, I see performance improved from
>> 2.2M IOPS to 2.7M on the same workload and profile.
> 
> But still short of nr_hw_queue = 1, which got 3.1M IOPS, as below, right?
> 
> Thanks,
> John
> 
>>
>>>
>>>>
>>>> Old Driver which has nr_hw_queue = 1 and I issue IOs from <fio>  queue
>>>> depth = 128. We get 3.1M IOPS in this config. This eventually exhaust
>>>> host can_queue.
>>>
>>> So I think I need to find a point where we start to get throttled.
>>>
>>>> Note - Very low contention in sbitmap_get()
>>>>
>>>> -   23.58%     0.25%  fio              [kernel.vmlinux]            [k]
>>>> blk_mq_make_request
>>>>      - 23.33% blk_mq_make_request
>>>>         - 21.68% blk_mq_get_request
>>>>            - 20.19% blk_mq_get_tag
>>>>               + 10.08% prepare_to_wait_exclusive
>>>>               + 4.51% io_schedule
>>>>               - 3.59% __sbitmap_queue_get
>>>>                  - 2.82% sbitmap_get
>>>>                       0.86% __sbitmap_get_word
>>>>                       0.75% _raw_spin_lock_irqsave
>>>>                       0.55% _raw_spin_unlock_irqrestore
>>>>
>>>> Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>
>>>> queue depth = 128. We get 2.3 M IOPS in this config. This eventually
>>>> exhaust host can_queue.
>>>> Note - Very high contention in sbitmap_get()
>>>>
>>>> -   42.39%     0.12%  fio              [kernel.vmlinux]            [k]
>>>> generic_make_request
>>>>      - 42.27% generic_make_request
>>>>         - 41.00% blk_mq_make_request
>>>>            - 38.28% blk_mq_get_request
>>>>               - 33.76% blk_mq_get_tag
>>>>                  - 30.25% __sbitmap_queue_get
>>>>                     - 29.90% sbitmap_get
>>>>                        + 9.06% _raw_spin_lock_irqsave
>>>>                        + 7.94% _raw_spin_unlock_irqrestore
>>>>                        + 3.86% __sbitmap_get_word
>>>>                        + 1.78% call_function_single_interrupt
>>>>                        + 0.67% ret_from_intr
>>>>                  + 1.69% io_schedule
>>>>                    0.59% prepare_to_wait_exclusive
>>>>                    0.55% __blk_mq_get_tag
>>>>
>>>> In this particular case, I observed alloc_hint = zeros which means,
>>>> sbitmap_get is not able to find free tags from hint. That may lead to
>>>> contention.
>>>> This condition is not happening with nr_hw_queue=1 (without RFC) 
>>>> driver.
>>>>
>>>> alloc_hint=
>>>> {663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779, 377,
>>>> ***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902, 2224, 3212,
>>>> 836, 1892, 1669, 2420, 3415, 1904, 512, 3027, 4810, 2845, 4690, 712,
>>>> 3105, 0, 0, 0, 3268, 4915, 3897, 1349, 547, 4, 733, 1765, 2068, 979,
>>>> 51, 880, 0, 370, 3520, 2877, 4097, 418, 4501, 3717, 2893, 604, 508,
>>>> 759, 3329, 4038, 4829, 715, 842, 1443, 556}
>>>>
>>>> Driver with RFC which has nr_hw_queue = N and I issue IOs from <fio>
>>>> queue depth = 32. We get 3.1M IOPS in this config. This workload does
>>>> *not* exhaust host can_queue.
>>>
>>> Please ensure .host_tagset is set for whenever nr_hw_queue = N. This 
>>> is as
>>> per
>>> RFC, and I don't think you modified from the RFC for your test.
>>> But I just wanted to mention that to be crystal clear.
>>
>> Yes I have two separate driver copy. One with RFC change and another 
>> without
>> RFC.
>>>
>>>>
>>>> -    5.07%     0.14%  fio              [kernel.vmlinux]  [k]
>>>> generic_make_request
>>>>      - 4.93% generic_make_request
>>>>         - 3.61% blk_mq_make_request
>>>>            - 2.04% blk_mq_get_request
>>>>               - 1.08% blk_mq_get_tag
>>>>                  - 0.70% __sbitmap_queue_get
>>>>                       0.67% sbitmap_get
>>>>
>>>> In summary, RFC has some performance bottleneck in sbitmap_get () if
>>>> outstanding per shost is about to exhaust.  Without this RFC also
>>>> driver works in nr_hw_queue = 1, but that case is managed very well.
>>>> I am not sure why it happens only with shared host tag ? Theoretically
>>>> all the hctx is sharing the same bitmaptag which is same as
>>>> nr_hw_queue=1, so why contention is only visible in shared host tag
>>>> case.
>>>
>>> Let me check this.
>>>
>>>>
>>>> If you want to reproduce this issue, may be you have to reduce the
>>>> can_queue in hisi_sas driver.
>>>>
>>>
>>> Thanks,
>>> John
> 


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* RE: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-29 11:29                             ` John Garry
@ 2020-04-29 15:50                               ` Kashyap Desai
  2020-04-29 17:55                                 ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: Kashyap Desai @ 2020-04-29 15:50 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, Sumit Saxena, hch, Shivasharan Srikanteshwara,
	chenxiang (M),
	linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

> -----Original Message-----
> From: John Garry [mailto:john.garry@huawei.com]
> Sent: Wednesday, April 29, 2020 5:00 PM
> To: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: axboe@kernel.dk; jejb@linux.ibm.com; martin.petersen@oracle.com;
> ming.lei@redhat.com; bvanassche@acm.org; hare@suse.de;
> don.brace@microsemi.com; Sumit Saxena <sumit.saxena@broadcom.com>;
> hch@infradead.org; Shivasharan Srikanteshwara
> <shivasharan.srikanteshwara@broadcom.com>; chenxiang (M)
> <chenxiang66@hisilicon.com>; linux-block@vger.kernel.org; linux-
> scsi@vger.kernel.org; esc.storagedev@microsemi.com; Hannes Reinecke
> <hare@suse.com>
> Subject: Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to
> MQ
>
> On 28/04/2020 16:55, John Garry wrote:
> > Hi Kashyap,
> >
> >> I am using <mq-deadline> which is MQ version of SQ deadline.
> >> For now we can just consider case without scheduler.
> >>
> >> I have some more findings. May be someone from upstream community
> can
> >> connect the dots.
> >>
> >> #1. hctx_may_queue() throttle the IO at hctx level. This is
> >> eventually per sdev throttling for megaraid_sas driver because It
> >> creates only one context - hctx0 for each scsi device.
> >>
> >> If driver is using only one h/w queue,  active_queues will be always
> >> steady.
> >> In my test it was 64 thread, so active_queues=64.
> >
> > So I figure that the 64 threads come from having 64 disks.
> >
> >> Even though <fio> thread is shared among allowed cpumask
> >> (cpus_allowed_policy=shared option in fio),  active_queues will be
> >> always 64 because we have only one h/w queue.
> >> All the logical cpus are mapped to one h/w queue. It means, thread
> >> moving from one cpu to another cpu will not change active_queues per
> hctx.
> >>
> >> In case of this RFC, active_queues are now per hctx and there are
> >> multiple hctx, but tags are shared.
> >
> > Right, so we need a policy to divide up the shared tags across request
> > queues, based on principle of fairness.
> >
>
> Hi Kashyap,
>
> How about this, which counts requests per request_queue for shared sbitmap
> instead of per hctx (and, as such, does not need to aggregate them in
> hctx_may_queue()):

Hi John,

I have to add additional changes on top of your below patch -
active_queues should also be managed differently for the shared tag map case.
I added an extra flag in the queue_flags interface, which is per request queue.

The combination of your patch and the below resolves the fairness issues. I see
some better results using "--cpus_allowed_policy=split", but
--cpus_allowed_policy=shared is still having some issue and most likely it
is to do with fairness. If fairness is not managed properly, there is high
contention in wait/wakeup handling of sbitmap.


=====Additional patch ===

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 3719e1a..95bcb47 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -23,10 +23,20 @@
  */
 bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
-       if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
-           !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-               atomic_inc(&hctx->tags->active_queues);
+       struct blk_mq_tags *tags = hctx->tags;
+       struct request_queue *q = hctx->queue;
+       struct sbitmap_queue *bt;

+       if (blk_mq_is_sbitmap_shared(q->tag_set)){
+               bt = tags->bitmap_tags;
+               if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) &&
+                       !test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
+                       atomic_inc(&bt->active_queues_per_host);
+       } else {
+               if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
+                   !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+                       atomic_inc(&hctx->tags->active_queues);
+       }
        return true;
 }

@@ -47,12 +57,20 @@ void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
 void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
        struct blk_mq_tags *tags = hctx->tags;
+       struct sbitmap_queue *bt;
+       struct request_queue *q = hctx->queue;

-       if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-               return;
-
-       atomic_dec(&tags->active_queues);
+       if (blk_mq_is_sbitmap_shared(q->tag_set)){
+               bt = tags->bitmap_tags;
+               if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
+                       return;
+               atomic_dec(&bt->active_queues_per_host);
+       } else {
+               if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+                       return;

+               atomic_dec(&tags->active_queues);
+       }
        blk_mq_tag_wakeup_all(tags, false);
 }

@@ -65,12 +83,13 @@ static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
 {
        struct blk_mq_hw_ctx *hctx = data->hctx;
        struct request_queue *q = data->q;
+       struct blk_mq_tags *tags = hctx->tags;
        unsigned int depth, users;

        if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
                return true;
-       if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-               return true;
+       //if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+       //      return true;

        /*
         * Don't try dividing an ant
@@ -78,7 +97,11 @@ static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
        if (bt->sb.depth == 1)
                return true;

-       users = atomic_read(&hctx->tags->active_queues);
+       if (blk_mq_is_sbitmap_shared(q->tag_set)) {
+               bt = tags->bitmap_tags;
+               users = atomic_read(&bt->active_queues_per_host);
+       } else
+               users = atomic_read(&hctx->tags->active_queues);
        if (!users)
                return true;

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2bc9998..7049800 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -613,6 +613,7 @@ struct request_queue {
 #define QUEUE_FLAG_PCI_P2PDMA  25      /* device supports PCI p2p requests */
 #define QUEUE_FLAG_ZONE_RESETALL 26    /* supports Zone Reset All */
 #define QUEUE_FLAG_RQ_ALLOC_TIME 27    /* record rq->alloc_time_ns */
+#define QUEUE_FLAG_HCTX_ACTIVE 28      /* at least one hctx is active */

 #define QUEUE_FLAG_MQ_DEFAULT  ((1 << QUEUE_FLAG_IO_STAT) |            \
                                 (1 << QUEUE_FLAG_SAME_COMP))
diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
index e40d019..fb44925 100644
--- a/include/linux/sbitmap.h
+++ b/include/linux/sbitmap.h
@@ -139,6 +139,8 @@ struct sbitmap_queue {
         * sbitmap_queue_get_shallow() or __sbitmap_queue_get_shallow().
         */
        unsigned int min_shallow_depth;
+
+       atomic_t active_queues_per_host;
 };

 /**

>
> ---->8
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index
> ba68d934e3ca..0c8adecba219 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -60,9 +60,11 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>    * For shared tag users, we track the number of currently active users
>    * and attempt to provide a fair share of the tag depth for each of
> them.
>    */
> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> +static inline bool hctx_may_queue(struct blk_mq_alloc_data *data,
>   				  struct sbitmap_queue *bt)
>   {
> +	struct blk_mq_hw_ctx *hctx = data->hctx;
> +	struct request_queue *q = data->q;
>   	unsigned int depth, users;
>
>   	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) @@ -
> 84,14 +86,14 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx
> *hctx,
>   	 * Allow at least some tags
>   	 */
>   	depth = max((bt->sb.depth + users - 1) / users, 4U);
> -	return atomic_read(&hctx->nr_active) < depth;
> +	return __blk_mq_active_requests(hctx, q) < depth;
>   }
>
>   static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
>   			    struct sbitmap_queue *bt)
>   {
>   	if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
> -	    !hctx_may_queue(data->hctx, bt))
> +	    !hctx_may_queue(data, bt))
>   		return -1;
>   	if (data->shallow_depth)
>   		return __sbitmap_queue_get_shallow(bt, data-
> >shallow_depth); diff --git a/block/blk-mq.c b/block/blk-mq.c index
> 7e446f946e73..5bbd95e01f08 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -282,7 +282,7 @@ static struct request *blk_mq_rq_ctx_init(struct
> blk_mq_alloc_data *data,
>   	} else {
>   		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
>   			rq_flags = RQF_MQ_INFLIGHT;
> -			atomic_inc(&data->hctx->nr_active);
> +			__blk_mq_inc_active_requests(data->hctx, data->q);
>   		}
>   		rq->tag = tag;
>   		rq->internal_tag = -1;
> @@ -502,7 +502,7 @@ void blk_mq_free_request(struct request *rq)
>
>   	ctx->rq_completed[rq_is_sync(rq)]++;
>   	if (rq->rq_flags & RQF_MQ_INFLIGHT)
> -		atomic_dec(&hctx->nr_active);
> +		__blk_mq_dec_active_requests(hctx, q);
>
>   	if (unlikely(laptop_mode && !blk_rq_is_passthrough(rq)))
>   		laptop_io_completion(q->backing_dev_info);
> @@ -1048,7 +1048,7 @@ bool blk_mq_get_driver_tag(struct request *rq)
>   	if (rq->tag >= 0) {
>   		if (shared) {
>   			rq->rq_flags |= RQF_MQ_INFLIGHT;
> -			atomic_inc(&data.hctx->nr_active);
> +			__blk_mq_inc_active_requests(rq->mq_hctx, rq->q);
>   		}
>   		data.hctx->tags->rqs[rq->tag] = rq;
>   	}
> diff --git a/block/blk-mq.h b/block/blk-mq.h index
> dde2d29f0ce5..5ab1566c4b7d 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -202,6 +202,29 @@ static inline bool
> blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
>   	return true;
>   }
>
> +static inline  void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx
> *hctx, struct request_queue *q)
> +{
> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
> +		atomic_inc(&q->nr_active_requests);
> +	else
> +		atomic_inc(&hctx->nr_active);
> +}
> +
> +static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx
> *hctx, struct request_queue *q)
> +{
> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
> +		atomic_dec(&q->nr_active_requests);
> +	else
> +		atomic_dec(&hctx->nr_active);
> +}
> +
> +static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx,
> struct request_queue *q)
> +{
> +	if (blk_mq_is_sbitmap_shared(q->tag_set))
> +		return atomic_read(&q->nr_active_requests);
> +	return atomic_read(&hctx->nr_active);
> +}
> +
>   static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
>   					   struct request *rq)
>   {
> @@ -210,7 +233,7 @@ static inline void __blk_mq_put_driver_tag(struct
> blk_mq_hw_ctx *hctx,
>
>   	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
>   		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
> -		atomic_dec(&hctx->nr_active);
> +		__blk_mq_dec_active_requests(hctx, rq->q);
>   	}
>   }
>
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index
> 32868fbedc9e..5f2955872234 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -579,6 +579,8 @@ struct request_queue {
>
>   	size_t			cmd_size;
>
> +	atomic_t nr_active_requests;
> +
>   	struct work_struct	release_work;
>
>   #define BLK_MAX_WRITE_HINTS	5
>
> > This can create unwanted throttling and
> >> eventually more lock contention in sbitmap.
> >> I added below patch and things improved a bit, but not a full proof.
> >>
> >> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index
> >> 586c9d6..c708fbc 100644
> >> --- a/block/blk-mq-tag.c
> >> +++ b/block/blk-mq-tag.c
> >> @@ -60,10 +60,12 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx
> >> *hctx)
> >>    * For shared tag users, we track the number of currently active
> >> users
> >>    * and attempt to provide a fair share of the tag depth for each of
> >> them.
> >>    */
> >> -static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> >> +static inline bool hctx_may_queue(struct request_queue *q,
> >> +                                 struct blk_mq_hw_ctx *hctx,
> >>                                    struct sbitmap_queue *bt)
> >>   {
> >> -       unsigned int depth, users;
> >> +       unsigned int depth, users, i, outstanding = 0;
> >> +       struct blk_mq_hw_ctx *hctx_iter;
> >>
> >>          if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
> >>                  return true;
> >> @@ -84,14 +86,18 @@ static inline bool hctx_may_queue(struct
> >> blk_mq_hw_ctx *hctx,
> >>           * Allow at least some tags
> >>           */
> >>          depth = max((bt->sb.depth + users - 1) / users, 4U);
> >> -       return atomic_read(&hctx->nr_active) < depth;
> >> +
> >> +       queue_for_each_hw_ctx(q, hctx_iter, i)
> >> +               outstanding += atomic_read(&hctx_iter->nr_active);
> >> +
> >
> > OK,  I think that we need to find a cleaner way to do this.
> >
> >> +       return outstanding < depth;
> >>   }
> >>
> >>   static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
> >>                              struct sbitmap_queue *bt)
> >>   {
> >>          if (!(data->flags & BLK_MQ_REQ_INTERNAL) &&
> >> -           !hctx_may_queue(data->hctx, bt))
> >>
> >>
> >> #2 - In completion path, scsi module call blk_mq_run_hw_queues() upon
> >> IO completion.  I am not sure if it is good to run all the h/w queue
> >> or just h/w queue of current reference is good enough.
> >> Below patch helped to reduce contention in hcxt_lock().
> >>
> >> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index
> >> 610ee41..f72de2a 100644
> >> --- a/drivers/scsi/scsi_lib.c
> >> +++ b/drivers/scsi/scsi_lib.c
> >> @@ -572,6 +572,7 @@ static bool scsi_end_request(struct request *req,
> >> blk_status_t error,
> >>          struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
> >>          struct scsi_device *sdev = cmd->device;
> >>          struct request_queue *q = sdev->request_queue;
> >> +       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;
> >>
> >>          if (blk_update_request(req, error, bytes))
> >>                  return true;
> >> @@ -613,7 +614,7 @@ static bool scsi_end_request(struct request *req,
> >> blk_status_t error,
> >>              !list_empty(&sdev->host->starved_list))
> >>                  kblockd_schedule_work(&sdev->requeue_work);
> >>          else
> >> -               blk_mq_run_hw_queues(q, true);
> >> +               blk_mq_run_hw_queue(mq_hctx, true);
> >
> > Not sure on this. But, indeed, I found running all queues did add lots
> > of extra load for when enabling the deadline scheduler.
> >
> >>
> >>          percpu_ref_put(&q->q_usage_counter);
> >>          return false;
> >>
> >> #3 -  __blk_mq_tag_idle() calls blk_mq_tag_wakeup_all which may not
> >> be optimal for shared queue.
> >> There is a penalty if we are calling __sbq_wake_up() frequently. In
> >> case of nr_hw_queue = 1, things are better because one hctx and
> >> hctx->state will avoid multiple calls.
> >> If blk_mq_tag_wakeup_all is called from hctx0 context, there is no
> >> need to call from hctx1, hctx2 etc.
> >>
> >> I have added below patch in my testing.
> >>
> >> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index
> >> 586c9d6..5b331e5 100644
> >> --- a/block/blk-mq-tag.c
> >> +++ b/block/blk-mq-tag.c
> >> @@ -53,7 +53,9 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
> >>
> >>          atomic_dec(&tags->active_queues);
> >>
> >> -       blk_mq_tag_wakeup_all(tags, false);
> >> +       /* TBD - Do this only for hctx->flags &
> >> BLK_MQ_F_TAG_HCTX_SHARED */
> >> +       if (hctx->queue_num == 0)
> >> +               blk_mq_tag_wakeup_all(tags, false);
> >
> > ok, I see. But, again, I think we need to find a cleaner way to do this.
> >
> >>   }
> >>
> >>   /*
> >>
> >>
> >> With all above mentioned changes, I see performance improved from
> >> 2.2M IOPS to 2.7M on same workload and profile.
> >
> > But still short of nr_hw_queue = 1, which got 3.1M IOPS, as below,
> > right?
> >
> > Thanks,
> > John
> >
> >>
> >>>
> >>>>
> >>>> Old Driver which has nr_hw_queue = 1 and I issue IOs from <fio>
> >>>> queue depth = 128. We get 3.1M IOPS in this config. This eventually
> >>>> exhaust host can_queue.
> >>>
> >>> So I think I need to find a point where we start to get throttled.
> >>>
> >>>> Note - Very low contention in sbitmap_get()
> >>>>
> >>>> -   23.58%     0.25%  fio              [kernel.vmlinux]
> >>>> [k] blk_mq_make_request
> >>>>      - 23.33% blk_mq_make_request
> >>>>         - 21.68% blk_mq_get_request
> >>>>            - 20.19% blk_mq_get_tag
> >>>>               + 10.08% prepare_to_wait_exclusive
> >>>>               + 4.51% io_schedule
> >>>>               - 3.59% __sbitmap_queue_get
> >>>>                  - 2.82% sbitmap_get
> >>>>                       0.86% __sbitmap_get_word
> >>>>                       0.75% _raw_spin_lock_irqsave
> >>>>                       0.55% _raw_spin_unlock_irqrestore
> >>>>
> >>>> Driver with RFC which has nr_hw_queue = N and I issue IOs from
> >>>> <fio> queue depth = 128. We get 2.3 M IOPS in this config. This
> >>>> eventually exhaust host can_queue.
> >>>> Note - Very high contention in sbitmap_get()
> >>>>
> >>>> -   42.39%     0.12%  fio              [kernel.vmlinux]
> >>>> [k] generic_make_request
> >>>>      - 42.27% generic_make_request
> >>>>         - 41.00% blk_mq_make_request
> >>>>            - 38.28% blk_mq_get_request
> >>>>               - 33.76% blk_mq_get_tag
> >>>>                  - 30.25% __sbitmap_queue_get
> >>>>                     - 29.90% sbitmap_get
> >>>>                        + 9.06% _raw_spin_lock_irqsave
> >>>>                        + 7.94% _raw_spin_unlock_irqrestore
> >>>>                        + 3.86% __sbitmap_get_word
> >>>>                        + 1.78% call_function_single_interrupt
> >>>>                        + 0.67% ret_from_intr
> >>>>                  + 1.69% io_schedule
> >>>>                    0.59% prepare_to_wait_exclusive
> >>>>                    0.55% __blk_mq_get_tag
> >>>>
> >>>> In this particular case, I observed alloc_hint = zeros which means,
> >>>> sbitmap_get is not able to find free tags from hint. That may lead
> >>>> to contention.
> >>>> This condition is not happening with nr_hw_queue=1 (without RFC)
> >>>> driver.
> >>>>
> >>>> alloc_hint=
> >>>> {663, 2425, 3060, 54, 3149, 4319, 4175, 4867, 543, 2481, 0, 4779,
> >>>> 377, ***0***, 2010, 0, 909, 3350, 1546, 2179, 2875, 659, 3902,
> >>>> 2224, 3212, 836, 1892, 1669, 2420, 3415, 1904, 512, 3027, 4810,
> >>>> 2845, 4690, 712, 3105, 0, 0, 0, 3268, 4915, 3897, 1349, 547, 4,
> >>>> 733, 1765, 2068, 979, 51, 880, 0, 370, 3520, 2877, 4097, 418, 4501,
> >>>> 3717, 2893, 604, 508, 759, 3329, 4038, 4829, 715, 842, 1443, 556}
> >>>>
> >>>> Driver with RFC which has nr_hw_queue = N and I issue IOs from
> >>>> <fio> queue depth = 32. We get 3.1M IOPS in this config. This
> >>>> workload does
> >>>> *not* exhaust host can_queue.
> >>>
> >>> Please ensure .host_tagset is set for whenever nr_hw_queue = N. This
> >>> is as per RFC, and I don't think you modified from the RFC for your
> >>> test.
> >>> But I just wanted to mention that to be crystal clear.
> >>
> >> Yes I have two separate driver copy. One with RFC change and another
> >> without RFC.
> >>>
> >>>>
> >>>> -    5.07%     0.14%  fio              [kernel.vmlinux]  [k]
> >>>> generic_make_request
> >>>>      - 4.93% generic_make_request
> >>>>         - 3.61% blk_mq_make_request
> >>>>            - 2.04% blk_mq_get_request
> >>>>               - 1.08% blk_mq_get_tag
> >>>>                  - 0.70% __sbitmap_queue_get
> >>>>                       0.67% sbitmap_get
> >>>>
> >>>> In summary, RFC has some performance bottleneck in sbitmap_get ()
> >>>> if outstanding per shost is about to exhaust.  Without this RFC
> >>>> also driver works in nr_hw_queue = 1, but that case is managed very
> well.
> >>>> I am not sure why it happens only with shared host tag ?
> >>>> Theoretically all the hctx is sharing the same bitmaptag which is
> >>>> same as nr_hw_queue=1, so why contention is only visible in shared
> >>>> host tag case.
> >>>
> >>> Let me check this.
> >>>
> >>>>
> >>>> If you want to reproduce this issue, may be you have to reduce the
> >>>> can_queue in hisi_sas driver.
> >>>>
> >>>
> >>> Thanks,
> >>> John
> >

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-29 15:50                               ` Kashyap Desai
@ 2020-04-29 17:55                                 ` John Garry
  2020-04-30 17:40                                   ` John Garry
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-04-29 17:55 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, Sumit Saxena, hch, Shivasharan Srikanteshwara,
	chenxiang (M),
	linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

Hi Kashyap,

> 
> I have to add additional changes on top of your below patch -
> active_queues also should be managed differently for shared tag map case. I
> added extra flags in queue_flags interface which is per request.

ok, so it's not proper to use active hctx per request queue as "users",
but rather the number of active request_queues per host (which is what we
effectively have for nr_hw_queues = 1). Just a small comment at the
bottom on your change.

So I already experimented with reducing shost.can_queue significantly on 
hisi_sas (by a factor of 8, from 4096->512, and I use 12x SAS SSD), and saw:

With RFC + shost.can_queue reduced: ~1.2M IOPS
With RFC + shost.can_queue unmodified: ~2M IOPS
Without RFC + shost.can_queue unmodified: ~2M IOPS

And with the change below, I still get around 1.2M IOPS. But there may
be some sweet spot/zone where this makes a difference, which I'm not
hitting.

> 
> Combination of your patch and below resolves fairness issues. I see some
> better results using " --cpus_allowed_policy=split", but
> --cpus_allowed_policy=shared 

I did not try changing the cpus_allowed_policy setting, and so I would be
using the default, which I believe is shared.

> is still having some issue and most likely it
> is to do with fairness. If fairness is not managed properly, there is high
> contention in wait/wakeup handling of sbitmap.

ok, I can investigate.

> 
> 
> =====Additional patch ===
> 
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 3719e1a..95bcb47 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -23,10 +23,20 @@
>    */
>   bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>   {
> -       if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
> -           !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -               atomic_inc(&hctx->tags->active_queues);
> +       struct blk_mq_tags *tags = hctx->tags;
> +       struct request_queue *q = hctx->queue;
> +       struct sbitmap_queue *bt;
> 
> +       if (blk_mq_is_sbitmap_shared(q->tag_set)){
> +               bt = tags->bitmap_tags;
> +               if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags) &&
> +                       !test_and_set_bit(QUEUE_FLAG_HCTX_ACTIVE,
> &q->queue_flags))
> +                       atomic_inc(&bt->active_queues_per_host);
> +       } else {
> +               if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
> +                   !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +                       atomic_inc(&hctx->tags->active_queues);
> +       }
>          return true;
>   }
> 
> @@ -47,12 +57,20 @@ void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags,
> bool include_reserve)
>   void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>   {
>          struct blk_mq_tags *tags = hctx->tags;
> +       struct sbitmap_queue *bt;
> +       struct request_queue *q = hctx->queue;
> 
> -       if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -               return;
> -
> -       atomic_dec(&tags->active_queues);
> +       if (blk_mq_is_sbitmap_shared(q->tag_set)){
> +               bt = tags->bitmap_tags;
> +               if (!test_and_clear_bit(QUEUE_FLAG_HCTX_ACTIVE,
> &q->queue_flags))
> +                       return;
> +               atomic_dec(&bt->active_queues_per_host);
> +       } else {
> +               if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +                       return;
> 
> +               atomic_dec(&tags->active_queues);
> +       }
>          blk_mq_tag_wakeup_all(tags, false);
>   }
> 
> @@ -65,12 +83,13 @@ static inline bool hctx_may_queue(struct
> blk_mq_alloc_data *data,
>   {
>          struct blk_mq_hw_ctx *hctx = data->hctx;
>          struct request_queue *q = data->q;
> +       struct blk_mq_tags *tags = hctx->tags;
>          unsigned int depth, users;
> 
>          if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
>                  return true;
> -       if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> -               return true;
> +       //if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +       //      return true;
> 
>          /*
>           * Don't try dividing an ant
> @@ -78,7 +97,11 @@ static inline bool hctx_may_queue(struct
> blk_mq_alloc_data *data,
>          if (bt->sb.depth == 1)
>                  return true;
> 
> -       users = atomic_read(&hctx->tags->active_queues);
> +       if (blk_mq_is_sbitmap_shared(q->tag_set)) {
> +               bt = tags->bitmap_tags;
> +               users = atomic_read(&bt->active_queues_per_host);
> +       } else
> +               users = atomic_read(&hctx->tags->active_queues);
>          if (!users)
>                  return true;
> 
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 2bc9998..7049800 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -613,6 +613,7 @@ struct request_queue {
>   #define QUEUE_FLAG_PCI_P2PDMA  25      /* device supports PCI p2p requests
> */
>   #define QUEUE_FLAG_ZONE_RESETALL 26    /* supports Zone Reset All */
>   #define QUEUE_FLAG_RQ_ALLOC_TIME 27    /* record rq->alloc_time_ns */
> +#define QUEUE_FLAG_HCTX_ACTIVE 28      /* atleast one hctx is active*/
> 
>   #define QUEUE_FLAG_MQ_DEFAULT  ((1 << QUEUE_FLAG_IO_STAT) |            \
>                                   (1 << QUEUE_FLAG_SAME_COMP))
> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index e40d019..fb44925 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -139,6 +139,8 @@ struct sbitmap_queue {
>           * sbitmap_queue_get_shallow() or __sbitmap_queue_get_shallow().
>           */
>          unsigned int min_shallow_depth;
> +
> +       atomic_t active_queues_per_host;

It's probably better to put this in the blk_mq_tag_set structure.
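
A completely untested sketch of what I mean (the field name is just a
placeholder), so that the per-host count lives with the tag set rather
than in the generic sbitmap code:

struct blk_mq_tag_set {
	...
	atomic_t	active_queues_shared_sbitmap;	/* active request queues per host */
};

and hctx_may_queue() would then read it via the tag set:

	if (blk_mq_is_sbitmap_shared(q->tag_set))
		users = atomic_read(&q->tag_set->active_queues_shared_sbitmap);
	else
		users = atomic_read(&hctx->tags->active_queues);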

>   };


Thanks very much,
John


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-29 17:55                                 ` John Garry
@ 2020-04-30 17:40                                   ` John Garry
  2020-04-30 19:18                                     ` Kashyap Desai
  0 siblings, 1 reply; 35+ messages in thread
From: John Garry @ 2020-04-30 17:40 UTC (permalink / raw)
  To: Kashyap Desai
  Cc: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, Sumit Saxena, hch, Shivasharan Srikanteshwara,
	chenxiang (M),
	linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

On 29/04/2020 18:55, John Garry wrote:
> 

Hi Kashyap,

> ok, so it's not proper to use active hctx per request queue as "users", 
> but rather that the active request_queue's per host (which is what we 
> effectively have for nr_hw_queues = 1). Just a small comment at the 
> bottom on your change.
> 
> So I already experimented with reducing shost.can_queue significantly on 
> hisi_sas (by a factor of 8, from 4096->512, and I use 12x SAS SSD), and 
> saw:
> 
> RFC + shost.can_queue reduced: ~1.2M IOPS
> With RFC + shost.can_queue unmodified: ~2M IOPS
> Without RFC + shost.can_queue unmodified: ~2M IOPS
> 
> And with the change, below, I still get around 1.2M IOPS. But there may 
> be some sweet spot/zone where this makes a difference, which I'm not 
> visiting.
> 
>>
>> Combination of your patch and below resolves fairness issues. I see some
>> better results using " --cpus_allowed_policy=split", but
>> --cpus_allowed_policy=shared 
> 
> I did not try changing the cpus_allowed_policy policy, and so I would be 
> using default, which I believe is shared.
> 
> is still having some issue and most likely it
>> is to do with fairness. If fairness is not managed properly, there is 
>> high
>> contention in wait/wakeup handling of sbitmap.
> 
> ok, I can investigate.

Could you also try this change:

diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 0a57e4f041a9..e959971b1cee 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -46,11 +46,25 @@ extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
  void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
  		void *priv);

+static inline bool blk_mq_is_sbitmap_shared(struct blk_mq_tag_set *tag_set)
+{
+	return tag_set->flags & BLK_MQ_F_TAG_HCTX_SHARED;
+}
+
  static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
  						 struct blk_mq_hw_ctx *hctx)
  {
+	struct request_queue *queue;
+	struct blk_mq_tag_set *set;
+
  	if (!hctx)
  		return &bt->ws[0];
+	queue = hctx->queue;
+	set = queue->tag_set;
+
+	if (blk_mq_is_sbitmap_shared(set))
+		return sbq_wait_ptr(bt, &set->wait_index);
+	
  	return sbq_wait_ptr(bt, &hctx->wait_index);
  }

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 4f12363d56bf..241607806f77 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -158,11 +158,6 @@ struct blk_mq_alloc_data {
  	struct blk_mq_hw_ctx *hctx;
  };

-static inline bool blk_mq_is_sbitmap_shared(struct blk_mq_tag_set *tag_set)
-{
-	return tag_set->flags & BLK_MQ_F_TAG_HCTX_SHARED;
-}
-
  static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
  {
  	if (data->flags & BLK_MQ_REQ_INTERNAL)

diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 957cb43c5de7..427c7934c29d 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -259,6 +259,8 @@ struct blk_mq_tag_set {

  	struct mutex		tag_list_lock;
  	struct list_head	tag_list;
+
+	atomic_t		wait_index;
  };

  /**
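
The thinking (which may be off) is that once the sbitmap is shared, the
wait queues in bt->ws[] are shared by all hctx's too, so rotating a single
wait_index per tag set, rather than one per hctx, should spread waiters
more evenly across them. Roughly, sbq_wait_ptr() just round-robins:

	ws = &bt->ws[atomic_read(wait_index)];
	sbq_index_atomic_inc(wait_index);
	return ws;

so the only behavioural change is whose wait_index gets advanced.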


Thanks,
John

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* RE: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to MQ
  2020-04-30 17:40                                   ` John Garry
@ 2020-04-30 19:18                                     ` Kashyap Desai
  0 siblings, 0 replies; 35+ messages in thread
From: Kashyap Desai @ 2020-04-30 19:18 UTC (permalink / raw)
  To: John Garry
  Cc: axboe, jejb, martin.petersen, ming.lei, bvanassche, hare,
	don.brace, Sumit Saxena, hch, Shivasharan Srikanteshwara,
	chenxiang (M),
	linux-block, linux-scsi, esc.storagedev, Hannes Reinecke

> -----Original Message-----
> From: linux-scsi-owner@vger.kernel.org [mailto:linux-scsi-
> owner@vger.kernel.org] On Behalf Of John Garry
> Sent: Thursday, April 30, 2020 11:11 PM
> To: Kashyap Desai <kashyap.desai@broadcom.com>
> Cc: axboe@kernel.dk; jejb@linux.ibm.com; martin.petersen@oracle.com;
> ming.lei@redhat.com; bvanassche@acm.org; hare@suse.de;
> don.brace@microsemi.com; Sumit Saxena <sumit.saxena@broadcom.com>;
> hch@infradead.org; Shivasharan Srikanteshwara
> <shivasharan.srikanteshwara@broadcom.com>; chenxiang (M)
> <chenxiang66@hisilicon.com>; linux-block@vger.kernel.org; linux-
> scsi@vger.kernel.org; esc.storagedev@microsemi.com; Hannes Reinecke
> <hare@suse.com>
> Subject: Re: [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters to
> MQ
>
> On 29/04/2020 18:55, John Garry wrote:
> >
>
> Hi Kashyap,
>
> > ok, so it's not proper to use active hctx per request queue as
> > "users", but rather that the active request_queue's per host (which is
> > what we effectively have for nr_hw_queues = 1). Just a small comment
> > at the bottom on your change.
> >
> > So I already experimented with reducing shost.can_queue significantly
> > on hisi_sas (by a factor of 8, from 4096->512, and I use 12x SAS SSD),
> > and
> > saw:
> >
> > RFC + shost.can_queue reduced: ~1.2M IOPS With RFC + shost.can_queue
> > unmodified: ~2M IOPS Without RFC + shost.can_queue unmodified: ~2M
> > IOPS
> >
> > And with the change, below, I still get around 1.2M IOPS. But there
> > may be some sweet spot/zone where this makes a difference, which I'm
> > not visiting.
> >
> >>
> >> Combination of your patch and below resolves fairness issues. I see
> >> some better results using " --cpus_allowed_policy=split", but
> >> --cpus_allowed_policy=shared
> >
> > I did not try changing the cpus_allowed_policy policy, and so I would
> > be using default, which I believe is shared.
> >
> > is still having some issue and most likely it
> >> is to do with fairness. If fairness is not managed properly, there is
> >> high contention in wait/wakeup handling of sbitmap.
> >
> > ok, I can investigate.
>
> Could you also try this change:
>
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h index
> 0a57e4f041a9..e959971b1cee 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -46,11 +46,25 @@ extern void blk_mq_tag_wakeup_all(struct
> blk_mq_tags *tags, bool);
>   void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn
> *fn,
>   		void *priv);
>
> +static inline bool blk_mq_is_sbitmap_shared(struct blk_mq_tag_set
> +*tag_set) {
> +	return tag_set->flags & BLK_MQ_F_TAG_HCTX_SHARED; }
> +
>   static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue
> *bt,
>   						 struct blk_mq_hw_ctx *hctx)
>   {
> +	struct request_queue *queue;
> +	struct blk_mq_tag_set *set;
> +
>   	if (!hctx)
>   		return &bt->ws[0];
> +	queue = hctx->queue;
> +	set = queue->tag_set;
> +
> +	if (blk_mq_is_sbitmap_shared(set))
> +		return sbq_wait_ptr(bt, &set->wait_index);
> +
>   	return sbq_wait_ptr(bt, &hctx->wait_index);
>   }
>
> diff --git a/block/blk-mq.h b/block/blk-mq.h index
> 4f12363d56bf..241607806f77 100644
> --- a/block/blk-mq.h
> +++ b/block/blk-mq.h
> @@ -158,11 +158,6 @@ struct blk_mq_alloc_data {
>   	struct blk_mq_hw_ctx *hctx;
>   };
>
> -static inline bool blk_mq_is_sbitmap_shared(struct blk_mq_tag_set
> *tag_set)
> -{
> -	return tag_set->flags & BLK_MQ_F_TAG_HCTX_SHARED;
> -}
> -
>   static inline struct blk_mq_tags *blk_mq_tags_from_data(struct
> blk_mq_alloc_data *data)
>   {
>   	if (data->flags & BLK_MQ_REQ_INTERNAL)
>
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h index
> 957cb43c5de7..427c7934c29d 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -259,6 +259,8 @@ struct blk_mq_tag_set {
>
>   	struct mutex		tag_list_lock;
>   	struct list_head	tag_list;
> +
> +	atomic_t		wait_index;
>   };

John -

I have this patch in my local repo, as I was working on a similar change. I have
not tested your exact patch above, but a similar attempt was done. It is good to
include in the next revision.
There is no significant performance change, up or down, using this change.
Currently the system is not available for me to test. I will resume testing once
the system is available.


>
>   /**
>
>
> Thanks,
> John

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2020-04-30 19:18 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-05 11:54 [PATCH RFC v6 00/10] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
2020-03-05 11:54 ` [PATCH RFC v6 01/10] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
2020-03-05 11:54 ` [PATCH RFC v6 02/10] blk-mq: rename blk_mq_update_tag_set_depth() John Garry
2020-03-05 11:54 ` [PATCH RFC v6 03/10] blk-mq: Use pointers for blk_mq_tags bitmap tags John Garry
2020-03-05 12:42   ` Hannes Reinecke
2020-03-05 11:54 ` [PATCH RFC v6 04/10] blk-mq: Facilitate a shared sbitmap per tagset John Garry
2020-03-05 12:49   ` Hannes Reinecke
2020-03-05 13:52     ` John Garry
2020-03-05 11:54 ` [PATCH RFC v6 05/10] blk-mq: Add support in hctx_tags_bitmap_show() for a shared sbitmap John Garry
2020-03-05 12:52   ` Hannes Reinecke
2020-03-05 11:54 ` [PATCH RFC v6 06/10] scsi: Add template flag 'host_tagset' John Garry
2020-03-06 11:12   ` John Garry
2020-03-05 11:54 ` [PATCH RFC v6 07/10] scsi: hisi_sas: Switch v3 hw to MQ John Garry
2020-03-05 12:52   ` Hannes Reinecke
2020-03-05 11:54 ` [PATCH RFC v6 08/10] megaraid_sas: switch fusion adapters " John Garry
2020-04-07 11:14   ` Kashyap Desai
2020-04-08  9:33     ` John Garry
2020-04-08  9:59       ` Kashyap Desai
2020-04-17 16:46         ` John Garry
2020-04-20 17:47           ` Kashyap Desai
2020-04-21 12:35             ` John Garry
2020-04-22 18:59               ` Kashyap Desai
2020-04-22 21:28                 ` John Garry
2020-04-23 16:31                   ` John Garry
2020-04-24 16:31                     ` Kashyap Desai
2020-04-27 17:06                       ` John Garry
2020-04-27 18:58                         ` Kashyap Desai
2020-04-28 15:55                           ` John Garry
2020-04-29 11:29                             ` John Garry
2020-04-29 15:50                               ` Kashyap Desai
2020-04-29 17:55                                 ` John Garry
2020-04-30 17:40                                   ` John Garry
2020-04-30 19:18                                     ` Kashyap Desai
2020-03-05 11:54 ` [PATCH RFC v6 09/10] smartpqi: enable host tagset John Garry
2020-03-05 11:54 ` [PATCH RFC v6 10/10] hpsa: enable host_tagset and switch to MQ John Garry
