linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC 0/5] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
@ 2019-11-13 13:36 John Garry
  2019-11-13 13:36 ` [PATCH RFC 1/5] blk-mq: Remove some unused function arguments John Garry
                   ` (4 more replies)
  0 siblings, 5 replies; 22+ messages in thread
From: John Garry @ 2019-11-13 13:36 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang66, John Garry

This is a 2nd stab at solving the problem of hostwide shared tags for SCSI
HBAs.

As discussed previously, Ming Lei's most recent series in [0], which used
hctx[0] tags for all hctx of a host, was a bit messy and intrusive, and so
was seen as a no-go. Indeed, blk-mq is designed for separate tags per hctx.

This series introduces a different approach to solving that problem: keep
the per-hctx tags, but introduce a new, separate shared set of tags, which
SCSI HBAs can use as a hostwide tag space.

Adding support for shared tags should not have a significant performance
impact when shared tags are not requested.

Currently I have only fixed up the hisi_sas driver to use the shared tags,
but it should not be much trouble to convert other drivers.

Patch #3 is quite experimental at this point. I also threw in a minor
tidy-up patch.

[0] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/

Hannes Reinecke (1):
  scsi: Add template flag 'host_tagset'

John Garry (3):
  blk-mq: Remove some unused function arguments
  blk-mq: Facilitate a shared tags per tagset
  scsi: hisi_sas: Switch v3 hw to MQ

Ming Lei (1):
  blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED

 block/blk-core.c                       |  1 +
 block/blk-flush.c                      |  2 +
 block/blk-mq-debugfs.c                 |  4 +-
 block/blk-mq-tag.c                     | 91 +++++++++++++++++++++++++-
 block/blk-mq-tag.h                     |  8 +--
 block/blk-mq.c                         | 91 +++++++++++++++++++-------
 block/blk-mq.h                         | 11 +++-
 drivers/scsi/hisi_sas/hisi_sas.h       |  3 +-
 drivers/scsi/hisi_sas/hisi_sas_main.c  | 43 ++++++------
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 85 ++++++++++--------------
 drivers/scsi/scsi_lib.c                |  2 +
 include/linux/blk-mq.h                 |  5 +-
 include/linux/blkdev.h                 |  1 +
 include/scsi/scsi_host.h               |  3 +
 14 files changed, 242 insertions(+), 108 deletions(-)

-- 
2.17.1



* [PATCH RFC 1/5] blk-mq: Remove some unused function arguments
  2019-11-13 13:36 [PATCH RFC 0/5] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
@ 2019-11-13 13:36 ` John Garry
  2019-11-13 13:58   ` Hannes Reinecke
  2019-11-13 13:36 ` [PATCH RFC 2/5] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: John Garry @ 2019-11-13 13:36 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang66, John Garry

The struct blk_mq_hw_ctx * argument in blk_mq_put_tag(),
blk_mq_poll_nsecs(), and blk_mq_poll_hybrid_sleep() is unused, so remove
it.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-tag.c |  4 ++--
 block/blk-mq-tag.h |  3 +--
 block/blk-mq.c     | 10 ++++------
 block/blk-mq.h     |  2 +-
 4 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 008388e82b5c..53b4a9414fbd 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -191,8 +191,8 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	return tag + tag_offset;
 }
 
-void blk_mq_put_tag(struct blk_mq_hw_ctx *hctx, struct blk_mq_tags *tags,
-		    struct blk_mq_ctx *ctx, unsigned int tag)
+void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
+		    unsigned int tag)
 {
 	if (!blk_mq_tag_is_reserved(tags, tag)) {
 		const int real_tag = tag - tags->nr_reserved_tags;
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0b5a5a..b16f138e75f8 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -26,8 +26,7 @@ extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int r
 extern void blk_mq_free_tags(struct blk_mq_tags *tags);
 
 extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
-extern void blk_mq_put_tag(struct blk_mq_hw_ctx *hctx, struct blk_mq_tags *tags,
-			   struct blk_mq_ctx *ctx, unsigned int tag);
+extern void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx, unsigned int tag);
 extern bool blk_mq_has_free_tags(struct blk_mq_tags *tags);
 extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
 					struct blk_mq_tags **tags,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6e3b15f70cd7..16aa20d23b67 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -499,9 +499,9 @@ static void __blk_mq_free_request(struct request *rq)
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
 	if (rq->tag != -1)
-		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
+		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
-		blk_mq_put_tag(hctx, hctx->sched_tags, ctx, sched_tag);
+		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
 	blk_queue_exit(q);
 }
@@ -3354,7 +3354,6 @@ static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb)
 }
 
 static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
-				       struct blk_mq_hw_ctx *hctx,
 				       struct request *rq)
 {
 	unsigned long ret = 0;
@@ -3387,7 +3386,6 @@ static unsigned long blk_mq_poll_nsecs(struct request_queue *q,
 }
 
 static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
-				     struct blk_mq_hw_ctx *hctx,
 				     struct request *rq)
 {
 	struct hrtimer_sleeper hs;
@@ -3407,7 +3405,7 @@ static bool blk_mq_poll_hybrid_sleep(struct request_queue *q,
 	if (q->poll_nsec > 0)
 		nsecs = q->poll_nsec;
 	else
-		nsecs = blk_mq_poll_nsecs(q, hctx, rq);
+		nsecs = blk_mq_poll_nsecs(q, rq);
 
 	if (!nsecs)
 		return false;
@@ -3462,7 +3460,7 @@ static bool blk_mq_poll_hybrid(struct request_queue *q,
 			return false;
 	}
 
-	return blk_mq_poll_hybrid_sleep(q, hctx, rq);
+	return blk_mq_poll_hybrid_sleep(q, rq);
 }
 
 /**
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 32c62c64e6c2..78d38b5f2793 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -208,7 +208,7 @@ static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
 static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
 {
-	blk_mq_put_tag(hctx, hctx->tags, rq->mq_ctx, rq->tag);
+	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
 	rq->tag = -1;
 
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
-- 
2.17.1



* [PATCH RFC 2/5] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED
  2019-11-13 13:36 [PATCH RFC 0/5] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
  2019-11-13 13:36 ` [PATCH RFC 1/5] blk-mq: Remove some unused function arguments John Garry
@ 2019-11-13 13:36 ` John Garry
  2019-11-13 13:58   ` Hannes Reinecke
  2019-11-13 13:36 ` [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset John Garry
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 22+ messages in thread
From: John Garry @ 2019-11-13 13:36 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang66, John Garry

From: Ming Lei <ming.lei@redhat.com>

BLK_MQ_F_TAG_SHARED actually means that the tag set is shared among request
queues, all of which should belong to LUNs attached to the same HBA.

So rename it to make the point explicit.

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-mq-debugfs.c |  2 +-
 block/blk-mq-tag.c     |  2 +-
 block/blk-mq-tag.h     |  4 ++--
 block/blk-mq.c         | 20 ++++++++++----------
 include/linux/blk-mq.h |  2 +-
 5 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..33a40ae1d60f 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -236,7 +236,7 @@ static const char *const alloc_policy_name[] = {
 #define HCTX_FLAG_NAME(name) [ilog2(BLK_MQ_F_##name)] = #name
 static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(SHOULD_MERGE),
-	HCTX_FLAG_NAME(TAG_SHARED),
+	HCTX_FLAG_NAME(TAG_QUEUE_SHARED),
 	HCTX_FLAG_NAME(BLOCKING),
 	HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 53b4a9414fbd..d7aa23c82dbf 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -73,7 +73,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 {
 	unsigned int depth, users;
 
-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
+	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return true;
 	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
 		return true;
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index b16f138e75f8..88b85daa4976 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -54,7 +54,7 @@ extern void __blk_mq_tag_idle(struct blk_mq_hw_ctx *);
 
 static inline bool blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return false;
 
 	return __blk_mq_tag_busy(hctx);
@@ -62,7 +62,7 @@ static inline bool blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 
 static inline void blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
-	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED))
+	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
 		return;
 
 	__blk_mq_tag_idle(hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 16aa20d23b67..6b39cf0efdcd 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -302,7 +302,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		rq->tag = -1;
 		rq->internal_tag = tag;
 	} else {
-		if (data->hctx->flags & BLK_MQ_F_TAG_SHARED) {
+		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
 			rq_flags = RQF_MQ_INFLIGHT;
 			atomic_inc(&data->hctx->nr_active);
 		}
@@ -1118,7 +1118,7 @@ static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx *hctx,
 	wait_queue_entry_t *wait;
 	bool ret;
 
-	if (!(hctx->flags & BLK_MQ_F_TAG_SHARED)) {
+	if (!(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
 		blk_mq_sched_mark_restart_hctx(hctx);
 
 		/*
@@ -1249,7 +1249,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 				 * For non-shared tags, the RESTART check
 				 * will suffice.
 				 */
-				if (hctx->flags & BLK_MQ_F_TAG_SHARED)
+				if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
 					no_tag = true;
 				break;
 			}
@@ -2358,7 +2358,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	spin_lock_init(&hctx->lock);
 	INIT_LIST_HEAD(&hctx->dispatch);
 	hctx->queue = q;
-	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
+	hctx->flags = set->flags & ~BLK_MQ_F_TAG_QUEUE_SHARED;
 
 	INIT_LIST_HEAD(&hctx->hctx_list);
 
@@ -2575,9 +2575,9 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (shared)
-			hctx->flags |= BLK_MQ_F_TAG_SHARED;
+			hctx->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		else
-			hctx->flags &= ~BLK_MQ_F_TAG_SHARED;
+			hctx->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 	}
 }
 
@@ -2603,7 +2603,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 	list_del_rcu(&q->tag_set_list);
 	if (list_is_singular(&set->tag_list)) {
 		/* just transitioned to unshared */
-		set->flags &= ~BLK_MQ_F_TAG_SHARED;
+		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
 		blk_mq_update_tag_set_depth(set, false);
 	}
@@ -2620,12 +2620,12 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
 	 */
 	if (!list_empty(&set->tag_list) &&
-	    !(set->flags & BLK_MQ_F_TAG_SHARED)) {
-		set->flags |= BLK_MQ_F_TAG_SHARED;
+	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
+		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
 		blk_mq_update_tag_set_depth(set, true);
 	}
-	if (set->flags & BLK_MQ_F_TAG_SHARED)
+	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
 		queue_set_hctx_shared(q, true);
 	list_add_tail_rcu(&q->tag_set_list, &set->tag_list);
 
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 0bf056de5cc3..147185394a25 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -225,7 +225,7 @@ struct blk_mq_ops {
 
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
-	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_TAG_QUEUE_SHARED	= 1 << 1,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
-- 
2.17.1



* [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-13 13:36 [PATCH RFC 0/5] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
  2019-11-13 13:36 ` [PATCH RFC 1/5] blk-mq: Remove some unused function arguments John Garry
  2019-11-13 13:36 ` [PATCH RFC 2/5] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
@ 2019-11-13 13:36 ` John Garry
  2019-11-13 14:06   ` Hannes Reinecke
  2019-11-13 13:36 ` [PATCH RFC 4/5] scsi: Add template flag 'host_tagset' John Garry
  2019-11-13 13:36 ` [PATCH RFC 5/5] scsi: hisi_sas: Switch v3 hw to MQ John Garry
  4 siblings, 1 reply; 22+ messages in thread
From: John Garry @ 2019-11-13 13:36 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang66, John Garry

Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
multiple reply queues with a single hostwide tag space.

In addition, these drivers want to use managed interrupt assignment via
pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
CPU hotplug may cause in-flight IO completions to go unserviced when an
interrupt is shut down.

To solve that problem, Ming's patchset to drain hctxs should ensure that no
in-flight IOs are missed [1].

However, to take advantage of that patchset, we need to map the HBA HW
queues to blk mq hctx's; to do that, we need to expose the HBA HW queues.

In making that transition, the per-SCSI-command request tags are no longer
unique per Scsi host - they are only unique per hctx. As such, the HBA LLDD
would have to generate a unique tag internally, which has a certain
performance overhead.

Another problem is that blk-mq assumes the host may accept
(Scsi_host.can_queue * #hw queues) commands. In [2], we removed the Scsi
host busy counter, which would have stopped the LLDD from being sent more
than .can_queue commands; however, we should still ensure that the block
layer does not issue more than .can_queue commands to the Scsi host.

To solve this problem, introduce a shared tag set per blk_mq_tag_set, which
may be requested when allocating the tagset.

New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
tagset.

This is based on work originally from Ming Lei in [3].

[0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
[1] https://lore.kernel.org/linux-block/20191014015043.25029-1-ming.lei@redhat.com/
[2] https://lore.kernel.org/linux-scsi/20191025065855.6309-1-ming.lei@redhat.com/
[3] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/

Signed-off-by: John Garry <john.garry@huawei.com>
---
 block/blk-core.c       |  1 +
 block/blk-flush.c      |  2 +
 block/blk-mq-debugfs.c |  2 +-
 block/blk-mq-tag.c     | 85 ++++++++++++++++++++++++++++++++++++++++++
 block/blk-mq-tag.h     |  1 +
 block/blk-mq.c         | 61 +++++++++++++++++++++++++-----
 block/blk-mq.h         |  9 +++++
 include/linux/blk-mq.h |  3 ++
 include/linux/blkdev.h |  1 +
 9 files changed, 155 insertions(+), 10 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d5e668ec751b..79eb8983ed45 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -116,6 +116,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	INIT_HLIST_NODE(&rq->hash);
 	RB_CLEAR_NODE(&rq->rb_node);
 	rq->tag = -1;
+	rq->shared_tag = -1;
 	rq->internal_tag = -1;
 	rq->start_time_ns = ktime_get_ns();
 	rq->part = NULL;
diff --git a/block/blk-flush.c b/block/blk-flush.c
index 1eec9cbe5a0a..b9ad9a5978f5 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -228,6 +228,7 @@ static void flush_end_io(struct request *flush_rq, blk_status_t error)
 	if (!q->elevator) {
 		blk_mq_tag_set_rq(hctx, flush_rq->tag, fq->orig_rq);
 		flush_rq->tag = -1;
+		flush_rq->shared_tag = -1;
 	} else {
 		blk_mq_put_driver_tag(flush_rq);
 		flush_rq->internal_tag = -1;
@@ -309,6 +310,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
 	if (!q->elevator) {
 		fq->orig_rq = first_rq;
 		flush_rq->tag = first_rq->tag;
+		flush_rq->shared_tag = first_rq->shared_tag;
 		blk_mq_tag_set_rq(flush_rq->mq_hctx, first_rq->tag, flush_rq);
 	} else {
 		flush_rq->internal_tag = first_rq->internal_tag;
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 33a40ae1d60f..dc90c42aeb9a 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -339,7 +339,7 @@ int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq)
 	blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
 		       ARRAY_SIZE(rqf_name));
 	seq_printf(m, ", .state=%s", blk_mq_rq_state_name(blk_mq_rq_state(rq)));
-	seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
+	seq_printf(m, ", .tag=%d, .shared_tag=%d, .internal_tag=%d", rq->tag, rq->shared_tag,
 		   rq->internal_tag);
 	if (mq_ops->show_rq)
 		mq_ops->show_rq(m, rq);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d7aa23c82dbf..0a6c8a6b05dd 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -191,6 +191,91 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
 	return tag + tag_offset;
 }

+ /* We could factor this out */
+unsigned int blk_mq_get_shared_tag(struct blk_mq_alloc_data *data)
+{
+	struct blk_mq_tags *tags = blk_mq_shared_tags_from_data(data);
+	struct sbitmap_queue *bt;
+	struct sbq_wait_state *ws;
+	DEFINE_SBQ_WAIT(wait);
+	unsigned int tag_offset;
+	int tag;
+
+	if (data->flags & BLK_MQ_REQ_RESERVED) {
+		if (unlikely(!tags->nr_reserved_tags)) {
+			WARN_ON_ONCE(1);
+			return BLK_MQ_TAG_FAIL;
+		}
+		bt = &tags->breserved_tags;
+		tag_offset = 0;
+	} else {
+		bt = &tags->bitmap_tags;
+		tag_offset = tags->nr_reserved_tags;
+	}
+
+	tag = __blk_mq_get_tag(data, bt);
+	if (tag != -1)
+		goto found_tag;
+
+	if (data->flags & BLK_MQ_REQ_NOWAIT)
+		return BLK_MQ_TAG_FAIL;
+
+	ws = bt_wait_ptr(bt, data->hctx);
+	do {
+		struct sbitmap_queue *bt_prev;
+
+		/*
+		 * We're out of tags on this hardware queue, kick any
+		 * pending IO submits before going to sleep waiting for
+		 * some to complete.
+		 */
+		blk_mq_run_hw_queues(data->q, false);
+
+		/*
+		 * Retry tag allocation after running the hardware queue,
+		 * as running the queue may also have found completions.
+		 */
+		tag = __blk_mq_get_tag(data, bt);
+		if (tag != -1)
+			break;
+
+		sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
+
+		tag = __blk_mq_get_tag(data, bt);
+		if (tag != -1)
+			break;
+
+
+		bt_prev = bt;
+		io_schedule();
+
+		sbitmap_finish_wait(bt, ws, &wait);
+
+		data->ctx = blk_mq_get_ctx(data->q);
+		data->hctx = blk_mq_map_queue(data->q, data->cmd_flags,
+						data->ctx);
+		tags = blk_mq_tags_from_data(data);
+		if (data->flags & BLK_MQ_REQ_RESERVED)
+			bt = &tags->breserved_tags;
+		else
+			bt = &tags->bitmap_tags;
+
+		/*
+		 * If destination hw queue is changed, fake wake up on
+		 * previous queue for compensating the wake up miss, so
+		 * other allocations on previous queue won't be starved.
+		 */
+		if (bt != bt_prev)
+			sbitmap_queue_wake_up(bt_prev);
+
+		ws = bt_wait_ptr(bt, data->hctx);
+	} while (1);
+
+	sbitmap_finish_wait(bt, ws, &wait);
+
+found_tag:
+	return tag + tag_offset;
+}
+
 void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx,
 		    unsigned int tag)
 {
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 88b85daa4976..82ff8faa70f7 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -26,6 +26,7 @@ extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags, unsigned int r
 extern void blk_mq_free_tags(struct blk_mq_tags *tags);
 
 extern unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data);
+extern unsigned int blk_mq_get_shared_tag(struct blk_mq_alloc_data *data);
 extern void blk_mq_put_tag(struct blk_mq_tags *tags, struct blk_mq_ctx *ctx, unsigned int tag);
 extern bool blk_mq_has_free_tags(struct blk_mq_tags *tags);
 extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6b39cf0efdcd..792eae37dc44 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -292,7 +292,7 @@ static inline bool blk_mq_need_time_stamp(struct request *rq)
 }
 
 static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
-		unsigned int tag, unsigned int op, u64 alloc_time_ns)
+		unsigned int tag, unsigned int shared_tag, unsigned int op, u64 alloc_time_ns)
 {
 	struct blk_mq_tags *tags = blk_mq_tags_from_data(data);
 	struct request *rq = tags->static_rqs[tag];
@@ -300,6 +300,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 
 	if (data->flags & BLK_MQ_REQ_INTERNAL) {
 		rq->tag = -1;
+		rq->shared_tag = -1;
 		rq->internal_tag = tag;
 	} else {
 		if (data->hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) {
@@ -307,6 +308,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 			atomic_inc(&data->hctx->nr_active);
 		}
 		rq->tag = tag;
+		rq->shared_tag = shared_tag;
 		rq->internal_tag = -1;
 		data->hctx->tags->rqs[rq->tag] = rq;
 	}
@@ -359,7 +361,7 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 {
 	struct elevator_queue *e = q->elevator;
 	struct request *rq;
-	unsigned int tag;
+	unsigned int tag, shared_tag = BLK_MQ_TAG_FAIL;
 	bool clear_ctx_on_error = false;
 	u64 alloc_time_ns = 0;
 
@@ -396,15 +398,17 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 		blk_mq_tag_busy(data->hctx);
 	}
 
-	tag = blk_mq_get_tag(data);
-	if (tag == BLK_MQ_TAG_FAIL) {
-		if (clear_ctx_on_error)
-			data->ctx = NULL;
-		blk_queue_exit(q);
-		return NULL;
+	if (data->hctx->shared_tags) {
+		shared_tag = blk_mq_get_shared_tag(data);
+		if (shared_tag == BLK_MQ_TAG_FAIL)
+			goto err_shared_tag;
 	}
 
-	rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns);
+	tag = blk_mq_get_tag(data);
+	if (tag == BLK_MQ_TAG_FAIL)
+		goto err_tag;
+
+	rq = blk_mq_rq_ctx_init(data, tag, shared_tag, data->cmd_flags, alloc_time_ns);
 	if (!op_is_flush(data->cmd_flags)) {
 		rq->elv.icq = NULL;
 		if (e && e->type->ops.prepare_request) {
@@ -417,6 +421,15 @@ static struct request *blk_mq_get_request(struct request_queue *q,
 	}
 	data->hctx->queued++;
 	return rq;
+
+err_tag:
+	if (shared_tag != BLK_MQ_TAG_FAIL)
+		blk_mq_put_tag(data->hctx->shared_tags, data->ctx, shared_tag);
+err_shared_tag:
+	if (clear_ctx_on_error)
+		data->ctx = NULL;
+	blk_queue_exit(q);
+	return NULL;
 }
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
@@ -498,6 +511,8 @@ static void __blk_mq_free_request(struct request *rq)
 
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
+	if (rq->shared_tag != -1)
+		blk_mq_put_tag(hctx->shared_tags, ctx, rq->shared_tag);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
@@ -1070,6 +1085,14 @@ bool blk_mq_get_driver_tag(struct request *rq)
 		data.flags |= BLK_MQ_REQ_RESERVED;
 
 	shared = blk_mq_tag_busy(data.hctx);
+	if (rq && rq->mq_hctx && rq->mq_hctx->shared_tags) {
+		rq->shared_tag = blk_mq_get_shared_tag(&data);
+		if (rq->shared_tag == BLK_MQ_TAG_FAIL) {
+			blk_mq_put_tag(rq->mq_hctx->tags, rq->mq_ctx, rq->tag);
+			rq->tag = -1;
+			goto done;
+		}
+	}
 	rq->tag = blk_mq_get_tag(&data);
 	if (rq->tag >= 0) {
 		if (shared) {
@@ -1077,6 +1100,9 @@ bool blk_mq_get_driver_tag(struct request *rq)
 			atomic_inc(&data.hctx->nr_active);
 		}
 		data.hctx->tags->rqs[rq->tag] = rq;
+	} else if (rq->shared_tag >= 0) {
+		blk_mq_put_tag(rq->mq_hctx->tags, rq->mq_ctx, rq->tag);
+		rq->shared_tag = -1;
 	}
 
 done:
@@ -2317,6 +2343,7 @@ static int blk_mq_init_hctx(struct request_queue *q,
 	cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);
 
 	hctx->tags = set->tags[hctx_idx];
+	hctx->shared_tags = set->shared_tags;
 
 	if (set->ops->init_hctx &&
 	    set->ops->init_hctx(hctx, set->driver_data, hctx_idx))
@@ -3100,6 +3127,22 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (ret)
 		goto out_free_mq_map;
 
+	if (set->flags & BLK_MQ_F_TAG_HCTX_SHARED) {
+		int node = blk_mq_hw_queue_to_node(&set->map[HCTX_TYPE_DEFAULT], 0);
+		int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(set->flags);
+
+		if (node == NUMA_NO_NODE)
+			node = set->numa_node;
+
+		set->shared_tags = blk_mq_init_tags(set->queue_depth,
+						    set->reserved_tags,
+						    node, alloc_policy);
+		if (!set->shared_tags) {
+			ret = -ENOMEM;
+			goto out_free_mq_map;
+		}
+	}
+
 	mutex_init(&set->tag_list_lock);
 	INIT_LIST_HEAD(&set->tag_list);
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 78d38b5f2793..c328d335de7d 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -174,6 +174,14 @@ static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data
 	return data->hctx->tags;
 }
 
+static inline struct blk_mq_tags *blk_mq_shared_tags_from_data(struct blk_mq_alloc_data *data)
+{
+	if (data->flags & BLK_MQ_REQ_INTERNAL)
+		return data->hctx->sched_tags;
+
+	return data->hctx->shared_tags;
+}
+
 static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
 {
 	return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
@@ -210,6 +218,7 @@ static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 {
 	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
 	rq->tag = -1;
+	rq->shared_tag = -1;
 
 	if (rq->rq_flags & RQF_MQ_INFLIGHT) {
 		rq->rq_flags &= ~RQF_MQ_INFLIGHT;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 147185394a25..d3b402bd01a9 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -46,6 +46,7 @@ struct blk_mq_hw_ctx {
 	atomic_t		wait_index;
 
 	struct blk_mq_tags	*tags;
+	struct blk_mq_tags	*shared_tags;
 	struct blk_mq_tags	*sched_tags;
 
 	unsigned long		queued;
@@ -109,6 +110,7 @@ struct blk_mq_tag_set {
 	unsigned int		flags;		/* BLK_MQ_F_* */
 	void			*driver_data;
 
+	struct blk_mq_tags	*shared_tags;
 	struct blk_mq_tags	**tags;
 
 	struct mutex		tag_list_lock;
@@ -226,6 +228,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_QUEUE_SHARED	= 1 << 1,
+	BLK_MQ_F_TAG_HCTX_SHARED	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f3ea78b0c91c..a4caa6407b3a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -138,6 +138,7 @@ struct request {
 	req_flags_t rq_flags;
 
 	int tag;
+	int shared_tag;
 	int internal_tag;
 
 	/* the following two fields are internal, NEVER access directly */
-- 
2.17.1



* [PATCH RFC 4/5] scsi: Add template flag 'host_tagset'
  2019-11-13 13:36 [PATCH RFC 0/5] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (2 preceding siblings ...)
  2019-11-13 13:36 ` [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset John Garry
@ 2019-11-13 13:36 ` John Garry
  2019-11-13 13:36 ` [PATCH RFC 5/5] scsi: hisi_sas: Switch v3 hw to MQ John Garry
  4 siblings, 0 replies; 22+ messages in thread
From: John Garry @ 2019-11-13 13:36 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang66, John Garry

From: Hannes Reinecke <hare@suse.com>

Add a host template flag 'host_tagset' so that a hostwide tagset can be
shared across multiple reply queues once the SCSI device's reply queues
are converted to blk-mq hw queues.

Signed-off-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/scsi/scsi_lib.c  | 2 ++
 include/scsi/scsi_host.h | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index dc210b9d4896..4baa5c01a0bc 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1900,6 +1900,8 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 	shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	shost->tag_set.flags |=
 		BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
+	if (shost->hostt->host_tagset)
+		shost->tag_set.flags |= BLK_MQ_F_TAG_HCTX_SHARED;
 	shost->tag_set.driver_data = shost;
 
 	return blk_mq_alloc_tag_set(&shost->tag_set);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 2c3f0c58869b..a90d7e38366b 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -442,6 +442,9 @@ struct scsi_host_template {
 	/* True if the low-level driver supports blk-mq only */
 	unsigned force_blk_mq:1;
 
+	/* True if the host uses host-wide tagspace */
+	unsigned host_tagset:1;
+
 	/*
 	 * Countdown for host blocking with no commands outstanding.
 	 */
-- 
2.17.1



* [PATCH RFC 5/5] scsi: hisi_sas: Switch v3 hw to MQ
  2019-11-13 13:36 [PATCH RFC 0/5] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
                   ` (3 preceding siblings ...)
  2019-11-13 13:36 ` [PATCH RFC 4/5] scsi: Add template flag 'host_tagset' John Garry
@ 2019-11-13 13:36 ` John Garry
  4 siblings, 0 replies; 22+ messages in thread
From: John Garry @ 2019-11-13 13:36 UTC (permalink / raw)
  To: axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang66, John Garry

Now that the block layer provides a shared tag set, we can switch the driver
to expose all HW queues.

Signed-off-by: John Garry <john.garry@huawei.com>
---
 drivers/scsi/hisi_sas/hisi_sas.h       |  3 +-
 drivers/scsi/hisi_sas/hisi_sas_main.c  | 43 +++++++------
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c | 85 +++++++++++---------------
 3 files changed, 59 insertions(+), 72 deletions(-)

diff --git a/drivers/scsi/hisi_sas/hisi_sas.h b/drivers/scsi/hisi_sas/hisi_sas.h
index 233c73e01246..0405602df2a4 100644
--- a/drivers/scsi/hisi_sas/hisi_sas.h
+++ b/drivers/scsi/hisi_sas/hisi_sas.h
@@ -8,6 +8,8 @@
 #define _HISI_SAS_H_
 
 #include <linux/acpi.h>
+#include <linux/blk-mq.h>
+#include <linux/blk-mq-pci.h>
 #include <linux/clk.h>
 #include <linux/debugfs.h>
 #include <linux/dmapool.h>
@@ -431,7 +433,6 @@ struct hisi_hba {
 	u32 intr_coal_count;	/* Interrupt count to coalesce */
 
 	int cq_nvecs;
-	unsigned int *reply_map;
 
 	/* bist */
 	enum sas_linkrate debugfs_bist_linkrate;
diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c
index 52d5b9afbb18..cca94b296065 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_main.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_main.c
@@ -183,12 +183,13 @@ static void hisi_sas_slot_index_set(struct hisi_hba *hisi_hba, int slot_idx)
 static int hisi_sas_slot_index_alloc(struct hisi_hba *hisi_hba,
 				     struct scsi_cmnd *scsi_cmnd)
 {
-	int index;
 	void *bitmap = hisi_hba->slot_index_tags;
+	struct Scsi_Host *shost = hisi_hba->shost;
 	unsigned long flags;
+	int index;
 
-	if (scsi_cmnd)
-		return scsi_cmnd->request->tag;
+	if (shost->hostt->host_tagset && scsi_cmnd)
+		return scsi_cmnd->request->shared_tag;
 
 	spin_lock_irqsave(&hisi_hba->lock, flags);
 	index = find_next_zero_bit(bitmap, hisi_hba->slot_index_count,
@@ -421,6 +422,7 @@ static int hisi_sas_task_prep(struct sas_task *task,
 	struct device *dev = hisi_hba->dev;
 	int dlvry_queue_slot, dlvry_queue, rc, slot_idx;
 	int n_elem = 0, n_elem_dif = 0, n_elem_req = 0;
+	struct scsi_cmnd *scmd = NULL;
 	struct hisi_sas_dq *dq;
 	unsigned long flags;
 	int wr_q_index;
@@ -436,10 +438,23 @@ static int hisi_sas_task_prep(struct sas_task *task,
 		return -ECOMM;
 	}
 
-	if (hisi_hba->reply_map) {
-		int cpu = raw_smp_processor_id();
-		unsigned int dq_index = hisi_hba->reply_map[cpu];
+	if (task->uldd_task) {
+		struct ata_queued_cmd *qc;
+
+		if (dev_is_sata(device)) {
+			qc = task->uldd_task;
+			scmd = qc->scsicmd;
+		} else {
+			scmd = task->uldd_task;
+		}
+	}
 
+	if (scmd) {
+		unsigned int dq_index;
+		u32 blk_tag;
+
+		blk_tag = blk_mq_unique_tag(scmd->request);
+		dq_index = blk_mq_unique_tag_to_hwq(blk_tag);
 		*dq_pointer = dq = &hisi_hba->dq[dq_index];
 	} else {
 		*dq_pointer = dq = sas_dev->dq;
@@ -468,21 +483,9 @@ static int hisi_sas_task_prep(struct sas_task *task,
 
 	if (hisi_hba->hw->slot_index_alloc)
 		rc = hisi_hba->hw->slot_index_alloc(hisi_hba, device);
-	else {
-		struct scsi_cmnd *scsi_cmnd = NULL;
-
-		if (task->uldd_task) {
-			struct ata_queued_cmd *qc;
+	else
+		rc = hisi_sas_slot_index_alloc(hisi_hba, scmd);
 
-			if (dev_is_sata(device)) {
-				qc = task->uldd_task;
-				scsi_cmnd = qc->scsicmd;
-			} else {
-				scsi_cmnd = task->uldd_task;
-			}
-		}
-		rc  = hisi_sas_slot_index_alloc(hisi_hba, scsi_cmnd);
-	}
 	if (rc < 0)
 		goto err_out_dif_dma_unmap;
 
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index bf5d5f138437..98f2bd24a23b 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -2353,66 +2353,35 @@ static irqreturn_t cq_interrupt_v3_hw(int irq_no, void *p)
 	return IRQ_HANDLED;
 }
 
-static void setup_reply_map_v3_hw(struct hisi_hba *hisi_hba, int nvecs)
+static int interrupt_preinit_v3_hw(struct hisi_hba *hisi_hba)
 {
-	const struct cpumask *mask;
-	int queue, cpu;
-
-	for (queue = 0; queue < nvecs; queue++) {
-		struct hisi_sas_cq *cq = &hisi_hba->cq[queue];
+	int vectors;
+	int max_msi = HISI_SAS_MSI_COUNT_V3_HW, min_msi;
+	struct Scsi_Host *shost = hisi_hba->shost;
+	struct irq_affinity desc = {
+		.pre_vectors = BASE_VECTORS_V3_HW,
+	};
+
+	min_msi = MIN_AFFINE_VECTORS_V3_HW;
+	vectors = pci_alloc_irq_vectors_affinity(hisi_hba->pci_dev,
+						 min_msi, max_msi,
+						 PCI_IRQ_MSI |
+						 PCI_IRQ_AFFINITY,
+						 &desc);
+	if (vectors < 0)
+		return -ENOENT;
 
-		mask = pci_irq_get_affinity(hisi_hba->pci_dev, queue +
-					    BASE_VECTORS_V3_HW);
-		if (!mask)
-			goto fallback;
-		cq->pci_irq_mask = mask;
-		for_each_cpu(cpu, mask)
-			hisi_hba->reply_map[cpu] = queue;
-	}
-	return;
+	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW;
+	shost->nr_hw_queues = hisi_hba->cq_nvecs;
 
-fallback:
-	for_each_possible_cpu(cpu)
-		hisi_hba->reply_map[cpu] = cpu % hisi_hba->queue_count;
-	/* Don't clean all CQ masks */
+	return 0;
 }
 
 static int interrupt_init_v3_hw(struct hisi_hba *hisi_hba)
 {
 	struct device *dev = hisi_hba->dev;
 	struct pci_dev *pdev = hisi_hba->pci_dev;
-	int vectors, rc, i;
-	int max_msi = HISI_SAS_MSI_COUNT_V3_HW, min_msi;
-
-	if (auto_affine_msi_experimental) {
-		struct irq_affinity desc = {
-			.pre_vectors = BASE_VECTORS_V3_HW,
-		};
-
-		min_msi = MIN_AFFINE_VECTORS_V3_HW;
-
-		hisi_hba->reply_map = devm_kcalloc(dev, nr_cpu_ids,
-						   sizeof(unsigned int),
-						   GFP_KERNEL);
-		if (!hisi_hba->reply_map)
-			return -ENOMEM;
-		vectors = pci_alloc_irq_vectors_affinity(hisi_hba->pci_dev,
-							 min_msi, max_msi,
-							 PCI_IRQ_MSI |
-							 PCI_IRQ_AFFINITY,
-							 &desc);
-		if (vectors < 0)
-			return -ENOENT;
-		setup_reply_map_v3_hw(hisi_hba, vectors - BASE_VECTORS_V3_HW);
-	} else {
-		min_msi = max_msi;
-		vectors = pci_alloc_irq_vectors(hisi_hba->pci_dev, min_msi,
-						max_msi, PCI_IRQ_MSI);
-		if (vectors < 0)
-			return vectors;
-	}
-
-	hisi_hba->cq_nvecs = vectors - BASE_VECTORS_V3_HW;
+	int rc, i;
 
 	rc = devm_request_irq(dev, pci_irq_vector(pdev, 1),
 			      int_phy_up_down_bcast_v3_hw, 0,
@@ -3057,6 +3026,15 @@ static int debugfs_set_bist_v3_hw(struct hisi_hba *hisi_hba, bool enable)
 	return 0;
 }
 
+static int hisi_sas_map_queues(struct Scsi_Host *shost)
+{
+	struct hisi_hba *hisi_hba = shost_priv(shost);
+	struct blk_mq_queue_map *qmap = &shost->tag_set.map[HCTX_TYPE_DEFAULT];
+
+	return blk_mq_pci_map_queues(qmap, hisi_hba->pci_dev,
+					     BASE_VECTORS_V3_HW);
+}
+
 static struct scsi_host_template sht_v3_hw = {
 	.name			= DRV_NAME,
 	.module			= THIS_MODULE,
@@ -3065,6 +3043,7 @@ static struct scsi_host_template sht_v3_hw = {
 	.slave_configure	= hisi_sas_slave_configure,
 	.scan_finished		= hisi_sas_scan_finished,
 	.scan_start		= hisi_sas_scan_start,
+	.map_queues		= hisi_sas_map_queues,
 	.change_queue_depth	= sas_change_queue_depth,
 	.bios_param		= sas_bios_param,
 	.this_id		= -1,
@@ -3078,6 +3057,7 @@ static struct scsi_host_template sht_v3_hw = {
 	.shost_attrs		= host_attrs_v3_hw,
 	.tag_alloc_policy	= BLK_TAG_ALLOC_RR,
 	.host_reset             = hisi_sas_host_reset,
+	.host_tagset		= 1,
 };
 
 static const struct hisi_sas_hw hisi_sas_v3_hw = {
@@ -3249,6 +3229,9 @@ hisi_sas_v3_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (hisi_sas_debugfs_enable)
 		hisi_sas_debugfs_init(hisi_hba);
 
+	rc = interrupt_preinit_v3_hw(hisi_hba);
+	if (rc)
+		goto err_out_ha;
 	rc = scsi_add_host(shost, dev);
 	if (rc)
 		goto err_out_ha;
-- 
2.17.1



* Re: [PATCH RFC 1/5] blk-mq: Remove some unused function arguments
  2019-11-13 13:36 ` [PATCH RFC 1/5] blk-mq: Remove some unused function arguments John Garry
@ 2019-11-13 13:58   ` Hannes Reinecke
  0 siblings, 0 replies; 22+ messages in thread
From: Hannes Reinecke @ 2019-11-13 13:58 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang66

On 11/13/19 2:36 PM, John Garry wrote:
> The struct blk_mq_hw_ctx * argument in blk_mq_put_tag(),
> blk_mq_poll_nsecs(), and blk_mq_poll_hybrid_sleep() is unused, so remove
> it.
> 
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  block/blk-mq-tag.c |  4 ++--
>  block/blk-mq-tag.h |  3 +--
>  block/blk-mq.c     | 10 ++++------
>  block/blk-mq.h     |  2 +-
>  4 files changed, 8 insertions(+), 11 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer


* Re: [PATCH RFC 2/5] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED
  2019-11-13 13:36 ` [PATCH RFC 2/5] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
@ 2019-11-13 13:58   ` Hannes Reinecke
  0 siblings, 0 replies; 22+ messages in thread
From: Hannes Reinecke @ 2019-11-13 13:58 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang66

On 11/13/19 2:36 PM, John Garry wrote:
> From: Ming Lei <ming.lei@redhat.com>
> 
> BLK_MQ_F_TAG_SHARED actually means that tags is shared among request
> queues, all of which should belong to LUNs attached to same HBA.
> 
> So rename it to make the point explicitly.
> 
> Suggested-by: Bart Van Assche <bvanassche@acm.org>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  block/blk-mq-debugfs.c |  2 +-
>  block/blk-mq-tag.c     |  2 +-
>  block/blk-mq-tag.h     |  4 ++--
>  block/blk-mq.c         | 20 ++++++++++----------
>  include/linux/blk-mq.h |  2 +-
>  5 files changed, 15 insertions(+), 15 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer


* Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-13 13:36 ` [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset John Garry
@ 2019-11-13 14:06   ` Hannes Reinecke
  2019-11-13 14:57     ` John Garry
  0 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2019-11-13 14:06 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang66

On 11/13/19 2:36 PM, John Garry wrote:
> Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
> multiple reply queues with single hostwide tags.
> 
> In addition, these drivers want to use interrupt assignment in
> pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
> CPU hotplug may cause in-flight IO completion to not be serviced when an
> interrupt is shutdown.
> 
> To solve that problem, Ming's patchset to drain hctx's should ensure no
> IOs are missed in-flight [1].
> 
> However, to take advantage of that patchset, we need to map the HBA HW
> queues to blk mq hctx's; to do that, we need to expose the HBA HW queues.
> 
> In making that transition, the per-SCSI command request tags are no
> longer unique per Scsi host - they are just unique per hctx. As such, the
> HBA LLDD would have to generate this tag internally, which has a certain
> performance overhead.
> 
> However another problem is that blk mq assumes the host may accept
> (Scsi_host.can_queue * #hw queue) commands. In [2], we removed the Scsi
> host busy counter, which would stop the LLDD being sent more than
> .can_queue commands; however, we should still ensure that the block layer
> does not issue more than .can_queue commands to the Scsi host.
> 
> To solve this problem, introduce a shared tags per blk_mq_tag_set, which
> may be requested when allocating the tagset.
> 
> New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
> tagset.
> 
> This is based on work originally from Ming Lei in [3].
> 
> [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
> [1] https://lore.kernel.org/linux-block/20191014015043.25029-1-ming.lei@redhat.com/
> [2] https://lore.kernel.org/linux-scsi/20191025065855.6309-1-ming.lei@redhat.com/
> [3] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
> 
> Signed-off-by: John Garry <john.garry@huawei.com>
> ---
>  block/blk-core.c       |  1 +
>  block/blk-flush.c      |  2 +
>  block/blk-mq-debugfs.c |  2 +-
>  block/blk-mq-tag.c     | 85 ++++++++++++++++++++++++++++++++++++++++++
>  block/blk-mq-tag.h     |  1 +
>  block/blk-mq.c         | 61 +++++++++++++++++++++++++-----
>  block/blk-mq.h         |  9 +++++
>  include/linux/blk-mq.h |  3 ++
>  include/linux/blkdev.h |  1 +
>  9 files changed, 155 insertions(+), 10 deletions(-)
> 
[ .. ]
> @@ -396,15 +398,17 @@ static struct request *blk_mq_get_request(struct request_queue *q,
>  		blk_mq_tag_busy(data->hctx);
>  	}
>  
> -	tag = blk_mq_get_tag(data);
> -	if (tag == BLK_MQ_TAG_FAIL) {
> -		if (clear_ctx_on_error)
> -			data->ctx = NULL;
> -		blk_queue_exit(q);
> -		return NULL;
> +	if (data->hctx->shared_tags) {
> +		shared_tag = blk_mq_get_shared_tag(data);
> +		if (shared_tag == BLK_MQ_TAG_FAIL)
> +			goto err_shared_tag;
>  	}
>  
> -	rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns);
> +	tag = blk_mq_get_tag(data);
> +	if (tag == BLK_MQ_TAG_FAIL)
> +		goto err_tag;
> +
> +	rq = blk_mq_rq_ctx_init(data, tag, shared_tag, data->cmd_flags, alloc_time_ns);
>  	if (!op_is_flush(data->cmd_flags)) {
>  		rq->elv.icq = NULL;
>  		if (e && e->type->ops.prepare_request) {
Why do you need to keep a parallel tag accounting between 'normal' and
'shared' tags here?
Isn't it sufficient to get a shared tag only, and use that in lieu of the
'real' one?

I would love to combine both, as then we can easily do a reverse mapping
by using the 'tag' value to lookup the command itself, and can possibly
do the 'scsi_cmd_priv' trick of embedding the LLDD-specific parts within
the command. With this split we'll be wasting quite some memory there,
as the possible 'tag' values are actually nr_hw_queues * shared_tags.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer


* Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-13 14:06   ` Hannes Reinecke
@ 2019-11-13 14:57     ` John Garry
  2019-11-13 15:38       ` Hannes Reinecke
  0 siblings, 1 reply; 22+ messages in thread
From: John Garry @ 2019-11-13 14:57 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang (M)

On 13/11/2019 14:06, Hannes Reinecke wrote:
> On 11/13/19 2:36 PM, John Garry wrote:
>> Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
>> multiple reply queues with single hostwide tags.
>>
>> In addition, these drivers want to use interrupt assignment in
>> pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
>> CPU hotplug may cause in-flight IO completion to not be serviced when an
>> interrupt is shutdown.
>>
>> To solve that problem, Ming's patchset to drain hctx's should ensure no
>> IOs are missed in-flight [1].
>>
>> However, to take advantage of that patchset, we need to map the HBA HW
>> queues to blk mq hctx's; to do that, we need to expose the HBA HW queues.
>>
>> In making that transition, the per-SCSI command request tags are no
>> longer unique per Scsi host - they are just unique per hctx. As such, the
>> HBA LLDD would have to generate this tag internally, which has a certain
>> performance overhead.
>>
>> However another problem is that blk mq assumes the host may accept
>> (Scsi_host.can_queue * #hw queue) commands. In [2], we removed the Scsi
>> host busy counter, which would stop the LLDD being sent more than
>> .can_queue commands; however, we should still ensure that the block layer
>> does not issue more than .can_queue commands to the Scsi host.
>>
>> To solve this problem, introduce a shared tags per blk_mq_tag_set, which
>> may be requested when allocating the tagset.
>>
>> New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
>> tagset.
>>
>> This is based on work originally from Ming Lei in [3].
>>
>> [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
>> [1] https://lore.kernel.org/linux-block/20191014015043.25029-1-ming.lei@redhat.com/
>> [2] https://lore.kernel.org/linux-scsi/20191025065855.6309-1-ming.lei@redhat.com/
>> [3] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
>>
>> Signed-off-by: John Garry <john.garry@huawei.com>
>> ---
>>   block/blk-core.c       |  1 +
>>   block/blk-flush.c      |  2 +
>>   block/blk-mq-debugfs.c |  2 +-
>>   block/blk-mq-tag.c     | 85 ++++++++++++++++++++++++++++++++++++++++++
>>   block/blk-mq-tag.h     |  1 +
>>   block/blk-mq.c         | 61 +++++++++++++++++++++++++-----
>>   block/blk-mq.h         |  9 +++++
>>   include/linux/blk-mq.h |  3 ++
>>   include/linux/blkdev.h |  1 +
>>   9 files changed, 155 insertions(+), 10 deletions(-)
>>
> [ .. ]
>> @@ -396,15 +398,17 @@ static struct request *blk_mq_get_request(struct request_queue *q,
>>   		blk_mq_tag_busy(data->hctx);
>>   	}
>>   
>> -	tag = blk_mq_get_tag(data);
>> -	if (tag == BLK_MQ_TAG_FAIL) {
>> -		if (clear_ctx_on_error)
>> -			data->ctx = NULL;
>> -		blk_queue_exit(q);
>> -		return NULL;
>> +	if (data->hctx->shared_tags) {
>> +		shared_tag = blk_mq_get_shared_tag(data);
>> +		if (shared_tag == BLK_MQ_TAG_FAIL)
>> +			goto err_shared_tag;
>>   	}
>>   
>> -	rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags, alloc_time_ns);
>> +	tag = blk_mq_get_tag(data);
>> +	if (tag == BLK_MQ_TAG_FAIL)
>> +		goto err_tag;
>> +
>> +	rq = blk_mq_rq_ctx_init(data, tag, shared_tag, data->cmd_flags, alloc_time_ns);
>>   	if (!op_is_flush(data->cmd_flags)) {
>>   		rq->elv.icq = NULL;
>>   		if (e && e->type->ops.prepare_request) {

Hi Hannes,

> Why do you need to keep a parallel tag accounting between 'normal' and
> 'shared' tags here?
> Isn't it sufficient to get a shared tag only, and use that in lieu of the
> 'real' one?

In theory, yes. Just the 'shared' tag should be adequate.

A problem I see with this approach is that we lose the identity of which 
tags are allocated for each hctx. As an example for this, consider 
blk_mq_queue_tag_busy_iter(), which iterates the bits for each hctx. 
Now, if you're just using shared tags only, that wouldn't work.

Consider blk_mq_can_queue() as another example - this tells us whether a 
hctx has any bits unset, while with only shared tags it would tell us 
whether any bits are unset over all queues, and this change in semantics 
could break things. At a glance, function __blk_mq_tag_idle() looks 
problematic also.

And this is where it becomes messy to implement.

> 
> I would love to combine both,

Same here...

> as then we can easily do a reverse mapping
> by using the 'tag' value to lookup the command itself, and can possibly
> do the 'scsi_cmd_priv' trick of embedding the LLDD-specific parts within
> the command. With this split we'll be wasting quite some memory there,
> as the possible 'tag' values are actually nr_hw_queues * shared_tags.

Yeah, understood. That's just a trade-off I saw.

Thanks,
John


* Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-13 14:57     ` John Garry
@ 2019-11-13 15:38       ` Hannes Reinecke
  2019-11-13 16:21         ` John Garry
  0 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2019-11-13 15:38 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang (M)

On 11/13/19 3:57 PM, John Garry wrote:
> On 13/11/2019 14:06, Hannes Reinecke wrote:
>> On 11/13/19 2:36 PM, John Garry wrote:
>>> Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
>>> multiple reply queues with single hostwide tags.
>>>
>>> In addition, these drivers want to use interrupt assignment in
>>> pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
>>> CPU hotplug may cause in-flight IO completion to not be serviced when an
>>> interrupt is shutdown.
>>>
>>> To solve that problem, Ming's patchset to drain hctx's should ensure no
>>> IOs are missed in-flight [1].
>>>
>>> However, to take advantage of that patchset, we need to map the HBA HW
>>> queues to blk mq hctx's; to do that, we need to expose the HBA HW
>>> queues.
>>>
>>> In making that transition, the per-SCSI command request tags are no
>>> longer unique per Scsi host - they are just unique per hctx. As such,
>>> the
>>> HBA LLDD would have to generate this tag internally, which has a certain
>>> performance overhead.
>>>
>>> However another problem is that blk mq assumes the host may accept
>>> (Scsi_host.can_queue * #hw queue) commands. In [2], we removed the Scsi
>>> host busy counter, which would stop the LLDD being sent more than
>>> .can_queue commands; however, we should still ensure that the block
>>> layer
>>> does not issue more than .can_queue commands to the Scsi host.
>>>
>>> To solve this problem, introduce a shared tags per blk_mq_tag_set, which
>>> may be requested when allocating the tagset.
>>>
>>> New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
>>> tagset.
>>>
>>> This is based on work originally from Ming Lei in [3].
>>>
>>> [0]
>>> https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
>>>
>>> [1]
>>> https://lore.kernel.org/linux-block/20191014015043.25029-1-ming.lei@redhat.com/
>>>
>>> [2]
>>> https://lore.kernel.org/linux-scsi/20191025065855.6309-1-ming.lei@redhat.com/
>>>
>>> [3]
>>> https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
>>>
>>>
>>> Signed-off-by: John Garry <john.garry@huawei.com>
>>> ---
>>>   block/blk-core.c       |  1 +
>>>   block/blk-flush.c      |  2 +
>>>   block/blk-mq-debugfs.c |  2 +-
>>>   block/blk-mq-tag.c     | 85 ++++++++++++++++++++++++++++++++++++++++++
>>>   block/blk-mq-tag.h     |  1 +
>>>   block/blk-mq.c         | 61 +++++++++++++++++++++++++-----
>>>   block/blk-mq.h         |  9 +++++
>>>   include/linux/blk-mq.h |  3 ++
>>>   include/linux/blkdev.h |  1 +
>>>   9 files changed, 155 insertions(+), 10 deletions(-)
>>>
>> [ .. ]
>>> @@ -396,15 +398,17 @@ static struct request
>>> *blk_mq_get_request(struct request_queue *q,
>>>           blk_mq_tag_busy(data->hctx);
>>>       }
>>>   -    tag = blk_mq_get_tag(data);
>>> -    if (tag == BLK_MQ_TAG_FAIL) {
>>> -        if (clear_ctx_on_error)
>>> -            data->ctx = NULL;
>>> -        blk_queue_exit(q);
>>> -        return NULL;
>>> +    if (data->hctx->shared_tags) {
>>> +        shared_tag = blk_mq_get_shared_tag(data);
>>> +        if (shared_tag == BLK_MQ_TAG_FAIL)
>>> +            goto err_shared_tag;
>>>       }
>>>   -    rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags,
>>> alloc_time_ns);
>>> +    tag = blk_mq_get_tag(data);
>>> +    if (tag == BLK_MQ_TAG_FAIL)
>>> +        goto err_tag;
>>> +
>>> +    rq = blk_mq_rq_ctx_init(data, tag, shared_tag, data->cmd_flags,
>>> alloc_time_ns);
>>>       if (!op_is_flush(data->cmd_flags)) {
>>>           rq->elv.icq = NULL;
>>>           if (e && e->type->ops.prepare_request) {
> 
> Hi Hannes,
> 
>> Why do you need to keep a parallel tag accounting between 'normal' and
>> 'shared' tags here?
>> Isn't it sufficient to get a shared tag only, and use that in lieu of the
>> 'real' one?
> 
> In theory, yes. Just the 'shared' tag should be adequate.
> 
> A problem I see with this approach is that we lose the identity of which
> tags are allocated for each hctx. As an example for this, consider
> blk_mq_queue_tag_busy_iter(), which iterates the bits for each hctx.
> Now, if you're just using shared tags only, that wouldn't work.
> 
> Consider blk_mq_can_queue() as another example - this tells us if a hctx
> has any bits unset, while with only using shared tags it would tell if
> any bits unset over all queues, and this change in semantics could break
> things. At a glance, function __blk_mq_tag_idle() looks problematic also.
> 
> And this is where it becomes messy to implement.
> 
Oh, my. Indeed, that's correct.

But then we don't really care _which_ shared tag is assigned; so
wouldn't we be better off by just having an atomic counter here?
Cache locality will be blown anyway ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer


* Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-13 15:38       ` Hannes Reinecke
@ 2019-11-13 16:21         ` John Garry
  2019-11-13 18:38           ` Hannes Reinecke
  0 siblings, 1 reply; 22+ messages in thread
From: John Garry @ 2019-11-13 16:21 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang (M)

On 13/11/2019 15:38, Hannes Reinecke wrote:
>>>> -        if (clear_ctx_on_error)
>>>> -            data->ctx = NULL;
>>>> -        blk_queue_exit(q);
>>>> -        return NULL;
>>>> +    if (data->hctx->shared_tags) {
>>>> +        shared_tag = blk_mq_get_shared_tag(data);
>>>> +        if (shared_tag == BLK_MQ_TAG_FAIL)
>>>> +            goto err_shared_tag;
>>>>        }
>>>>    -    rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags,
>>>> alloc_time_ns);
>>>> +    tag = blk_mq_get_tag(data);
>>>> +    if (tag == BLK_MQ_TAG_FAIL)
>>>> +        goto err_tag;
>>>> +
>>>> +    rq = blk_mq_rq_ctx_init(data, tag, shared_tag, data->cmd_flags,
>>>> alloc_time_ns);
>>>>        if (!op_is_flush(data->cmd_flags)) {
>>>>            rq->elv.icq = NULL;
>>>>            if (e && e->type->ops.prepare_request) {
>> Hi Hannes,
>>
>>> Why do you need to keep a parallel tag accounting between 'normal' and
>>> 'shared' tags here?
>>> Isn't it sufficient to get a shared tag only, and use that in lieu of the
>>> 'real' one?
>> In theory, yes. Just the 'shared' tag should be adequate.
>>
>> A problem I see with this approach is that we lose the identity of which
>> tags are allocated for each hctx. As an example for this, consider
>> blk_mq_queue_tag_busy_iter(), which iterates the bits for each hctx.
>> Now, if you're just using shared tags only, that wouldn't work.
>>
>> Consider blk_mq_can_queue() as another example - this tells us if a hctx
>> has any bits unset, while with only using shared tags it would tell if
>> any bits unset over all queues, and this change in semantics could break
>> things. At a glance, function __blk_mq_tag_idle() looks problematic also.
>>
>> And this is where it becomes messy to implement.
>>

Hi Hannes,

> Oh, my. Indeed, that's correct.

The tags could be kept in sync like this:

shared_tag = blk_mq_get_tag(shared_tagset);
if (shared_tag != -1)
	sbitmap_set(hctx->tags, shared_tag);

But that's obviously not ideal.

> 
> But then we don't really care _which_ shared tag is assigned; so
> wouldn't we be better off by just having an atomic counter here?
> Cache locality will be blown anyway ...
The atomic counter would stop us issuing more than Scsi_host.can_queue 
commands to the LLDD, but we still need a unique tag, which is what the 
shared tag is.

Thanks,
John


* Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-13 16:21         ` John Garry
@ 2019-11-13 18:38           ` Hannes Reinecke
  2019-11-14  9:41             ` John Garry
  0 siblings, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2019-11-13 18:38 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang (M)

On 11/13/19 5:21 PM, John Garry wrote:
> On 13/11/2019 15:38, Hannes Reinecke wrote:
>>>>> -        if (clear_ctx_on_error)
>>>>> -            data->ctx = NULL;
>>>>> -        blk_queue_exit(q);
>>>>> -        return NULL;
>>>>> +    if (data->hctx->shared_tags) {
>>>>> +        shared_tag = blk_mq_get_shared_tag(data);
>>>>> +        if (shared_tag == BLK_MQ_TAG_FAIL)
>>>>> +            goto err_shared_tag;
>>>>>        }
>>>>>    -    rq = blk_mq_rq_ctx_init(data, tag, data->cmd_flags,
>>>>> alloc_time_ns);
>>>>> +    tag = blk_mq_get_tag(data);
>>>>> +    if (tag == BLK_MQ_TAG_FAIL)
>>>>> +        goto err_tag;
>>>>> +
>>>>> +    rq = blk_mq_rq_ctx_init(data, tag, shared_tag, data->cmd_flags,
>>>>> alloc_time_ns);
>>>>>        if (!op_is_flush(data->cmd_flags)) {
>>>>>            rq->elv.icq = NULL;
>>>>>            if (e && e->type->ops.prepare_request) {
>>> Hi Hannes,
>>>
>>>> Why do you need to keep a parallel tag accounting between 'normal' and
>>>> 'shared' tags here?
>>>> Isn't it sufficient to get a shared tag only, and use that in lieu of
>>>> the 'real' one?
>>> In theory, yes. Just the 'shared' tag should be adequate.
>>>
>>> A problem I see with this approach is that we lose the identity of which
>>> tags are allocated for each hctx. As an example for this, consider
>>> blk_mq_queue_tag_busy_iter(), which iterates the bits for each hctx.
>>> Now, if you're just using shared tags only, that wouldn't work.
>>>
>>> Consider blk_mq_can_queue() as another example - this tells us if a hctx
>>> has any bits unset, while with only using shared tags it would tell if
>>> any bits unset over all queues, and this change in semantics could break
>>> things. At a glance, function __blk_mq_tag_idle() looks problematic 
>>> also.
>>>
>>> And this is where it becomes messy to implement.
>>>
> 
> Hi Hannes,
> 
>> Oh, my. Indeed, that's correct.
> 
> The tags could be kept in sync like this:
> 
> shared_tag = blk_mq_get_tag(shared_tagset);
> if (shared_tag != -1)
>      sbitmap_set(hctx->tags, shared_tag);
> 
> But that's obviously not ideal.
> 
Actually, I _do_ prefer keeping both in sync.
We might want to check if the 'normal' tag is set (typically it would 
not, but then, who knows ...)
The beauty here is that both 'shared' and 'normal' tag are in sync, so 
if a driver would be wanting to use the tag as index into a command 
array it can do so without any surprises.

Why do you think it's not ideal?

>>
>> But then we don't really care _which_ shared tag is assigned; so
>> wouldn't we be better off by just having an atomic counter here?
>> Cache locality will be blown anyway ...
> The atomic counter would solve the issuing more than Scsi_host.can_queue 
> to the LLDD, but we still need a unique tag, which is what the shared 
> tag is.
> 
Yeah, true. Daft idea :-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                              +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)


* Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-13 18:38           ` Hannes Reinecke
@ 2019-11-14  9:41             ` John Garry
  2019-11-15  5:30               ` Bart Van Assche
  2019-11-15  7:26               ` Hannes Reinecke
  0 siblings, 2 replies; 22+ messages in thread
From: John Garry @ 2019-11-14  9:41 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang (M)

On 13/11/2019 18:38, Hannes Reinecke wrote:
>> Hi Hannes,
>>
>>> Oh, my. Indeed, that's correct.
>>
>> The tags could be kept in sync like this:
>>
>> shared_tag = blk_mq_get_tag(shared_tagset);
>> if (shared_tag != -1)
>>      sbitmap_set(hctx->tags, shared_tag);
>>
>> But that's obviously not ideal.
>>
> Actually, I _do_ prefer keeping both in sync.
> We might want to check if the 'normal' tag is set (typically it would 
> not, but then, who knows ...)
> The beauty here is that both 'shared' and 'normal' tag are in sync, so 
> if a driver would be wanting to use the tag as index into a command 
> array it can do so without any surprises.
> 
> Why do you think it's not ideal?

A few points:
- Getting a bit from one tagset and then setting it in another tagset is 
a bit clunky.
- There may be an atomicity requirement between getting the shared tag bit 
and setting the hctx tag bit - I don't think the pair of operations is 
atomic.
- Consider that sometimes we may want to check if there is space on a hw 
queue - checking the hctx tags is not really proper any longer, as 
typically there would always be space on the hctx, but not always in the 
shared tags. We did delete blk_mq_can_queue() yesterday, which would be an 
example of that. Need to check if there are others.

Having said all that, the obvious advantages are the performance gain, 
that we can still use request.tag, and so maybe less intrusive changes.
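The sync-on-get scheme and the atomicity concern in the second point can be sketched in a userspace model. All names, the single-word bitmaps and the assert-on-mismatch behaviour are illustrative assumptions, not the blk-mq implementation:

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace model of the sync-on-get scheme: one hostwide shared tag
 * bitmap, mirrored into per-hctx bitmaps. */
#define TAG_DEPTH 64

static atomic_ulong shared_tags;	/* hostwide tag space */
static atomic_ulong hctx_tags[4];	/* per-hw-queue mirrors */

/* Claim a free bit in the shared map; each bit is taken atomically. */
static int get_shared_tag(void)
{
	for (int tag = 0; tag < TAG_DEPTH; tag++) {
		unsigned long bit = 1UL << tag;

		if (!(atomic_fetch_or(&shared_tags, bit) & bit))
			return tag;
	}
	return -1;
}

static int get_tag(int hctx)
{
	int tag = get_shared_tag();

	if (tag < 0)
		return -1;
	/*
	 * The atomicity gap: at this point the shared bit is owned but
	 * the hctx bit is not yet set, so anything scanning hctx_tags
	 * here sees the two maps out of sync.
	 */
	unsigned long old = atomic_fetch_or(&hctx_tags[hctx], 1UL << tag);

	/* The shared map is a superset of every hctx map, so the hctx
	 * bit must have been free; finding it set is a programming error. */
	assert(!(old & (1UL << tag)));
	return tag;
}

static void put_tag(int hctx, int tag)
{
	unsigned long bit = 1UL << tag;

	/* Release ordering: drop the hctx bit before the shared bit. */
	atomic_fetch_and(&hctx_tags[hctx], ~bit);
	atomic_fetch_and(&shared_tags, ~bit);
}
```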

I'll have a look at the implementation. The devil is mostly in the detail...

Thanks,
John

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-14  9:41             ` John Garry
@ 2019-11-15  5:30               ` Bart Van Assche
  2019-11-15  7:29                 ` Hannes Reinecke
  2019-11-15 10:24                 ` John Garry
  2019-11-15  7:26               ` Hannes Reinecke
  1 sibling, 2 replies; 22+ messages in thread
From: Bart Van Assche @ 2019-11-15  5:30 UTC (permalink / raw)
  To: John Garry, Hannes Reinecke, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare, chenxiang (M)

On 11/14/19 1:41 AM, John Garry wrote:
> On 13/11/2019 18:38, Hannes Reinecke wrote:
>>> Hi Hannes,
>>>
>>>> Oh, my. Indeed, that's correct.
>>>
>>> The tags could be kept in sync like this:
>>>
>>> shared_tag = blk_mq_get_tag(shared_tagset);
>>> if (shared_tag != -1)
>>>      sbitmap_set(hctx->tags, shared_tag);
>>>
>>> But that's obviously not ideal.
>>>
>> Actually, I _do_ prefer keeping both in sync.
>> We might want to check if the 'normal' tag is set (typically it would not, but then, who knows ...)
>> The beauty here is that both 'shared' and 'normal' tag are in sync, so if a driver would be wanting to use the tag as index into a command array it can do so without any surprises.
>>
>> Why do you think it's not ideal?
> 
> A few points:
> - Getting a bit from one tagset and then setting it in another tagset is a bit clunky.
> - There may be an atomicity of the getting the shared tag bit and setting the hctx tag bit - I don't think that there is.
> - Consider that sometimes we may want to check if there is space on a hw queue - checking the hctx tags is not really proper any longer, as typically there would always be space on hctx, but not always the shared tags. We did delete blk_mq_can_queue() yesterday, which
> would be an example of that. Need to check if there are others.
> 
> Having said all that, the obvious advantage is performance gain, can still use request.tag and so maybe less intrusive changes.
> 
> I'll have a look at the implementation. The devil is mostly in the detail...

Wouldn't that approach trigger a deadlock if it is attempted to allocate the last
tag from two different hardware queues? How about sharing tag sets across hardware
queues, e.g. like in the (totally untested) patch below?

Thanks,

Bart.

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index b3f2ba483992..3678e95ec947 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -211,8 +211,6 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
 #define HCTX_STATE_NAME(name) [BLK_MQ_S_##name] = #name
 static const char *const hctx_state_name[] = {
 	HCTX_STATE_NAME(STOPPED),
-	HCTX_STATE_NAME(TAG_ACTIVE),
-	HCTX_STATE_NAME(SCHED_RESTART),
 };
 #undef HCTX_STATE_NAME

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ca22afd47b3d..6262584dca09 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -64,18 +64,18 @@ void blk_mq_sched_assign_ioc(struct request *rq)
  */
 void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
 {
-	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
+	if (test_bit(BLK_MQ_T_SCHED_RESTART, &hctx->tags->state))
 		return;

-	set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+	set_bit(BLK_MQ_T_SCHED_RESTART, &hctx->tags->state);
 }
 EXPORT_SYMBOL_GPL(blk_mq_sched_mark_restart_hctx);

 void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
 {
-	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
+	if (!test_bit(BLK_MQ_T_SCHED_RESTART, &hctx->tags->state))
 		return;
-	clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+	clear_bit(BLK_MQ_T_SCHED_RESTART, &hctx->tags->state);

 	blk_mq_run_hw_queue(hctx, true);
 }
@@ -479,12 +479,15 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
 /* called in queue's release handler, tagset has gone away */
 static void blk_mq_sched_tags_teardown(struct request_queue *q)
 {
+	struct blk_mq_tags *sched_tags = NULL;
 	struct blk_mq_hw_ctx *hctx;
 	int i;

 	queue_for_each_hw_ctx(q, hctx, i) {
-		if (hctx->sched_tags) {
+		if (hctx->sched_tags != sched_tags) {
 			blk_mq_free_rq_map(hctx->sched_tags);
+			if (!sched_tags)
+				sched_tags = hctx->sched_tags;
 			hctx->sched_tags = NULL;
 		}
 	}
@@ -512,6 +515,10 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
 				   BLKDEV_MAX_RQ);

 	queue_for_each_hw_ctx(q, hctx, i) {
+		if (i > 0 && q->tag_set->share_tags) {
+			hctx->sched_tags = q->queue_hw_ctx[0]->sched_tags;
+			continue;
+		}
 		ret = blk_mq_sched_alloc_tags(q, hctx, i);
 		if (ret)
 			goto err;
@@ -556,8 +563,11 @@ void blk_mq_sched_free_requests(struct request_queue *q)
 	int i;

 	queue_for_each_hw_ctx(q, hctx, i) {
-		if (hctx->sched_tags)
+		if (hctx->sched_tags) {
 			blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i);
+			if (q->tag_set->share_tags)
+				break;
+		}
 	}
 }

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021fc3a11..15174a646468 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -82,7 +82,7 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)

 static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
 {
-	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
+	return test_bit(BLK_MQ_T_SCHED_RESTART, &hctx->tags->state);
 }

 #endif
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 586c9d6e904a..770fe2324230 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -23,8 +23,8 @@
  */
 bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
-	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
-	    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+	if (!test_bit(BLK_MQ_T_ACTIVE, &hctx->tags->state) &&
+	    !test_and_set_bit(BLK_MQ_T_ACTIVE, &hctx->tags->state))
 		atomic_inc(&hctx->tags->active_queues);

 	return true;
@@ -48,7 +48,7 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 {
 	struct blk_mq_tags *tags = hctx->tags;

-	if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+	if (!test_and_clear_bit(BLK_MQ_T_ACTIVE, &hctx->tags->state))
 		return;

 	atomic_dec(&tags->active_queues);
@@ -67,7 +67,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,

 	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
 		return true;
-	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
+	if (!test_bit(BLK_MQ_T_ACTIVE, &hctx->tags->state))
 		return true;

 	/*
@@ -220,7 +220,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * We can hit rq == NULL here, because the tagging functions
 	 * test and set the bit before assigning ->rqs[].
 	 */
-	if (rq && rq->q == hctx->queue)
+	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
 		return iter_data->fn(hctx, rq, iter_data->data, reserved);
 	return true;
 }
@@ -341,8 +341,11 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
 	int i;

 	for (i = 0; i < tagset->nr_hw_queues; i++) {
-		if (tagset->tags && tagset->tags[i])
+		if (tagset->tags && tagset->tags[i]) {
 			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
+			if (tagset->share_tags)
+				break;
+		}
 	}
 }
 EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index d0c10d043891..f75fa936b090 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -4,6 +4,11 @@

 #include "blk-mq.h"

+enum {
+	BLK_MQ_T_ACTIVE		= 1,
+	BLK_MQ_T_SCHED_RESTART	= 2,
+};
+
 /*
  * Tag address space map.
  */
@@ -11,6 +16,11 @@ struct blk_mq_tags {
 	unsigned int nr_tags;
 	unsigned int nr_reserved_tags;

+	/**
+	 * @state: BLK_MQ_T_* flags. Defines the state of the hw
+	 * queue (active, scheduled to restart).
+	 */
+	unsigned long	state;
 	atomic_t active_queues;

 	struct sbitmap_queue bitmap_tags;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index fec4b82ff91c..81d4d6a96098 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2404,6 +2404,12 @@ static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx)
 {
 	int ret = 0;

+	if (hctx_idx > 0 && set->share_tags) {
+		WARN_ON_ONCE(!set->tags[0]);
+		set->tags[hctx_idx] = set->tags[0];
+		return 0;
+	}
+
 	set->tags[hctx_idx] = blk_mq_alloc_rq_map(set, hctx_idx,
 					set->queue_depth, set->reserved_tags);
 	if (!set->tags[hctx_idx])
@@ -2423,8 +2429,10 @@ static void blk_mq_free_map_and_requests(struct blk_mq_tag_set *set,
 					 unsigned int hctx_idx)
 {
 	if (set->tags && set->tags[hctx_idx]) {
-		blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
-		blk_mq_free_rq_map(set->tags[hctx_idx]);
+		if (hctx_idx == 0 || !set->share_tags) {
+			blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
+			blk_mq_free_rq_map(set->tags[hctx_idx]);
+		}
 		set->tags[hctx_idx] = NULL;
 	}
 }
@@ -2568,7 +2576,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)

 	mutex_lock(&set->tag_list_lock);
 	list_del_rcu(&q->tag_set_list);
-	if (list_is_singular(&set->tag_list)) {
+	if (list_is_singular(&set->tag_list) && !set->share_tags) {
 		/* just transitioned to unshared */
 		set->flags &= ~BLK_MQ_F_TAG_SHARED;
 		/* update existing queue */
@@ -2586,7 +2594,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	/*
 	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
 	 */
-	if (!list_empty(&set->tag_list) &&
+	if ((!list_empty(&set->tag_list) || set->share_tags) &&
 	    !(set->flags & BLK_MQ_F_TAG_SHARED)) {
 		set->flags |= BLK_MQ_F_TAG_SHARED;
 		/* update existing queue */
@@ -2911,15 +2919,21 @@ static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
 {
 	int i;

-	for (i = 0; i < set->nr_hw_queues; i++)
-		if (!__blk_mq_alloc_rq_map(set, i))
+	for (i = 0; i < set->nr_hw_queues; i++) {
+		if (i > 0 && set->share_tags) {
+			set->tags[i] = set->tags[0];
+		} else if (!__blk_mq_alloc_rq_map(set, i))
 			goto out_unwind;
+	}

 	return 0;

 out_unwind:
-	while (--i >= 0)
+	while (--i >= 0) {
+		if (i > 0 && set->share_tags)
+			continue;
 		blk_mq_free_rq_map(set->tags[i]);
+	}

 	return -ENOMEM;
 }
@@ -3016,6 +3030,10 @@ static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
  * May fail with EINVAL for various error conditions. May adjust the
  * requested depth down, if it's too large. In that case, the set
  * value will be stored in set->queue_depth.
+ *
+ * @set: tag set for which to allocate tags.
+ * @share_tags: If true, allocate a single set of tags and share it across
+ *	hardware queues.
  */
 int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 {
@@ -3137,6 +3155,12 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 	queue_for_each_hw_ctx(q, hctx, i) {
 		if (!hctx->tags)
 			continue;
+		if (i > 0 && set->share_tags) {
+			hctx->tags[i] = hctx->tags[0];
+			if (hctx->sched_tags)
+				hctx->sched_tags[i] = hctx->sched_tags[0];
+			continue;
+		}
 		/*
 		 * If we're using an MQ scheduler, just update the scheduler
 		 * queue depth. This is similar to what the old code would do.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 11cfd6470b1a..dd5517476314 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -224,10 +224,13 @@ enum hctx_type {
  * @numa_node:	   NUMA node the storage adapter has been connected to.
  * @timeout:	   Request processing timeout in jiffies.
  * @flags:	   Zero or more BLK_MQ_F_* flags.
+ * @share_tags:	   Whether or not to share one tag set across hardware queues.
  * @driver_data:   Pointer to data owned by the block driver that created this
  *		   tag set.
- * @tags:	   Tag sets. One tag set per hardware queue. Has @nr_hw_queues
- *		   elements.
+ * @tags:	   Array of tag set pointers. Has @nr_hw_queues elements. If
+ *		   share_tags has not been set, all tag set pointers are
+ *		   different. If share_tags has been set, all tag_set pointers
+ *		   are identical.
  * @tag_list_lock: Serializes tag_list accesses.
  * @tag_list:	   List of the request queues that use this tag set. See also
  *		   request_queue.tag_set_list.
@@ -243,6 +246,7 @@ struct blk_mq_tag_set {
 	int			numa_node;
 	unsigned int		timeout;
 	unsigned int		flags;
+	bool			share_tags;
 	void			*driver_data;

 	struct blk_mq_tags	**tags;
@@ -394,8 +398,6 @@ enum {
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,

 	BLK_MQ_S_STOPPED	= 0,
-	BLK_MQ_S_TAG_ACTIVE	= 1,
-	BLK_MQ_S_SCHED_RESTART	= 2,

 	BLK_MQ_MAX_DEPTH	= 10240,



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-14  9:41             ` John Garry
  2019-11-15  5:30               ` Bart Van Assche
@ 2019-11-15  7:26               ` Hannes Reinecke
  2019-11-15 10:46                 ` John Garry
  1 sibling, 1 reply; 22+ messages in thread
From: Hannes Reinecke @ 2019-11-15  7:26 UTC (permalink / raw)
  To: John Garry, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang (M)

On 11/14/19 10:41 AM, John Garry wrote:
> On 13/11/2019 18:38, Hannes Reinecke wrote:
>>> Hi Hannes,
>>>
>>>> Oh, my. Indeed, that's correct.
>>>
>>> The tags could be kept in sync like this:
>>>
>>> shared_tag = blk_mq_get_tag(shared_tagset);
>>> if (shared_tag != -1)
>>>      sbitmap_set(hctx->tags, shared_tag);
>>>
>>> But that's obviously not ideal.
>>>
>> Actually, I _do_ prefer keeping both in sync.
>> We might want to check if the 'normal' tag is set (typically it would
>> not, but then, who knows ...)
>> The beauty here is that both 'shared' and 'normal' tag are in sync, so
>> if a driver would be wanting to use the tag as index into a command
>> array it can do so without any surprises.
>>
>> Why do you think it's not ideal?
> 
> A few points:
> - Getting a bit from one tagset and then setting it in another tagset is
> a bit clunky.
Yes, that's true.
But painstakingly trying to find a free bit in a bitmask when we already
know which to pick is also a bit daft.

> - There may be an atomicity of the getting the shared tag bit and
> setting the hctx tag bit - I don't think that there is.

That was precisely what I alluded to with 'We might want to check if
the normal tag is set'.
Typically the 'normal' tag would be free (as the shared tag set out of
necessity needs to be the combination of all hctx tag sets).
Any difference here _is_ a programming error, and should be flagged as
such (sbitmap_test_and_set() anyone?)
We might have ordering issues on release, as we really should drop the
hctx tag before the shared tag; but as long as we observe that ordering
we should be fine.

> - Consider that sometimes we may want to check if there is space on a hw
> queue - checking the hctx tags is not really proper any longer, as
> typically there would always be space on hctx, but not always the shared
> tags. We did delete blk_mq_can_queue() yesterday, which would be an
> example of that. Need to check if there are others.
> 
Clearly, this needs an audit of all functions accessing the hctx tag
space; maybe it's worth having a pre-requisite patchset differentiating
between hctx tags and global, shared tags. Hmm.

> Having said all that, the obvious advantage is performance gain, can
> still use request.tag and so maybe less intrusive changes.
> 
> I'll have a look at the implementation. The devil is mostly in the
> detail...
> 
True.
And, incidentally, if we run with shared tags we can skip the scheduling
section in blk_mq_get_tag(); if we're out of tags, we're out of tags,
and no rescheduling will help as we don't _have_ other tagsets to look at.
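A minimal sketch of that short-circuit (the counters, the toy allocator and all names are made up for illustration; the real slow path in blk_mq_get_tag() is more involved):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: free-tag counters standing in for per-hctx tag spaces.
 * With a hostwide tag set, index 0 stands for the single shared space. */
static int free_tags[4];

static int try_hctx(int i)
{
	if (free_tags[i] > 0)
		return --free_tags[i];	/* "allocate" a tag */
	return -1;
}

static int alloc_tag(bool host_tagset, int nr_hw_queues)
{
	for (int i = 0; i < nr_hw_queues; i++) {
		int tag = try_hctx(i);

		if (tag >= 0)
			return tag;
		if (host_tagset)
			break;	/* the one shared space is empty; no
				 * other tagset exists to look at */
	}
	return -1;
}
```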

But overall I like this approach.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-15  5:30               ` Bart Van Assche
@ 2019-11-15  7:29                 ` Hannes Reinecke
  2019-11-15 10:24                 ` John Garry
  1 sibling, 0 replies; 22+ messages in thread
From: Hannes Reinecke @ 2019-11-15  7:29 UTC (permalink / raw)
  To: Bart Van Assche, John Garry, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare, chenxiang (M)

On 11/15/19 6:30 AM, Bart Van Assche wrote:
> On 11/14/19 1:41 AM, John Garry wrote:
>> On 13/11/2019 18:38, Hannes Reinecke wrote:
>>>> Hi Hannes,
>>>>
>>>>> Oh, my. Indeed, that's correct.
>>>>
>>>> The tags could be kept in sync like this:
>>>>
>>>> shared_tag = blk_mq_get_tag(shared_tagset);
>>>> if (shared_tag != -1)
>>>>      sbitmap_set(hctx->tags, shared_tag);
>>>>
>>>> But that's obviously not ideal.
>>>>
>>> Actually, I _do_ prefer keeping both in sync.
>>> We might want to check if the 'normal' tag is set (typically it would not, but then, who knows ...)
>>> The beauty here is that both 'shared' and 'normal' tag are in sync, so if a driver would be wanting to use the tag as index into a command array it can do so without any surprises.
>>>
>>> Why do you think it's not ideal?
>>
>> A few points:
>> - Getting a bit from one tagset and then setting it in another tagset is a bit clunky.
>> - There may be an atomicity of the getting the shared tag bit and setting the hctx tag bit - I don't think that there is.
>> - Consider that sometimes we may want to check if there is space on a hw queue - checking the hctx tags is not really proper any longer, as typically there would always be space on hctx, but not always the shared tags. We did delete blk_mq_can_queue() yesterday, which
>> would be an example of that. Need to check if there are others.
>>
>> Having said all that, the obvious advantage is performance gain, can still use request.tag and so maybe less intrusive changes.
>>
>> I'll have a look at the implementation. The devil is mostly in the detail...
> 
> Wouldn't that approach trigger a deadlock if it is attempted to allocate the last
> tag from two different hardware queues? How about sharing tag sets across hardware
> queues, e.g. like in the (totally untested) patch below?
> 
Why should it?
The shared tag map determines which tag should be allocated in the
per-hctx map, and as the former is a strict superset of all hctx maps
the bit _has_ to be free in the hctx map.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@suse.de			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-15  5:30               ` Bart Van Assche
  2019-11-15  7:29                 ` Hannes Reinecke
@ 2019-11-15 10:24                 ` John Garry
  2019-11-15 17:57                   ` Bart Van Assche
  1 sibling, 1 reply; 22+ messages in thread
From: John Garry @ 2019-11-15 10:24 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare, chenxiang (M)

>>> Actually, I _do_ prefer keeping both in sync.
>>> We might want to check if the 'normal' tag is set (typically it would not, but then, who knows ...)
>>> The beauty here is that both 'shared' and 'normal' tag are in sync, so if a driver would be wanting to use the tag as index into a command array it can do so without any surprises.
>>>
>>> Why do you think it's not ideal?
>>
>> A few points:
>> - Getting a bit from one tagset and then setting it in another tagset is a bit clunky.
>> - There may be an atomicity of the getting the shared tag bit and setting the hctx tag bit - I don't think that there is.
>> - Consider that sometimes we may want to check if there is space on a hw queue - checking the hctx tags is not really proper any longer, as typically there would always be space on hctx, but not always the shared tags. We did delete blk_mq_can_queue() yesterday, which
>> would be an example of that. Need to check if there are others.
>>
>> Having said all that, the obvious advantage is performance gain, can still use request.tag and so maybe less intrusive changes.
>>
>> I'll have a look at the implementation. The devil is mostly in the detail...
> 

Hi Bart,

> Wouldn't that approach trigger a deadlock if it is attempted to allocate the last
> tag from two different hardware queues?

I see Hannes answered this one.

> How about sharing tag sets across hardware
> queues, e.g. like in the (totally untested) patch below?

So this is similar in principle what Ming Lei came up with here:
https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/

However your implementation looks neater, which is good.

My concern with this approach is that we can't differentiate which tags 
are allocated for which hctx, and sometimes we need to know that.

An example here was blk_mq_queue_tag_busy_iter(), which iterates the 
bits for each hctx. This would just be broken by that change, unless we 
record which bits are associated with each hctx.

Another example was __blk_mq_tag_idle(), which looks problematic.
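The kind of per-hctx filtering this would need can be sketched as follows. Only the `rq->mq_hctx == hctx` idea from Bart's bt_iter() change is taken from the patch; the toy types and names are assumptions for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_req {
	int hctx_idx;	/* which hw queue this request was allocated for */
};

typedef bool (*busy_fn)(struct toy_req *rq);

/* Walk a (shared) rqs[] array but only visit requests of one hctx. */
static int busy_iter_hctx(struct toy_req **rqs, int depth, int hctx_idx,
			  busy_fn fn)
{
	int visited = 0;

	for (int i = 0; i < depth; i++) {
		struct toy_req *rq = rqs[i];

		/* rq may be NULL: a bit can be set before ->rqs[] is
		 * assigned; and with a shared tag space the request may
		 * belong to another hw queue, so filter it out. */
		if (!rq || rq->hctx_idx != hctx_idx)
			continue;
		visited++;
		if (!fn(rq))
			break;
	}
	return visited;
}

/* Default visitor: keep going, so the caller just counts matches. */
static bool visit_all(struct toy_req *rq)
{
	(void)rq;
	return true;
}
```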

> 
> Thanks,
> 
> Bart.
>
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c

For debugfs, when we examine 
/sys/kernel/debug/block/.../hctxX/tags_bitmap, wouldn't that now show 
the same tags for every hctx (i.e. hctx0's)?

For debugging reasons, I would say we want to know which tags are 
allocated for a specific hctx, as this is tightly related to the 
requests for that hctx.

> index b3f2ba483992..3678e95ec947 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -211,8 +211,6 @@ static const struct blk_mq_debugfs_attr blk_mq_debugfs_queue_attrs[] = {
>   #define HCTX_STATE_NAME(name) [BLK_MQ_S_##name] = #name
>   static const char *const hctx_state_name[] = {
>   	HCTX_STATE_NAME(STOPPED),
> -	HCTX_STATE_NAME(TAG_ACTIVE),
> -	HCTX_STATE_NAME(SCHED_RESTART),
>   };
>   #undef HCTX_STATE_NAME
> 
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index ca22afd47b3d..6262584dca09 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -64,18 +64,18 @@ void blk_mq_sched_assign_ioc(struct request *rq)
>    */
>   void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
>   {
> -	if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
> +	if (test_bit(BLK_MQ_T_SCHED_RESTART, &hctx->tags->state))
>   		return;
> 
> -	set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> +	set_bit(BLK_MQ_T_SCHED_RESTART, &hctx->tags->state);
>   }
>   EXPORT_SYMBOL_GPL(blk_mq_sched_mark_restart_hctx);
> 
>   void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
>   {
> -	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
> +	if (!test_bit(BLK_MQ_T_SCHED_RESTART, &hctx->tags->state))
>   		return;
> -	clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> +	clear_bit(BLK_MQ_T_SCHED_RESTART, &hctx->tags->state);
> 
>   	blk_mq_run_hw_queue(hctx, true);
>   }
> @@ -479,12 +479,15 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
>   /* called in queue's release handler, tagset has gone away */
>   static void blk_mq_sched_tags_teardown(struct request_queue *q)
>   {
> +	struct blk_mq_tags *sched_tags = NULL;
>   	struct blk_mq_hw_ctx *hctx;
>   	int i;
> 
>   	queue_for_each_hw_ctx(q, hctx, i) {
> -		if (hctx->sched_tags) {
> +		if (hctx->sched_tags != sched_tags) {
>   			blk_mq_free_rq_map(hctx->sched_tags);
> +			if (!sched_tags)
> +				sched_tags = hctx->sched_tags;
>   			hctx->sched_tags = NULL;
>   		}
>   	}
> @@ -512,6 +515,10 @@ int blk_mq_init_sched(struct request_queue *q, struct elevator_type *e)
>   				   BLKDEV_MAX_RQ);
> 
>   	queue_for_each_hw_ctx(q, hctx, i) {
> +		if (i > 0 && q->tag_set->share_tags) {
> +			hctx->sched_tags = q->queue_hw_ctx[0]->sched_tags;
> +			continue;
> +		}
>   		ret = blk_mq_sched_alloc_tags(q, hctx, i);
>   		if (ret)
>   			goto err;
> @@ -556,8 +563,11 @@ void blk_mq_sched_free_requests(struct request_queue *q)
>   	int i;
> 
>   	queue_for_each_hw_ctx(q, hctx, i) {
> -		if (hctx->sched_tags)
> +		if (hctx->sched_tags) {
>   			blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i);
> +			if (q->tag_set->share_tags)
> +				break;
> +		}
>   	}
>   }
> 
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 126021fc3a11..15174a646468 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -82,7 +82,7 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
> 
>   static inline bool blk_mq_sched_needs_restart(struct blk_mq_hw_ctx *hctx)
>   {
> -	return test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> +	return test_bit(BLK_MQ_T_SCHED_RESTART, &hctx->tags->state);
>   }
> 
>   #endif
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 586c9d6e904a..770fe2324230 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -23,8 +23,8 @@
>    */
>   bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>   {
> -	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
> -	    !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +	if (!test_bit(BLK_MQ_T_ACTIVE, &hctx->tags->state) &&
> +	    !test_and_set_bit(BLK_MQ_T_ACTIVE, &hctx->tags->state))
>   		atomic_inc(&hctx->tags->active_queues);
> 
>   	return true;
> @@ -48,7 +48,7 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>   {
>   	struct blk_mq_tags *tags = hctx->tags;
> 
> -	if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +	if (!test_and_clear_bit(BLK_MQ_T_ACTIVE, &hctx->tags->state))
>   		return;
> 
>   	atomic_dec(&tags->active_queues);
> @@ -67,7 +67,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
> 
>   	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_SHARED))
>   		return true;
> -	if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> +	if (!test_bit(BLK_MQ_T_ACTIVE, &hctx->tags->state))
>   		return true;
> 
>   	/*
> @@ -220,7 +220,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>   	 * We can hit rq == NULL here, because the tagging functions
>   	 * test and set the bit before assigning ->rqs[].
>   	 */
> -	if (rq && rq->q == hctx->queue)
> +	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
>   		return iter_data->fn(hctx, rq, iter_data->data, reserved);
>   	return true;
>   }
> @@ -341,8 +341,11 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>   	int i;
> 
>   	for (i = 0; i < tagset->nr_hw_queues; i++) {
> -		if (tagset->tags && tagset->tags[i])
> +		if (tagset->tags && tagset->tags[i]) {
>   			blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);

As I mentioned earlier, wouldn't this iterate over all tags for all 
hctx's, when we just want the tags for hctx[i]?

Thanks,
John

[Not trimming reply for future reference]

> +			if (tagset->share_tags)
> +				break;
> +		}
>   	}
>   }
>   EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
> index d0c10d043891..f75fa936b090 100644
> --- a/block/blk-mq-tag.h
> +++ b/block/blk-mq-tag.h
> @@ -4,6 +4,11 @@
> 
>   #include "blk-mq.h"
> 
> +enum {
> +	BLK_MQ_T_ACTIVE		= 1,
> +	BLK_MQ_T_SCHED_RESTART	= 2,
> +};
> +
>   /*
>    * Tag address space map.
>    */
> @@ -11,6 +16,11 @@ struct blk_mq_tags {
>   	unsigned int nr_tags;
>   	unsigned int nr_reserved_tags;
> 
> +	/**
> +	 * @state: BLK_MQ_T_* flags. Defines the state of the hw
> +	 * queue (active, scheduled to restart).
> +	 */
> +	unsigned long	state;
>   	atomic_t active_queues;
> 
>   	struct sbitmap_queue bitmap_tags;
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index fec4b82ff91c..81d4d6a96098 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2404,6 +2404,12 @@ static bool __blk_mq_alloc_rq_map(struct blk_mq_tag_set *set, int hctx_idx)
>   {
>   	int ret = 0;
> 
> +	if (hctx_idx > 0 && set->share_tags) {
> +		WARN_ON_ONCE(!set->tags[0]);
> +		set->tags[hctx_idx] = set->tags[0];
> +		return 0;
> +	}
> +
>   	set->tags[hctx_idx] = blk_mq_alloc_rq_map(set, hctx_idx,
>   					set->queue_depth, set->reserved_tags);
>   	if (!set->tags[hctx_idx])
> @@ -2423,8 +2429,10 @@ static void blk_mq_free_map_and_requests(struct blk_mq_tag_set *set,
>   					 unsigned int hctx_idx)
>   {
>   	if (set->tags && set->tags[hctx_idx]) {
> -		blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
> -		blk_mq_free_rq_map(set->tags[hctx_idx]);
> +		if (hctx_idx == 0 || !set->share_tags) {
> +			blk_mq_free_rqs(set, set->tags[hctx_idx], hctx_idx);
> +			blk_mq_free_rq_map(set->tags[hctx_idx]);
> +		}
>   		set->tags[hctx_idx] = NULL;
>   	}
>   }
> @@ -2568,7 +2576,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
> 
>   	mutex_lock(&set->tag_list_lock);
>   	list_del_rcu(&q->tag_set_list);
> -	if (list_is_singular(&set->tag_list)) {
> +	if (list_is_singular(&set->tag_list) && !set->share_tags) {
>   		/* just transitioned to unshared */
>   		set->flags &= ~BLK_MQ_F_TAG_SHARED;
>   		/* update existing queue */
> @@ -2586,7 +2594,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
>   	/*
>   	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
>   	 */
> -	if (!list_empty(&set->tag_list) &&
> +	if ((!list_empty(&set->tag_list) || set->share_tags) &&
>   	    !(set->flags & BLK_MQ_F_TAG_SHARED)) {
>   		set->flags |= BLK_MQ_F_TAG_SHARED;
>   		/* update existing queue */
> @@ -2911,15 +2919,21 @@ static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
>   {
>   	int i;
> 
> -	for (i = 0; i < set->nr_hw_queues; i++)
> -		if (!__blk_mq_alloc_rq_map(set, i))
> +	for (i = 0; i < set->nr_hw_queues; i++) {
> +		if (i > 0 && set->share_tags) {
> +			set->tags[i] = set->tags[0];
> +		} else if (!__blk_mq_alloc_rq_map(set, i))
>   			goto out_unwind;
> +	}
> 
>   	return 0;
> 
>   out_unwind:
> -	while (--i >= 0)
> +	while (--i >= 0) {
> +		if (i > 0 && set->share_tags)
> +			continue;
>   		blk_mq_free_rq_map(set->tags[i]);
> +	}
> 
>   	return -ENOMEM;
>   }
> @@ -3016,6 +3030,10 @@ static int blk_mq_realloc_tag_set_tags(struct blk_mq_tag_set *set,
>    * May fail with EINVAL for various error conditions. May adjust the
>    * requested depth down, if it's too large. In that case, the set
>    * value will be stored in set->queue_depth.
> + *
> + * @set: tag set for which to allocate tags.
> + * @share_tags: If true, allocate a single set of tags and share it across
> + *	hardware queues.
>    */
>   int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
>   {
> @@ -3137,6 +3155,12 @@ int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
>   	queue_for_each_hw_ctx(q, hctx, i) {
>   		if (!hctx->tags)
>   			continue;
> +		if (i > 0 && set->share_tags) {
> +			hctx->tags[i] = hctx->tags[0];
> +			if (hctx->sched_tags)
> +				hctx->sched_tags[i] = hctx->sched_tags[0];
> +			continue;
> +		}
>   		/*
>   		 * If we're using an MQ scheduler, just update the scheduler
>   		 * queue depth. This is similar to what the old code would do.
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 11cfd6470b1a..dd5517476314 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -224,10 +224,13 @@ enum hctx_type {
>    * @numa_node:	   NUMA node the storage adapter has been connected to.
>    * @timeout:	   Request processing timeout in jiffies.
>    * @flags:	   Zero or more BLK_MQ_F_* flags.
> + * @share_tags:	   Whether or not to share one tag set across hardware queues.
>    * @driver_data:   Pointer to data owned by the block driver that created this
>    *		   tag set.
> - * @tags:	   Tag sets. One tag set per hardware queue. Has @nr_hw_queues
> - *		   elements.
> + * @tags:	   Array of tag set pointers. Has @nr_hw_queues elements. If
> + *		   share_tags has not been set, all tag set pointers are
> + *		   different. If share_tags has been set, all tag_set pointers
> + *		   are identical.
>    * @tag_list_lock: Serializes tag_list accesses.
>    * @tag_list:	   List of the request queues that use this tag set. See also
>    *		   request_queue.tag_set_list.
> @@ -243,6 +246,7 @@ struct blk_mq_tag_set {
>   	int			numa_node;
>   	unsigned int		timeout;
>   	unsigned int		flags;
> +	bool			share_tags;
>   	void			*driver_data;
> 
>   	struct blk_mq_tags	**tags;
> @@ -394,8 +398,6 @@ enum {
>   	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
> 
>   	BLK_MQ_S_STOPPED	= 0,
> -	BLK_MQ_S_TAG_ACTIVE	= 1,
> -	BLK_MQ_S_SCHED_RESTART	= 2,
> 
>   	BLK_MQ_MAX_DEPTH	= 10240,
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-15  7:26               ` Hannes Reinecke
@ 2019-11-15 10:46                 ` John Garry
  0 siblings, 0 replies; 22+ messages in thread
From: John Garry @ 2019-11-15 10:46 UTC (permalink / raw)
  To: Hannes Reinecke, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare,
	bvanassche, chenxiang (M)

On 15/11/2019 07:26, Hannes Reinecke wrote:
> On 11/14/19 10:41 AM, John Garry wrote:
>> On 13/11/2019 18:38, Hannes Reinecke wrote:
>>>> Hi Hannes,
>>>>
>>>>> Oh, my. Indeed, that's correct.
>>>>
>>>> The tags could be kept in sync like this:
>>>>
>>>> shared_tag = blk_mq_get_tag(shared_tagset);
>>>> if (shared_tag != -1)
>>>>       sbitmap_set(hctx->tags, shared_tag);
>>>>
>>>> But that's obviously not ideal.
>>>>
>>> Actually, I _do_ prefer keeping both in sync.
>>> We might want to check if the 'normal' tag is set (typically it would
>>> not, but then, who knows ...)
>>> The beauty here is that both 'shared' and 'normal' tag are in sync, so
>>> if a driver would be wanting to use the tag as index into a command
>>> array it can do so without any surprises.
>>>
>>> Why do you think it's not ideal?
>>
>> A few points:
>> - Getting a bit from one tagset and then setting it in another tagset is
>> a bit clunky.
> Yes, that's true.
> But painstakingly trying to find a free bit in a bitmask when we already
> know which to pick is also a bit daft.

Yeah, but it's not all good - there would still be a certain overhead 
from atomically setting and clearing the bit in the hctx sbitmap. 
However, it should still be faster, as it avoids the search.
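To make the trade-off concrete, here is a toy userspace model of the 
scheme being discussed (this is not the kernel sbitmap API; the names 
and the 64-bit-word bitmaps are purely illustrative): allocating from 
the hostwide tag space requires a search, but keeping the per-hctx 
bitmap in sync is a single atomic OR with no second search.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEPTH 64

/* Hypothetical userspace model: a hostwide shared bitmap is searched
 * for a free tag; the per-hctx bitmap is then updated with one atomic
 * op, no second search. */
static atomic_ulong shared_tags;      /* hostwide tag space */
static atomic_ulong hctx_tags[4];     /* one bitmap per hw queue */

static int get_shared_tag(void)
{
	for (int bit = 0; bit < DEPTH; bit++) {
		if (atomic_load(&shared_tags) & (1UL << bit))
			continue;
		/* try to claim this bit */
		if (atomic_fetch_or(&shared_tags, 1UL << bit) & (1UL << bit))
			continue;	/* lost the race, keep searching */
		return bit;
	}
	return -1;			/* out of tags */
}

static void sync_hctx_tag(int hctx, int tag)
{
	/* keep the per-hctx bitmap in sync: no search needed */
	atomic_fetch_or(&hctx_tags[hctx], 1UL << tag);
}

static void put_tag(int hctx, int tag)
{
	/* drop the hctx tag before the shared tag, per the ordering
	 * concern raised above */
	atomic_fetch_and(&hctx_tags[hctx], ~(1UL << tag));
	atomic_fetch_and(&shared_tags, ~(1UL << tag));
}
```

The point of the sketch is only that the second bitmap costs one atomic 
RMW per allocate/free, while the search cost stays confined to the 
shared bitmap.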

> 
>> - There may be an atomicity of the getting the shared tag bit and
>> setting the hctx tag bit - I don't think that there is.
> 
> That was precisely what I've alluded to in 'We might want to check if
> the normal tag is set'.
> Typically the 'normal' tag would be free (as the shared tag set out of
> necessity needs to be the combination of all hctx tag sets).

Right

> Any difference here _is_ a programming error, and should be flagged as
> such (sbitmap_test_and_set() anyone?)

Eh, I hope that we wouldn't need this.

> We might have ordering issues on release, as we really should drop the
> hctx tag before the shared tag; but when we observe that we should be fine.
> 
>> - Consider that sometimes we may want to check if there is space on a hw
>> queue - checking the hctx tags is not really proper any longer, as
>> typically there would always be space on hctx, but not always the shared
>> tags. We did delete blk_mq_can_queue() yesterday, which would be an
>> example of that. Need to check if there are others.
>>
> Clearly, this needs an audit of all functions accessing the hctx tag
> space; maybe it's worth having a pre-requisite patchset differentiating
> between hctx tags and global, shared tags. Hmm.
> 
>> Having said all that, the obvious advantage is performance gain, can
>> still use request.tag and so maybe less intrusive changes.
>>
>> I'll have a look at the implementation. The devil is mostly in the
>> detail...
>>
> True.
> And, incidentally, if we run with shared tage we can skip the scheduling
> section in blk_mq_get_tag(); if we're out of tags, we're out of tags,

Right, but when free tags are exhausted, don't we then need to "kick" 
all hw queues, instead of just the current hctx in blk_mq_get_tag()?

> and no rescheduling will help as we don't _have_ other tagsets to look at.
> 
> But overall I like this approach.
> 

Yeah, to me it seems sensible. Again, a neat implementation is the 
challenge.

I'll post an RFC v2 for this one.

I am also requesting some performance figures internally.

Thanks,
John

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-15 10:24                 ` John Garry
@ 2019-11-15 17:57                   ` Bart Van Assche
  2019-11-18 10:31                     ` John Garry
  0 siblings, 1 reply; 22+ messages in thread
From: Bart Van Assche @ 2019-11-15 17:57 UTC (permalink / raw)
  To: John Garry, Hannes Reinecke, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare, chenxiang (M)

On 11/15/19 2:24 AM, John Garry wrote:
> Bart Van Assche wrote:
> > How about sharing tag sets across hardware
> > queues, e.g. like in the (totally untested) patch below?
> 
> So this is similar in principle to what Ming Lei came up with here:
> https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/ 
> 
> However your implementation looks neater, which is good.
> 
> My concern with this approach is that we can't differentiate which tags 
> are allocated for which hctx, and sometimes we need to know that.
> 
> An example here was blk_mq_queue_tag_busy_iter(), which iterates the 
> bits for each hctx. This would just be broken by that change, unless we 
> record which bits are associated with each hctx.

I disagree. In bt_iter() I added " && rq->mq_hctx == hctx" such that 
blk_mq_queue_tag_busy_iter() only calls the callback function for 
matching (hctx, rq) pairs.
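For illustration, a minimal userspace model of the filtering described 
above (simplified stand-in structs, not the real blk-mq types): all 
requests live in one shared tag space, but the iterator only invokes 
the callback for requests whose owner matches the hctx being iterated.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the real blk-mq types. */
struct hctx { int id; };
struct request { struct hctx *mq_hctx; int tag; };

typedef void (*busy_iter_fn)(struct hctx *, struct request *, void *);

/* Model of bt_iter() with the "&& rq->mq_hctx == hctx" filter. */
static void bt_iter_model(struct request **rqs, size_t n,
			  struct hctx *hctx, busy_iter_fn fn, void *priv)
{
	for (size_t i = 0; i < n; i++)
		if (rqs[i] && rqs[i]->mq_hctx == hctx)
			fn(hctx, rqs[i], priv);
}

/* Example callback: count busy requests for one hctx. */
static void count_cb(struct hctx *hctx, struct request *rq, void *priv)
{
	(void)hctx;
	(void)rq;
	++*(int *)priv;
}
```

Even though every hctx walks the same shared bitmap, each request is 
reported against exactly one hctx.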

> Another example was __blk_mq_tag_idle(), which looks problematic.

Please elaborate.

> For debugfs, when we examine 
> /sys/kernel/debug/block/.../hctxX/tags_bitmap, wouldn't that be the tags 
> for all hctx (hctx0)?
> 
> For debugging reasons, I would say we want to know which tags are 
> allocated for a specific hctx, as this is tightly related to the 
> requests for that hctx.

That is an open issue in the patch I posted and something that needs to 
be addressed. One way to address this is to change the 
sbitmap_bitmap_show() calls into calls to a function that only shows 
those bits for which rq->mq_hctx == hctx.

>> @@ -341,8 +341,11 @@ void blk_mq_tagset_busy_iter(struct 
>> blk_mq_tag_set *tagset,
>>       int i;
>>
>>       for (i = 0; i < tagset->nr_hw_queues; i++) {
>> -        if (tagset->tags && tagset->tags[i])
>> +        if (tagset->tags && tagset->tags[i]) {
>>               blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
> 
> As I mentioned earlier, wouldn't this iterate over all tags for all 
> hctx's, when we just want the tags for hctx[i]?
> 
> Thanks,
> John
> 
> [Not trimming reply for future reference]
> 
>> +            if (tagset->share_tags)
>> +                break;
>> +        }
>>       }
>>   }
>>   EXPORT_SYMBOL(blk_mq_tagset_busy_iter);

Since blk_mq_tagset_busy_iter() loops over all hardware queues, all 
that changes is the order in which requests are examined. I am not 
aware of any block driver that calls blk_mq_tagset_busy_iter() and 
that depends on the order of the requests passed to the callback 
function.
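A toy model may make the effect of the quoted hunk clearer (hypothetical 
names, not the real structures): when share_tags is set, all tags[] 
entries alias one tag set, so iterating only the first entry and 
breaking visits the tag space exactly once instead of nr_hw_queues 
times.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the tagset iteration with the share_tags early break. */
struct tagset {
	int nr_hw_queues;
	bool share_tags;
	int *tags[4];	/* per-queue tag sets; all alias tags[0] when shared */
};

/* Stands in for blk_mq_tagset_busy_iter(): returns how many times the
 * underlying tag set would be walked. */
static int count_iterations(struct tagset *set)
{
	int visits = 0;

	for (int i = 0; i < set->nr_hw_queues; i++) {
		if (set->tags[i]) {
			visits++;	/* blk_mq_all_tag_busy_iter() here */
			if (set->share_tags)
				break;
		}
	}
	return visits;
}
```

Without the break, a shared tag set would be walked once per hardware 
queue and each request reported nr_hw_queues times.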

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-15 17:57                   ` Bart Van Assche
@ 2019-11-18 10:31                     ` John Garry
  2019-11-19  9:26                       ` John Garry
  0 siblings, 1 reply; 22+ messages in thread
From: John Garry @ 2019-11-18 10:31 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare, chenxiang (M)

On 15/11/2019 17:57, Bart Van Assche wrote:
> On 11/15/19 2:24 AM, John Garry wrote:
>> Bart Van Assche wrote:
>> > How about sharing tag sets across hardware
>> > queues, e.g. like in the (totally untested) patch below?
>>
>> So this is similar in principle to what Ming Lei came up with here:
>> https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/ 
>>
>> However your implementation looks neater, which is good.
>>
>> My concern with this approach is that we can't differentiate which 
>> tags are allocated for which hctx, and sometimes we need to know that.
>>

Hi Bart,

>> An example here was blk_mq_queue_tag_busy_iter(), which iterates the 
>> bits for each hctx. This would just be broken by that change, unless 
>> we record which bits are associated with each hctx.
> 
> I disagree. In bt_iter() I added " && rq->mq_hctx == hctx" such that 
> blk_mq_queue_tag_busy_iter() only calls the callback function for 
> matching (hctx, rq) pairs.

OK, I see. I assumed that rq->mq_hctx was statically set when we 
initially allocate the static requests per hctx; but that doesn't 
appear to be the case - it's set in 
blk_mq_get_request()->blk_mq_rq_ctx_init().

> 
>> Another example was __blk_mq_tag_idle(), which looks problematic.
> 
> Please elaborate.

Again, this was for the same reason being that I thought we could not 
differentiate which rqs were associated with which hctx.

> 
>> For debugfs, when we examine 
>> /sys/kernel/debug/block/.../hctxX/tags_bitmap, wouldn't that be the 
>> tags for all hctx (hctx0)?
>>
>> For debugging reasons, I would say we want to know which tags are 
>> allocated for a specific hctx, as this is tightly related to the 
>> requests for that hctx.
> 
> That is an open issue in the patch I posted and something that needs to 
> be addressed. One way to address this is to change the 
> sbitmap_bitmap_show() calls into calls to a function that only shows 
> those bits for which rq->mq_hctx == hctx.

Yeah, understood.

> 
>>> @@ -341,8 +341,11 @@ void blk_mq_tagset_busy_iter(struct 
>>> blk_mq_tag_set *tagset,
>>>       int i;
>>>
>>>       for (i = 0; i < tagset->nr_hw_queues; i++) {
>>> -        if (tagset->tags && tagset->tags[i])
>>> +        if (tagset->tags && tagset->tags[i]) {
>>>               blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
>>
>> As I mentioned earlier, wouldn't this iterate over all tags for all 
>> hctx's, when we just want the tags for hctx[i]?
>>
>> Thanks,
>> John
>>
>> [Not trimming reply for future reference]
>>
>>> +            if (tagset->share_tags)
>>> +                break;
>>> +        }
>>>       }
>>>   }
>>>   EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
> 
> Since blk_mq_tagset_busy_iter() loops over all hardware queues, all 
> that changes is the order in which requests are examined. I am not 
> aware of any block driver that calls blk_mq_tagset_busy_iter() and 
> that depends on the order of the requests passed to the callback 
> function.
> 

OK, fine.

So, to me, this approach also seems viable then.

I am however not so happy with how we use blk_mq_tag_set.tags[0] for the 
shared tags; I would like to use blk_mq_tag_set.shared_tags and make 
blk_mq_tag_set.tags[] point at blk_mq_tag_set.shared_tags, or maybe drop 
blk_mq_tag_set.tags[] entirely. However, that change may be more 
intrusive.

And another, more concrete concern is that we might miss a check 
somewhere for rq->mq_hctx == hctx when examining the bits on the shared 
tags.

Thanks,
John

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset
  2019-11-18 10:31                     ` John Garry
@ 2019-11-19  9:26                       ` John Garry
  0 siblings, 0 replies; 22+ messages in thread
From: John Garry @ 2019-11-19  9:26 UTC (permalink / raw)
  To: Bart Van Assche, Hannes Reinecke, axboe, jejb, martin.petersen
  Cc: linux-block, linux-kernel, linux-scsi, ming.lei, hare, chenxiang (M)

>>
>>>> @@ -341,8 +341,11 @@ void blk_mq_tagset_busy_iter(struct 
>>>> blk_mq_tag_set *tagset,
>>>>       int i;
>>>>
>>>>       for (i = 0; i < tagset->nr_hw_queues; i++) {
>>>> -        if (tagset->tags && tagset->tags[i])
>>>> +        if (tagset->tags && tagset->tags[i]) {
>>>>               blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
>>>
>>> As I mentioned earlier, wouldn't this iterate over all tags for all 
>>> hctx's, when we just want the tags for hctx[i]?
>>>
>>> Thanks,
>>> John
>>>
>>> [Not trimming reply for future reference]
>>>
>>>> +            if (tagset->share_tags)
>>>> +                break;
>>>> +        }
>>>>       }
>>>>   }
>>>>   EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
>>
>> Since blk_mq_tagset_busy_iter() loops over all hardware queues, all 
>> that changes is the order in which requests are examined. I am not 
>> aware of any block driver that calls blk_mq_tagset_busy_iter() and 
>> that depends on the order of the requests passed to the callback 
>> function.
>>
> 
> OK, fine.
> 
> So, to me, this approach also seems viable then.
> 
> I am however not so happy with how we use blk_mq_tag_set.tags[0] for the 
> shared tags; I would like to use blk_mq_tag_set.shared_tags and make 
> blk_mq_tag_set.tags[] point at blk_mq_tag_set.shared_tags, or maybe drop 
> blk_mq_tag_set.tags[] entirely. However, that change may be more 
> intrusive.
> 
> And another, more concrete concern is that we might miss a check 
> somewhere for rq->mq_hctx == hctx when examining the bits on the shared 
> tags.

Another issue I notice is that using tags from just hctx0 may cause 
breakage when they are associated with a different hw queue in the 
driver.

Specifically, we may have blk_mq_alloc_rqs(hctx_idx = 
0)->blk_mq_init_request()->nvme_init_request(), which would set all 
iod->nvmeq = nvmeq0; since we may actually use this iod on another hw 
queue, we're breaking that interface.

Thanks,
John



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2019-11-19  9:27 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-11-13 13:36 [PATCH RFC 0/5] blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs John Garry
2019-11-13 13:36 ` [PATCH RFC 1/5] blk-mq: Remove some unused function arguments John Garry
2019-11-13 13:58   ` Hannes Reinecke
2019-11-13 13:36 ` [PATCH RFC 2/5] blk-mq: rename BLK_MQ_F_TAG_SHARED as BLK_MQ_F_TAG_QUEUE_SHARED John Garry
2019-11-13 13:58   ` Hannes Reinecke
2019-11-13 13:36 ` [PATCH RFC 3/5] blk-mq: Facilitate a shared tags per tagset John Garry
2019-11-13 14:06   ` Hannes Reinecke
2019-11-13 14:57     ` John Garry
2019-11-13 15:38       ` Hannes Reinecke
2019-11-13 16:21         ` John Garry
2019-11-13 18:38           ` Hannes Reinecke
2019-11-14  9:41             ` John Garry
2019-11-15  5:30               ` Bart Van Assche
2019-11-15  7:29                 ` Hannes Reinecke
2019-11-15 10:24                 ` John Garry
2019-11-15 17:57                   ` Bart Van Assche
2019-11-18 10:31                     ` John Garry
2019-11-19  9:26                       ` John Garry
2019-11-15  7:26               ` Hannes Reinecke
2019-11-15 10:46                 ` John Garry
2019-11-13 13:36 ` [PATCH RFC 4/5] scsi: Add template flag 'host_tagset' John Garry
2019-11-13 13:36 ` [PATCH RFC 5/5] scsi: hisi_sas: Switch v3 hw to MQ John Garry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).