* [PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx
@ 2021-06-29  7:49 ` Ming Lei
  0 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2021-06-29  7:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Ming Lei, Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry

Hi,

blk_mq_alloc_request_hctx() is used by NVMe fc/rdma/tcp/loop to
connect I/O queues, and the sw ctx is chosen as the first online CPU in
hctx->cpumask. However, all CPUs in hctx->cpumask may be offline.

This usage model isn't well supported by blk-mq, which assumes
allocation is always done on an online CPU in hctx->cpumask. That
assumption is tied to managed IRQs, which also require blk-mq to drain
inflight requests in a hctx when the last CPU in hctx->cpumask is about
to go offline.

However, NVMe fc/rdma/tcp/loop don't use managed IRQs, so we should
allow them to allocate requests even when the specified hctx is
inactive (all CPUs in hctx->cpumask are offline).

Fix blk_mq_alloc_request_hctx() by adding a new flag,
BLK_MQ_F_NOT_USE_MANAGED_IRQ, and passing it from the drivers.
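
For reference, a rough sketch of how the fabrics connect path ends up
in blk_mq_alloc_request_hctx() (paraphrased from memory of the NVMe
core of this era, so treat the helper name and details as illustrative
rather than exact):

	/*
	 * Illustrative sketch only: the connect command for I/O queue
	 * 'qid' must be allocated on hctx 'qid - 1' (qid 0 is the admin
	 * queue), no matter which CPUs in that hctx's cpumask are still
	 * online. Connect uses a reserved tag and must not block.
	 */
	static struct request *connect_rq_for_qid(struct request_queue *q,
						  unsigned int qid)
	{
		return blk_mq_alloc_request_hctx(q, REQ_OP_DRV_OUT,
				BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED,
				qid - 1);
	}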


Ming Lei (2):
  blk-mq: not deactivate hctx if the device doesn't use managed irq
  nvme: pass BLK_MQ_F_NOT_USE_MANAGED_IRQ for fc/rdma/tcp/loop

 block/blk-mq.c             | 6 +++++-
 drivers/nvme/host/fc.c     | 3 ++-
 drivers/nvme/host/rdma.c   | 3 ++-
 drivers/nvme/host/tcp.c    | 3 ++-
 drivers/nvme/target/loop.c | 3 ++-
 include/linux/blk-mq.h     | 1 +
 6 files changed, 14 insertions(+), 5 deletions(-)

Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Daniel Wagner <dwagner@suse.de>
Cc: Wen Xiong <wenxiong@us.ibm.com>
Cc: John Garry <john.garry@huawei.com>


-- 
2.31.1



* [PATCH 1/2] blk-mq: not deactivate hctx if the device doesn't use managed irq
  2021-06-29  7:49 ` Ming Lei
@ 2021-06-29  7:49   ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2021-06-29  7:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Ming Lei, Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry

A hctx is deactivated when all CPUs in hctx->cpumask go offline: all
requests originating from the hctx are drained, and new allocations are
moved to an active hctx. This avoids inflight I/O while the managed IRQ
is shut down.

Some drivers (nvme fc, rdma, tcp, loop) don't use managed IRQs, so
they don't need to deactivate the hctx. They are also the only users of
blk_mq_alloc_request_hctx(), which is used for connecting I/O queues,
and they require that the connect request can be submitted via one
specified hctx even when all CPUs in its hctx->cpumask have gone
offline.

Address this requirement for nvme fc/rdma/tcp/loop, so the reported
kernel panic on the following line in blk_mq_alloc_request_hctx() can
be fixed:

	data.ctx = __blk_mq_get_ctx(q, cpu)
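
For context, a minimal illustration of the failure mode (my reading of
the pre-patch code, not text taken from the patch): when hctx->cpumask
and cpu_online_mask don't intersect, cpumask_first_and() returns
nr_cpu_ids, and looking up the per-CPU sw ctx for that out-of-range CPU
is what crashes:

	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
	/* all CPUs in hctx->cpumask offline -> cpu == nr_cpu_ids */
	data.ctx = __blk_mq_get_ctx(q, cpu);	/* bogus per-CPU access */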

Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Daniel Wagner <dwagner@suse.de>
Cc: Wen Xiong <wenxiong@us.ibm.com>
Cc: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq.c         | 6 +++++-
 include/linux/blk-mq.h | 1 +
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index df5dc3b756f5..74632f50d969 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -494,7 +494,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	data.hctx = q->queue_hw_ctx[hctx_idx];
 	if (!blk_mq_hw_queue_mapped(data.hctx))
 		goto out_queue_exit;
-	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
+	cpu = cpumask_first(data.hctx->cpumask);
 	data.ctx = __blk_mq_get_ctx(q, cpu);
 
 	if (!q->elevator)
@@ -2570,6 +2570,10 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
 	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
 		return 0;
 
+	/* Controller doesn't use managed IRQ, no need to deactivate hctx */
+	if (hctx->flags & BLK_MQ_F_NOT_USE_MANAGED_IRQ)
+		return 0;
+
 	/*
 	 * Prevent new request from being allocated on the current hctx.
 	 *
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 21140132a30d..600c5dd1a069 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -403,6 +403,7 @@ enum {
 	 */
 	BLK_MQ_F_STACKING	= 1 << 2,
 	BLK_MQ_F_TAG_HCTX_SHARED = 1 << 3,
+	BLK_MQ_F_NOT_USE_MANAGED_IRQ = 1 << 4,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
-- 
2.31.1



* [PATCH 2/2] nvme: pass BLK_MQ_F_NOT_USE_MANAGED_IRQ for fc/rdma/tcp/loop
  2021-06-29  7:49 ` Ming Lei
@ 2021-06-29  7:49   ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2021-06-29  7:49 UTC (permalink / raw)
  To: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Ming Lei, Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry

None of these four host drivers uses managed IRQs for completing
requests, so it is correct to pass the flag to blk-mq.

Secondly, with this flag set, blk-mq will dispatch connect requests
allocated via blk_mq_alloc_request_hctx() to the driver even when all
CPUs in the specified hctx's cpumask are offline.

Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Daniel Wagner <dwagner@suse.de>
Cc: Wen Xiong <wenxiong@us.ibm.com>
Cc: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/nvme/host/fc.c     | 3 ++-
 drivers/nvme/host/rdma.c   | 3 ++-
 drivers/nvme/host/tcp.c    | 3 ++-
 drivers/nvme/target/loop.c | 3 ++-
 4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 256e87721a01..c563a2b6e9fc 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2876,7 +2876,8 @@ nvme_fc_create_io_queues(struct nvme_fc_ctrl *ctrl)
 	ctrl->tag_set.queue_depth = ctrl->ctrl.opts->queue_size;
 	ctrl->tag_set.reserved_tags = NVMF_RESERVED_TAGS;
 	ctrl->tag_set.numa_node = ctrl->ctrl.numa_node;
-	ctrl->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	ctrl->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
+		BLK_MQ_F_NOT_USE_MANAGED_IRQ;
 	ctrl->tag_set.cmd_size =
 		struct_size((struct nvme_fcp_op_w_sgl *)NULL, priv,
 			    ctrl->lport->ops->fcprqst_priv_sz);
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 37943dc4c2c1..4b7bdc829109 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -817,7 +817,8 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->queue_depth = nctrl->sqsize + 1;
 		set->reserved_tags = NVMF_RESERVED_TAGS;
 		set->numa_node = nctrl->numa_node;
-		set->flags = BLK_MQ_F_SHOULD_MERGE;
+		set->flags = BLK_MQ_F_SHOULD_MERGE |
+			BLK_MQ_F_NOT_USE_MANAGED_IRQ;
 		set->cmd_size = sizeof(struct nvme_rdma_request) +
 				NVME_RDMA_DATA_SGL_SIZE;
 		if (nctrl->max_integrity_segments)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 34f4b3402f7c..0125463b7d77 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1600,7 +1600,8 @@ static struct blk_mq_tag_set *nvme_tcp_alloc_tagset(struct nvme_ctrl *nctrl,
 		set->queue_depth = nctrl->sqsize + 1;
 		set->reserved_tags = NVMF_RESERVED_TAGS;
 		set->numa_node = nctrl->numa_node;
-		set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING;
+		set->flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_BLOCKING |
+			BLK_MQ_F_NOT_USE_MANAGED_IRQ;
 		set->cmd_size = sizeof(struct nvme_tcp_request);
 		set->driver_data = ctrl;
 		set->nr_hw_queues = nctrl->queue_count - 1;
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index cb30cb942e1d..bf032249e010 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -524,7 +524,8 @@ static int nvme_loop_create_io_queues(struct nvme_loop_ctrl *ctrl)
 	ctrl->tag_set.queue_depth = ctrl->ctrl.opts->queue_size;
 	ctrl->tag_set.reserved_tags = NVMF_RESERVED_TAGS;
 	ctrl->tag_set.numa_node = ctrl->ctrl.numa_node;
-	ctrl->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
+	ctrl->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
+		BLK_MQ_F_NOT_USE_MANAGED_IRQ;
 	ctrl->tag_set.cmd_size = sizeof(struct nvme_loop_iod) +
 		NVME_INLINE_SG_CNT * sizeof(struct scatterlist);
 	ctrl->tag_set.driver_data = ctrl;
-- 
2.31.1



* Re: [PATCH 1/2] blk-mq: not deactivate hctx if the device doesn't use managed irq
  2021-06-29  7:49   ` Ming Lei
@ 2021-06-29 12:39     ` Hannes Reinecke
  -1 siblings, 0 replies; 50+ messages in thread
From: Hannes Reinecke @ 2021-06-29 12:39 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry

On 6/29/21 9:49 AM, Ming Lei wrote:
> hctx is deactivated when all CPU in hctx->cpumask become offline by
> draining all requests originated from this hctx and moving new
> allocation to active hctx. This way is for avoiding inflight IO when
> the managed irq is shutdown.
> 
> Some drivers(nvme fc, rdma, tcp, loop) doesn't use managed irq, so
> they needn't to deactivate hctx. Also, they are the only user of
> blk_mq_alloc_request_hctx() which is used for connecting io queue.
> And their requirement is that the connect request can be submitted
> via one specified hctx on which all CPU in its hctx->cpumask may have
> become offline.
> 

How can you submit a connect request for a hctx on which all CPUs are 
offline? That hctx will be unusable as it'll never be able to receive 
interrupts ...

> Address the requirement for nvme fc/rdma/loop, so the reported kernel
> panic on the following line in blk_mq_alloc_request_hctx() can be fixed.
> 
> 	data.ctx = __blk_mq_get_ctx(q, cpu)
> 
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Daniel Wagner <dwagner@suse.de>
> Cc: Wen Xiong <wenxiong@us.ibm.com>
> Cc: John Garry <john.garry@huawei.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c         | 6 +++++-
>   include/linux/blk-mq.h | 1 +
>   2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index df5dc3b756f5..74632f50d969 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -494,7 +494,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>   	data.hctx = q->queue_hw_ctx[hctx_idx];
>   	if (!blk_mq_hw_queue_mapped(data.hctx))
>   		goto out_queue_exit;
> -	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
> +	cpu = cpumask_first(data.hctx->cpumask);
>   	data.ctx = __blk_mq_get_ctx(q, cpu);

I don't get it.
Doesn't this allow us to allocate a request on a dead CPU, i.e. the
very thing we try to prevent?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 1/2] blk-mq: not deactivate hctx if the device doesn't use managed irq
  2021-06-29 12:39     ` Hannes Reinecke
@ 2021-06-29 14:17       ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2021-06-29 14:17 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry

On Tue, Jun 29, 2021 at 02:39:14PM +0200, Hannes Reinecke wrote:
> On 6/29/21 9:49 AM, Ming Lei wrote:
> > hctx is deactivated when all CPU in hctx->cpumask become offline by
> > draining all requests originated from this hctx and moving new
> > allocation to active hctx. This way is for avoiding inflight IO when
> > the managed irq is shutdown.
> > 
> > Some drivers(nvme fc, rdma, tcp, loop) doesn't use managed irq, so
> > they needn't to deactivate hctx. Also, they are the only user of
> > blk_mq_alloc_request_hctx() which is used for connecting io queue.
> > And their requirement is that the connect request can be submitted
> > via one specified hctx on which all CPU in its hctx->cpumask may have
> > become offline.
> > 
> 
> How can you submit a connect request for a hctx on which all CPUs are
> offline? That hctx will be unusable as it'll never be able to receive
> interrupts ...

I believe BLK_MQ_F_NOT_USE_MANAGED_IRQ is self-explanatory. The
(non-managed) interrupt of this hctx will be migrated to online CPUs;
see migrate_one_irq().

For managed IRQs, we have to prevent new allocations when all CPUs of
the hctx are offline, because genirq will shut down the interrupt.
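
To spell out the difference, a condensed pseudo-C sketch of what genirq
does when a CPU goes down (paraphrasing kernel/irq/cpuhotplug.c from
memory; last_cpu_in_affinity_going_down() is made up for the
illustration):

	/* illustrative pseudo-C, not verbatim kernel code */
	static void offline_one_irq(struct irq_desc *desc)
	{
		struct irq_data *d = irq_desc_get_irq_data(desc);

		if (irqd_affinity_is_managed(d) &&
		    last_cpu_in_affinity_going_down(d)) {
			/* managed: shut the interrupt down; it is
			 * restarted when a CPU in its affinity mask
			 * comes back online */
			irq_shutdown(desc);
		} else {
			/* non-managed: migrate the effective affinity
			 * to one of the remaining online CPUs */
			irq_do_set_affinity(d, cpu_online_mask, false);
		}
	}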

> 
> > Address the requirement for nvme fc/rdma/loop, so the reported kernel
> > panic on the following line in blk_mq_alloc_request_hctx() can be fixed.
> > 
> > 	data.ctx = __blk_mq_get_ctx(q, cpu)
> > 
> > Cc: Sagi Grimberg <sagi@grimberg.me>
> > Cc: Daniel Wagner <dwagner@suse.de>
> > Cc: Wen Xiong <wenxiong@us.ibm.com>
> > Cc: John Garry <john.garry@huawei.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq.c         | 6 +++++-
> >   include/linux/blk-mq.h | 1 +
> >   2 files changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index df5dc3b756f5..74632f50d969 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -494,7 +494,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
> >   	data.hctx = q->queue_hw_ctx[hctx_idx];
> >   	if (!blk_mq_hw_queue_mapped(data.hctx))
> >   		goto out_queue_exit;
> > -	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
> > +	cpu = cpumask_first(data.hctx->cpumask);
> >   	data.ctx = __blk_mq_get_ctx(q, cpu);
> 
> I don't get it.
> Doesn't this allow us to allocate a request on a dead cpu, ie the very thing
> we try to prevent?

It is fine to allocate and dispatch a request to the hctx when all
CPUs in its cpumask are offline, as long as this hctx's interrupt isn't
managed.


Thanks,
Ming



* Re: [PATCH 1/2] blk-mq: not deactivate hctx if the device doesn't use managed irq
  2021-06-29  7:49   ` Ming Lei
@ 2021-06-29 15:49     ` John Garry
  -1 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2021-06-29 15:49 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Sagi Grimberg, Daniel Wagner, Wen Xiong

On 29/06/2021 08:49, Ming Lei wrote:
> hctx is deactivated when all CPU in hctx->cpumask become offline by
> draining all requests originated from this hctx and moving new
> allocation to active hctx. This way is for avoiding inflight IO when
> the managed irq is shutdown.
> 
> Some drivers(nvme fc, rdma, tcp, loop) doesn't use managed irq, so
> they needn't to deactivate hctx. Also, they are the only user of
> blk_mq_alloc_request_hctx() which is used for connecting io queue.
> And their requirement is that the connect request can be submitted
> via one specified hctx on which all CPU in its hctx->cpumask may have
> become offline.
> 
> Address the requirement for nvme fc/rdma/loop, so the reported kernel
> panic on the following line in blk_mq_alloc_request_hctx() can be fixed.
> 
> 	data.ctx = __blk_mq_get_ctx(q, cpu)
> 
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Daniel Wagner <dwagner@suse.de>
> Cc: Wen Xiong <wenxiong@us.ibm.com>
> Cc: John Garry <john.garry@huawei.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>   block/blk-mq.c         | 6 +++++-
>   include/linux/blk-mq.h | 1 +
>   2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index df5dc3b756f5..74632f50d969 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -494,7 +494,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>   	data.hctx = q->queue_hw_ctx[hctx_idx];
>   	if (!blk_mq_hw_queue_mapped(data.hctx))
>   		goto out_queue_exit;
> -	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
> +	cpu = cpumask_first(data.hctx->cpumask);
>   	data.ctx = __blk_mq_get_ctx(q, cpu);
>   
>   	if (!q->elevator)
> @@ -2570,6 +2570,10 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
>   	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
>   		return 0;
>   
> +	/* Controller doesn't use managed IRQ, no need to deactivate hctx */
> +	if (hctx->flags & BLK_MQ_F_NOT_USE_MANAGED_IRQ)
> +		return 0;
Is there anything to be gained in registering the CPU hotplug handler 
for the hctx in this case at all?

> +
>   	/*
>   	 * Prevent new request from being allocated on the current hctx.
>   	 *
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 21140132a30d..600c5dd1a069 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -403,6 +403,7 @@ enum {
>   	 */
>   	BLK_MQ_F_STACKING	= 1 << 2,
>   	BLK_MQ_F_TAG_HCTX_SHARED = 1 << 3,
> +	BLK_MQ_F_NOT_USE_MANAGED_IRQ = 1 << 4,

Many block drivers don't use managed interrupts - to be proper, why
not set this flag everywhere that managed interrupts aren't used? I
know why, but it's odd.

As an alternative, if the default queue mapping was used (in 
blk_mq_map_queues()), then that's the same thing as 
BLK_MQ_F_NOT_USE_MANAGED_IRQ in reality, right? If so, could we 
alternatively check for that somehow?

Thanks,
John

>   	BLK_MQ_F_BLOCKING	= 1 << 5,
>   	BLK_MQ_F_NO_SCHED	= 1 << 6,
>   	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
> 



* Re: [PATCH 1/2] blk-mq: not deactivate hctx if the device doesn't use managed irq
  2021-06-29  7:49   ` Ming Lei
@ 2021-06-29 23:30     ` Damien Le Moal
  -1 siblings, 0 replies; 50+ messages in thread
From: Damien Le Moal @ 2021-06-29 23:30 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry

On 2021/06/29 16:50, Ming Lei wrote:
> hctx is deactivated when all CPU in hctx->cpumask become offline by
> draining all requests originated from this hctx and moving new
> allocation to active hctx. This way is for avoiding inflight IO when
> the managed irq is shutdown.
> 
> Some drivers(nvme fc, rdma, tcp, loop) doesn't use managed irq, so
> they needn't to deactivate hctx. Also, they are the only user of
> blk_mq_alloc_request_hctx() which is used for connecting io queue.
> And their requirement is that the connect request can be submitted
> via one specified hctx on which all CPU in its hctx->cpumask may have
> become offline.
> 
> Address the requirement for nvme fc/rdma/loop, so the reported kernel
> panic on the following line in blk_mq_alloc_request_hctx() can be fixed.
> 
> 	data.ctx = __blk_mq_get_ctx(q, cpu)
> 
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Daniel Wagner <dwagner@suse.de>
> Cc: Wen Xiong <wenxiong@us.ibm.com>
> Cc: John Garry <john.garry@huawei.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq.c         | 6 +++++-
>  include/linux/blk-mq.h | 1 +
>  2 files changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index df5dc3b756f5..74632f50d969 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -494,7 +494,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>  	data.hctx = q->queue_hw_ctx[hctx_idx];
>  	if (!blk_mq_hw_queue_mapped(data.hctx))
>  		goto out_queue_exit;
> -	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
> +	cpu = cpumask_first(data.hctx->cpumask);
>  	data.ctx = __blk_mq_get_ctx(q, cpu);
>  
>  	if (!q->elevator)
> @@ -2570,6 +2570,10 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
>  	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
>  		return 0;
>  
> +	/* Controller doesn't use managed IRQ, no need to deactivate hctx */
> +	if (hctx->flags & BLK_MQ_F_NOT_USE_MANAGED_IRQ)
> +		return 0;
> +
>  	/*
>  	 * Prevent new request from being allocated on the current hctx.
>  	 *
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 21140132a30d..600c5dd1a069 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -403,6 +403,7 @@ enum {
>  	 */
>  	BLK_MQ_F_STACKING	= 1 << 2,
>  	BLK_MQ_F_TAG_HCTX_SHARED = 1 << 3,
> +	BLK_MQ_F_NOT_USE_MANAGED_IRQ = 1 << 4,

My 2 cents: BLK_MQ_F_NO_MANAGED_IRQ may be a better/shorter name.

>  	BLK_MQ_F_BLOCKING	= 1 << 5,
>  	BLK_MQ_F_NO_SCHED	= 1 << 6,
>  	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
> 


-- 
Damien Le Moal
Western Digital Research


* Re: [PATCH 1/2] blk-mq: not deactivate hctx if the device doesn't use managed irq
  2021-06-29 15:49     ` John Garry
@ 2021-06-30  0:32       ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2021-06-30  0:32 UTC (permalink / raw)
  To: John Garry
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Daniel Wagner, Wen Xiong

On Tue, Jun 29, 2021 at 04:49:56PM +0100, John Garry wrote:
> On 29/06/2021 08:49, Ming Lei wrote:
> > hctx is deactivated when all CPU in hctx->cpumask become offline by
> > draining all requests originated from this hctx and moving new
> > allocation to active hctx. This way is for avoiding inflight IO when
> > the managed irq is shutdown.
> > 
> > Some drivers(nvme fc, rdma, tcp, loop) doesn't use managed irq, so
> > they needn't to deactivate hctx. Also, they are the only user of
> > blk_mq_alloc_request_hctx() which is used for connecting io queue.
> > And their requirement is that the connect request can be submitted
> > via one specified hctx on which all CPU in its hctx->cpumask may have
> > become offline.
> > 
> > Address the requirement for nvme fc/rdma/loop, so the reported kernel
> > panic on the following line in blk_mq_alloc_request_hctx() can be fixed.
> > 
> > 	data.ctx = __blk_mq_get_ctx(q, cpu)
> > 
> > Cc: Sagi Grimberg <sagi@grimberg.me>
> > Cc: Daniel Wagner <dwagner@suse.de>
> > Cc: Wen Xiong <wenxiong@us.ibm.com>
> > Cc: John Garry <john.garry@huawei.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >   block/blk-mq.c         | 6 +++++-
> >   include/linux/blk-mq.h | 1 +
> >   2 files changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index df5dc3b756f5..74632f50d969 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -494,7 +494,7 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
> >   	data.hctx = q->queue_hw_ctx[hctx_idx];
> >   	if (!blk_mq_hw_queue_mapped(data.hctx))
> >   		goto out_queue_exit;
> > -	cpu = cpumask_first_and(data.hctx->cpumask, cpu_online_mask);
> > +	cpu = cpumask_first(data.hctx->cpumask);
> >   	data.ctx = __blk_mq_get_ctx(q, cpu);
> >   	if (!q->elevator)
> > @@ -2570,6 +2570,10 @@ static int blk_mq_hctx_notify_offline(unsigned int cpu, struct hlist_node *node)
> >   	    !blk_mq_last_cpu_in_hctx(cpu, hctx))
> >   		return 0;
> > +	/* Controller doesn't use managed IRQ, no need to deactivate hctx */
> > +	if (hctx->flags & BLK_MQ_F_NOT_USE_MANAGED_IRQ)
> > +		return 0;
> Is there anything to be gained in registering the CPU hotplug handler for
> the hctx in this case at all?

Indeed, we could also replace BLK_MQ_F_STACKING with
BLK_MQ_F_NOT_USE_MANAGED_IRQ (or BLK_MQ_F_NO_MANAGED_IRQ), the latter
being the more general solution.

> 
> > +
> >   	/*
> >   	 * Prevent new request from being allocated on the current hctx.
> >   	 *
> > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> > index 21140132a30d..600c5dd1a069 100644
> > --- a/include/linux/blk-mq.h
> > +++ b/include/linux/blk-mq.h
> > @@ -403,6 +403,7 @@ enum {
> >   	 */
> >   	BLK_MQ_F_STACKING	= 1 << 2,
> >   	BLK_MQ_F_TAG_HCTX_SHARED = 1 << 3,
> > +	BLK_MQ_F_NOT_USE_MANAGED_IRQ = 1 << 4,
> 
> Many block drivers don't use managed interrupts - to be proper, why not set
> this everywhere (which doesn't use managed interrupts)? I know why, but it's
> odd.

It would just be a small optimization in the slow path for other
drivers, and I'm not sure those drivers are interested in such a
change. It only serves as a fix for callers of
blk_mq_alloc_request_hctx().

Anyway, we can document the situation.

> 
> As an alternative, if the default queue mapping was used (in
> blk_mq_map_queues()), then that's the same thing as
> BLK_MQ_F_NOT_USE_MANAGED_IRQ in reality, right? If so, could we
> alternatively check for that somehow?

This approach can't work: in NVMe PCI, for example, managed IRQs are
used for the default/write queues, but the poll queues use
blk_mq_map_queues().

Also, it can't cover all cases, such as NVMe RDMA.

Whether to use managed IRQs is the driver's choice, and I'm not sure
it's a good idea for the block layer to provide such an abstraction.
I'd suggest still fixing the issue by passing one flag, given we don't
need it everywhere so far.
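
To make the NVMe PCI example concrete, its map_queues callback mixes
both mapping styles, roughly like the sketch below (simplified from
what I remember of drivers/nvme/host/pci.c; the real driver also deals
with queue offsets and the no-poll-queue case):

	/* simplified sketch; using the pci_dev as driver_data is an
	 * assumption made for brevity */
	static int map_queues_sketch(struct blk_mq_tag_set *set)
	{
		struct pci_dev *pdev = set->driver_data;
		int i;

		for (i = 0; i < set->nr_maps; i++) {
			struct blk_mq_queue_map *map = &set->map[i];

			if (i == HCTX_TYPE_POLL)
				/* poll queues have no IRQ: generic mapping */
				blk_mq_map_queues(map);
			else
				/* default/read queues follow the managed-IRQ
				 * affinity laid out by PCI MSI-X */
				blk_mq_pci_map_queues(map, pdev, 0);
		}
		return 0;
	}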

 
Thanks,
Ming



* Re: [PATCH 2/2] nvme: pass BLK_MQ_F_NOT_USE_MANAGED_IRQ for fc/rdma/tcp/loop
  2021-06-29  7:49   ` Ming Lei
@ 2021-06-30  8:15     ` Hannes Reinecke
  -1 siblings, 0 replies; 50+ messages in thread
From: Hannes Reinecke @ 2021-06-30  8:15 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry

On 6/29/21 9:49 AM, Ming Lei wrote:
> All the 4 host drivers don't use managed irq for completing request, so
> it is correct to pass the flag to blk-mq.
> 
> Secondly with this flag, blk-mq will help us dispatch connect request
> allocated via blk_mq_alloc_request_hctx() to driver even though all
> CPU in the specified hctx's cpumask are offline.
> 
How is this supposed to work?
To my understanding, only CPUs in the hctx cpumask are eligible to send 
I/O. So if all of these CPUs are offline, who exactly is submitting I/O?
More to the point, which CPU will be submitting the connect request?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx
  2021-06-29  7:49 ` Ming Lei
@ 2021-06-30  8:18   ` Hannes Reinecke
  -1 siblings, 0 replies; 50+ messages in thread
From: Hannes Reinecke @ 2021-06-30  8:18 UTC (permalink / raw)
  To: Ming Lei, Jens Axboe, linux-block, linux-nvme, Christoph Hellwig
  Cc: Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry

On 6/29/21 9:49 AM, Ming Lei wrote:
> Hi,
> 
> blk_mq_alloc_request_hctx() is used by NVMe fc/rdma/tcp/loop to connect
> io queue. Also the sw ctx is chosen as the 1st online cpu in hctx->cpumask.
> However, all cpus in hctx->cpumask may be offline.
> 
> This usage model isn't well supported by blk-mq which supposes allocator is
> always done on one online CPU in hctx->cpumask. This assumption is
> related with managed irq, which also requires blk-mq to drain inflight
> request in this hctx when the last cpu in hctx->cpumask is going to
> offline.
> 
> However, NVMe fc/rdma/tcp/loop don't use managed irq, so we should allow
> them to ask for request allocation when the specified hctx is inactive
> (all cpus in hctx->cpumask are offline).
> 
> Fix blk_mq_alloc_request_hctx() by adding/passing flag of
> BLK_MQ_F_NOT_USE_MANAGED_IRQ.
> 
> 
> Ming Lei (2):
>    blk-mq: not deactivate hctx if the device doesn't use managed irq
>    nvme: pass BLK_MQ_F_NOT_USE_MANAGED_IRQ for fc/rdma/tcp/loop
> 
>   block/blk-mq.c             | 6 +++++-
>   drivers/nvme/host/fc.c     | 3 ++-
>   drivers/nvme/host/rdma.c   | 3 ++-
>   drivers/nvme/host/tcp.c    | 3 ++-
>   drivers/nvme/target/loop.c | 3 ++-
>   include/linux/blk-mq.h     | 1 +
>   6 files changed, 14 insertions(+), 5 deletions(-)
> 
> Cc: Sagi Grimberg <sagi@grimberg.me>
> Cc: Daniel Wagner <dwagner@suse.de>
> Cc: Wen Xiong <wenxiong@us.ibm.com>
> Cc: John Garry <john.garry@huawei.com>
> 
> 
I have my misgivings about this patchset.
To my understanding, only CPUs present in the hctx cpumask are eligible 
to submit I/O to that hctx.
Consequently, if all CPUs in that mask are offline, what is the point
of even transmitting a 'connect' request?
Shouldn't we rather modify the tagset to refer to the currently online
CPUs _only_, thereby never submitting a connect request for a hctx with
only offline CPUs?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx
  2021-06-30  8:18   ` Hannes Reinecke
@ 2021-06-30  8:42     ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2021-06-30  8:42 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry

On Wed, Jun 30, 2021 at 10:18:37AM +0200, Hannes Reinecke wrote:
> On 6/29/21 9:49 AM, Ming Lei wrote:
> > Hi,
> > 
> > blk_mq_alloc_request_hctx() is used by NVMe fc/rdma/tcp/loop to connect
> > io queue. Also the sw ctx is chosen as the 1st online cpu in hctx->cpumask.
> > However, all cpus in hctx->cpumask may be offline.
> > 
> > This usage model isn't well supported by blk-mq which supposes allocator is
> > always done on one online CPU in hctx->cpumask. This assumption is
> > related with managed irq, which also requires blk-mq to drain inflight
> > request in this hctx when the last cpu in hctx->cpumask is going to
> > offline.
> > 
> > However, NVMe fc/rdma/tcp/loop don't use managed irq, so we should allow
> > them to ask for request allocation when the specified hctx is inactive
> > (all cpus in hctx->cpumask are offline).
> > 
> > Fix blk_mq_alloc_request_hctx() by adding/passing flag of
> > BLK_MQ_F_NOT_USE_MANAGED_IRQ.
> > 
> > 
> > Ming Lei (2):
> >    blk-mq: not deactivate hctx if the device doesn't use managed irq
> >    nvme: pass BLK_MQ_F_NOT_USE_MANAGED_IRQ for fc/rdma/tcp/loop
> > 
> >   block/blk-mq.c             | 6 +++++-
> >   drivers/nvme/host/fc.c     | 3 ++-
> >   drivers/nvme/host/rdma.c   | 3 ++-
> >   drivers/nvme/host/tcp.c    | 3 ++-
> >   drivers/nvme/target/loop.c | 3 ++-
> >   include/linux/blk-mq.h     | 1 +
> >   6 files changed, 14 insertions(+), 5 deletions(-)
> > 
> > Cc: Sagi Grimberg <sagi@grimberg.me>
> > Cc: Daniel Wagner <dwagner@suse.de>
> > Cc: Wen Xiong <wenxiong@us.ibm.com>
> > Cc: John Garry <john.garry@huawei.com>
> > 
> > 
> I have my misgivings about this patchset.
> To my understanding, only CPUs present in the hctx cpumask are eligible to
> submit I/O to that hctx.

That is only true for managed irq, and strictly speaking it is the
online CPUs in the mask.

However, there is no such constraint for non-managed irq, since the irq
may migrate to other online CPUs if all CPUs in the irq's current
affinity become offline.

> Consequently, if all cpus in that mask are offline, what is the point of
> even transmitting a 'connect' request?

nvmef requires the connect request to be submitted via one specified
hctx whose index has to match the io queue's index.

Almost all nvmef drivers fail to set up the controller in case of a
connect io queue error.

Also, CPUs can go offline & online, especially in lots of sanity tests.

So we should allow the connect request to be allocated successfully and
submitted to drivers, given that this is allowed for non-managed irq.
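
For reference, the path in question is roughly the following (a
simplified sketch from memory, command setup elided; the function name
is illustrative, not the literal source):

	/*
	 * Sketch: connecting io queue 'qid' must allocate the connect
	 * request against the hctx with the same index, even if every
	 * CPU in that hctx's cpumask is offline at this point.
	 */
	static int nvmf_connect_io_queue_sketch(struct nvme_ctrl *ctrl, u16 qid)
	{
		struct request *rq;

		/* qid - 1 selects the hw queue backing io queue 'qid' */
		rq = blk_mq_alloc_request_hctx(ctrl->connect_q, REQ_OP_DRV_OUT,
					       BLK_MQ_REQ_NOWAIT, qid - 1);
		if (IS_ERR(rq))
			return PTR_ERR(rq); /* io queue connect fails here */

		/* ... fill in the fabrics connect command and execute it ... */
		blk_mq_free_request(rq);
		return 0;
	}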

> Shouldn't we rather modify the tagset to refer to the currently online
> CPUs _only_, thereby never submitting a connect request for an hctx with
> only offline CPUs?

Then you may set up very few io queues, and performance may suffer even
though lots of CPUs come online later.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 2/2] nvme: pass BLK_MQ_F_NOT_USE_MANAGED_IRQ for fc/rdma/tcp/loop
  2021-06-30  8:15     ` Hannes Reinecke
@ 2021-06-30  8:47       ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2021-06-30  8:47 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry

On Wed, Jun 30, 2021 at 10:15:46AM +0200, Hannes Reinecke wrote:
> On 6/29/21 9:49 AM, Ming Lei wrote:
> > All 4 host drivers don't use managed irqs for completing requests, so
> > it is correct to pass the flag to blk-mq.
> > 
> > Secondly, with this flag, blk-mq will help us dispatch connect requests
> > allocated via blk_mq_alloc_request_hctx() to the driver even though all
> > CPUs in the specified hctx's cpumask are offline.
> > 
> How is this supposed to work?
> To my understanding, only cpus in the hctx cpumask are eligible to send I/O.
> So if all of these CPUs are offline, who exactly is submitting I/O?
> More to the point, which CPU will be submitting the connect request?

Please see __blk_mq_delay_run_hw_queue(), which will pass WORK_CPU_UNBOUND 
to kblockd_mod_delayed_work_on() for this situation.
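
Paraphrased (stopped-queue and delay handling omitted, not the literal
source), the relevant logic is:

	/*
	 * Paraphrased sketch of __blk_mq_delay_run_hw_queue(): if the
	 * caller is not on a CPU in hctx->cpumask, the queue run is
	 * punted to a kblockd work item; blk_mq_hctx_next_cpu() may
	 * return WORK_CPU_UNBOUND, in which case the workqueue core
	 * picks any online CPU to run the dispatch.
	 */
	static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx,
						bool async, unsigned long msecs)
	{
		if (!async &&
		    cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
			__blk_mq_run_hw_queue(hctx);	/* run in place */
			return;
		}

		kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
					    &hctx->run_work,
					    msecs_to_jiffies(msecs));
	}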


Thanks,
Ming


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] blk-mq: not deactivate hctx if the device doesn't use managed irq
  2021-06-30  0:32       ` Ming Lei
@ 2021-06-30  9:25         ` John Garry
  -1 siblings, 0 replies; 50+ messages in thread
From: John Garry @ 2021-06-30  9:25 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Daniel Wagner, Wen Xiong

On 30/06/2021 01:32, Ming Lei wrote:
>> Many block drivers don't use managed interrupts - to be proper, why not set
>> this everywhere (which doesn't use managed interrupts)? I know why, but it's
>> odd.
> It is just one small optimization in the slow path for other drivers; not sure
> these drivers are interested in such a change. It only serves as a fix for
> callers of blk_mq_alloc_request_hctx().
> 
> Anyway, we can document the situation.
> 
>> As an alternative, if the default queue mapping was used (in
>> blk_mq_map_queues()), then that's the same thing as
>> BLK_MQ_F_NOT_USE_MANAGED_IRQ in reality, right? If so, could we
>> alternatively check for that somehow?
> This way can't work; for example, in NVMe PCI managed irq is used for
> the default/write queues, but the poll queues use blk_mq_map_queues().
> 
> Also it can't cover all cases, such as NVMe RDMA.
> 
> Using managed irq or not is the driver's choice, and I'm not sure it is
> good for the block layer to provide such an abstraction. I'd still suggest
> fixing the issue by passing one flag, given we don't need it everywhere so far.

Sure, but I am just trying to suggest a more automatic way of passing 
this info. Like, for example, if we use blk_mq_pci_map_queues() on the 
qmap, then you probably want blk-mq managed interrupt support (for that 
qmap), i.e. CPU hotplug handlers.
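
Something along these lines (hypothetical sketch; the 'uses_managed_irq'
field is invented for illustration):

	/*
	 * Hypothetical: record per qmap whether the mapping came from a
	 * managed-irq aware helper, and let blk-mq derive the hotplug
	 * draining behaviour from that instead of a tag_set flag.
	 */
	struct blk_mq_queue_map {
		/* ... existing fields ... */
		bool uses_managed_irq;	/* invented field: set by
					 * blk_mq_pci_map_queues(),
					 * cleared by blk_mq_map_queues() */
	};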

Thanks,
John

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx
  2021-06-30  8:42     ` Ming Lei
@ 2021-06-30  9:43       ` Hannes Reinecke
  -1 siblings, 0 replies; 50+ messages in thread
From: Hannes Reinecke @ 2021-06-30  9:43 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry

On 6/30/21 10:42 AM, Ming Lei wrote:
> On Wed, Jun 30, 2021 at 10:18:37AM +0200, Hannes Reinecke wrote:
>> On 6/29/21 9:49 AM, Ming Lei wrote:
>>> Hi,
>>>
>>> blk_mq_alloc_request_hctx() is used by NVMe fc/rdma/tcp/loop to connect
>>> io queue. Also the sw ctx is chosen as the 1st online cpu in hctx->cpumask.
>>> However, all cpus in hctx->cpumask may be offline.
>>>
>>> This usage model isn't well supported by blk-mq which supposes allocator is
>>> always done on one online CPU in hctx->cpumask. This assumption is
>>> related with managed irq, which also requires blk-mq to drain inflight
>>> request in this hctx when the last cpu in hctx->cpumask is going to
>>> offline.
>>>
>>> However, NVMe fc/rdma/tcp/loop don't use managed irq, so we should allow
>>> them to ask for request allocation when the specified hctx is inactive
>>> (all cpus in hctx->cpumask are offline).
>>>
>>> Fix blk_mq_alloc_request_hctx() by adding/passing flag of
>>> BLK_MQ_F_NOT_USE_MANAGED_IRQ.
>>>
>>>
>>> Ming Lei (2):
>>>     blk-mq: not deactivate hctx if the device doesn't use managed irq
>>>     nvme: pass BLK_MQ_F_NOT_USE_MANAGED_IRQ for fc/rdma/tcp/loop
>>>
>>>    block/blk-mq.c             | 6 +++++-
>>>    drivers/nvme/host/fc.c     | 3 ++-
>>>    drivers/nvme/host/rdma.c   | 3 ++-
>>>    drivers/nvme/host/tcp.c    | 3 ++-
>>>    drivers/nvme/target/loop.c | 3 ++-
>>>    include/linux/blk-mq.h     | 1 +
>>>    6 files changed, 14 insertions(+), 5 deletions(-)
>>>
>>> Cc: Sagi Grimberg <sagi@grimberg.me>
>>> Cc: Daniel Wagner <dwagner@suse.de>
>>> Cc: Wen Xiong <wenxiong@us.ibm.com>
>>> Cc: John Garry <john.garry@huawei.com>
>>>
>>>
>> I have my misgivings about this patchset.
>> To my understanding, only CPUs present in the hctx cpumask are eligible to
>> submit I/O to that hctx.
> 
> That is only true for managed irq, and strictly speaking it is the online
> CPUs in the mask.
> 
> However, there is no such constraint for non-managed irq, since the irq may
> migrate to other online CPUs if all CPUs in the irq's current affinity
> become offline.
> 

But there shouldn't be any I/O pending during CPU offline (cf 
blk_mq_hctx_notify_offline()), so no interrupts should be triggered, 
either, no?

>> Consequently, if all cpus in that mask are offline, what is the point of
>> even transmitting a 'connect' request?
> 
> nvmef requires the connect request to be submitted via one specified hctx
> whose index has to match the io queue's index.
> 
> Almost all nvmef drivers fail to set up the controller in case of a
> connect io queue error.
> 

And I would prefer to fix that, namely allowing blk-mq to run on a 
sparse set of io queues.
The remaining io queues can be connected once the first cpu in the hctx 
cpumask is onlined; we already have blk_mq_hctx_notify_online(), which 
could easily be expanded to connect to relevant I/O queue...
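
Something like this (hypothetical sketch; the ->connect_hctx() hook is
invented, blk_mq_hctx_notify_online() itself already exists):

	/*
	 * Hypothetical: connect the io queue lazily when the first CPU
	 * of its hctx comes online.
	 */
	static int blk_mq_hctx_notify_online(unsigned int cpu,
					     struct hlist_node *node)
	{
		struct blk_mq_hw_ctx *hctx =
			hlist_entry_safe(node, struct blk_mq_hw_ctx,
					 cpuhp_online);

		if (cpu == cpumask_first_and(hctx->cpumask, cpu_online_mask) &&
		    hctx->queue->mq_ops->connect_hctx)	/* invented hook */
			hctx->queue->mq_ops->connect_hctx(hctx);
		return 0;
	}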

> Also, CPUs can go offline & online, especially in lots of sanity tests.
> 

True, but then again all I/O on the hctx should be quiesced during cpu 
offline.

> So we should allow the connect request to be allocated successfully and
> submitted to drivers, given that this is allowed for non-managed irq.
> 

I'd rather not do this, as the 'connect' command runs on the 'normal' 
I/O tagset, and hence runs into the risk of being issued against 
non-existing CPUs.

>> Shouldn't we rather modify the tagset to refer to the currently online
>> CPUs _only_, thereby never submitting a connect request for an hctx with
>> only offline CPUs?
> 
> Then you may set up very few io queues, and performance may suffer even
> though lots of CPUs come online later.
> 
Only if we stay with the reduced number of I/O queues. Which is not what 
I'm proposing; I'd rather prefer to connect and disconnect queues from 
the cpu hotplug handler. For starters we could even trigger a reset once 
the first cpu within a hctx is onlined.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx
  2021-06-30  9:43       ` Hannes Reinecke
@ 2021-06-30  9:53         ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2021-06-30  9:53 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Sagi Grimberg, Daniel Wagner, Wen Xiong, John Garry

On Wed, Jun 30, 2021 at 11:43:41AM +0200, Hannes Reinecke wrote:
> On 6/30/21 10:42 AM, Ming Lei wrote:
> > On Wed, Jun 30, 2021 at 10:18:37AM +0200, Hannes Reinecke wrote:
> > > On 6/29/21 9:49 AM, Ming Lei wrote:
> > > > Hi,
> > > > 
> > > > blk_mq_alloc_request_hctx() is used by NVMe fc/rdma/tcp/loop to connect
> > > > io queue. Also the sw ctx is chosen as the 1st online cpu in hctx->cpumask.
> > > > However, all cpus in hctx->cpumask may be offline.
> > > > 
> > > > This usage model isn't well supported by blk-mq which supposes allocator is
> > > > always done on one online CPU in hctx->cpumask. This assumption is
> > > > related with managed irq, which also requires blk-mq to drain inflight
> > > > request in this hctx when the last cpu in hctx->cpumask is going to
> > > > offline.
> > > > 
> > > > However, NVMe fc/rdma/tcp/loop don't use managed irq, so we should allow
> > > > them to ask for request allocation when the specified hctx is inactive
> > > > (all cpus in hctx->cpumask are offline).
> > > > 
> > > > Fix blk_mq_alloc_request_hctx() by adding/passing flag of
> > > > BLK_MQ_F_NOT_USE_MANAGED_IRQ.
> > > > 
> > > > 
> > > > Ming Lei (2):
> > > >     blk-mq: not deactivate hctx if the device doesn't use managed irq
> > > >     nvme: pass BLK_MQ_F_NOT_USE_MANAGED_IRQ for fc/rdma/tcp/loop
> > > > 
> > > >    block/blk-mq.c             | 6 +++++-
> > > >    drivers/nvme/host/fc.c     | 3 ++-
> > > >    drivers/nvme/host/rdma.c   | 3 ++-
> > > >    drivers/nvme/host/tcp.c    | 3 ++-
> > > >    drivers/nvme/target/loop.c | 3 ++-
> > > >    include/linux/blk-mq.h     | 1 +
> > > >    6 files changed, 14 insertions(+), 5 deletions(-)
> > > > 
> > > > Cc: Sagi Grimberg <sagi@grimberg.me>
> > > > Cc: Daniel Wagner <dwagner@suse.de>
> > > > Cc: Wen Xiong <wenxiong@us.ibm.com>
> > > > Cc: John Garry <john.garry@huawei.com>
> > > > 
> > > > 
> > > I have my misgivings about this patchset.
> > > To my understanding, only CPUs present in the hctx cpumask are eligible to
> > > submit I/O to that hctx.
> > 
> > That is only true for managed irq, and strictly speaking it is the online
> > CPUs in the mask.
> > 
> > However, there is no such constraint for non-managed irq, since the irq may
> > migrate to other online CPUs if all CPUs in the irq's current affinity
> > become offline.
> > 
> 
> But there shouldn't be any I/O pending during CPU offline (cf
> blk_mq_hctx_notify_offline()), so no interrupts should be triggered, either,
> no?
> 
> > > Consequently, if all cpus in that mask are offline, what is the point of
> > > even transmitting a 'connect' request?
> > 
> > nvmef requires the connect request to be submitted via one specified hctx
> > whose index has to match the io queue's index.
> > 
> > Almost all nvmef drivers fail to set up the controller in case of a
> > connect io queue error.
> > 
> 
> And I would prefer to fix that, namely allowing blk-mq to run on a sparse
> set of io queues.
> The remaining io queues can be connected once the first cpu in the hctx
> cpumask is onlined; we already have blk_mq_hctx_notify_online(), which could
> easily be expanded to connect to relevant I/O queue...

Then you need a big patchset for doing that.

> 
> > Also, CPUs can go offline & online, especially in lots of sanity tests.
> > 
> 
> True, but then again all I/O on the hctx should be quiesced during cpu
> offline.

Again that is only necessary for managed irq.

> 
> > So we should allow the connect request to be allocated successfully and
> > submitted to drivers, given that this is allowed for non-managed irq.
> > 
> 
> I'd rather not do this, as the 'connect' command runs on the 'normal' I/O
> tagset, and hence runs into the risk of being issued against non-existing
> CPUs.

Can you explain what the risk is?

> 
> > > Shouldn't we rather modify the tagset to refer to the currently online
> > > CPUs _only_, thereby never submitting a connect request for an hctx with
> > > only offline CPUs?
> > 
> > Then you may set up very few io queues, and performance may suffer even
> > though lots of CPUs come online later.
> > 
> Only if we stay with the reduced number of I/O queues. Which is not what I'm
> proposing; I'd rather prefer to connect and disconnect queues from the cpu
> hotplug handler. For starters we could even trigger a reset once the first
> cpu within a hctx is onlined.

Yeah, that needs one big/complicated patchset, but I don't see any advantages
over this simple approach.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] blk-mq: not deactivate hctx if the device doesn't use managed irq
  2021-06-29 23:30     ` Damien Le Moal
@ 2021-06-30 18:58       ` Sagi Grimberg
  -1 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2021-06-30 18:58 UTC (permalink / raw)
  To: Damien Le Moal, Ming Lei, Jens Axboe, linux-block, linux-nvme,
	Christoph Hellwig
  Cc: Daniel Wagner, Wen Xiong, John Garry


>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>> index 21140132a30d..600c5dd1a069 100644
>> --- a/include/linux/blk-mq.h
>> +++ b/include/linux/blk-mq.h
>> @@ -403,6 +403,7 @@ enum {
>>   	 */
>>   	BLK_MQ_F_STACKING	= 1 << 2,
>>   	BLK_MQ_F_TAG_HCTX_SHARED = 1 << 3,
>> +	BLK_MQ_F_NOT_USE_MANAGED_IRQ = 1 << 4,
> 
> My 2 cents: BLK_MQ_F_NO_MANAGED_IRQ may be a better/shorter name.

Maybe BLK_MQ_F_UNMANAGED_IRQ?
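
Whatever the name ends up being, driver usage stays a one-liner (sketch
based on the cover letter's description, flag name as currently posted):

	/* sketch: an nvmef transport opting out of managed-irq draining */
	ctrl->tag_set.flags = BLK_MQ_F_SHOULD_MERGE |
			      BLK_MQ_F_NOT_USE_MANAGED_IRQ;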

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx
  2021-06-30  9:53         ` Ming Lei
@ 2021-06-30 18:59           ` Sagi Grimberg
  -1 siblings, 0 replies; 50+ messages in thread
From: Sagi Grimberg @ 2021-06-30 18:59 UTC (permalink / raw)
  To: Ming Lei, Hannes Reinecke
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Daniel Wagner, Wen Xiong, John Garry


>>>> Shouldn't we rather modify the tagset to refer to the currently online
>>>> CPUs _only_, thereby never submitting a connect request for an hctx with
>>>> only offline CPUs?
>>>
>>> Then you may set up very few io queues, and performance may suffer even
>>> though lots of CPUs come online later.
>>>
>> Only if we stay with the reduced number of I/O queues. Which is not what I'm
>> proposing; I'd rather prefer to connect and disconnect queues from the cpu
>> hotplug handler. For starters we could even trigger a reset once the first
>> cpu within a hctx is onlined.
> 
> Yeah, that needs one big/complicated patchset, but I don't see any advantages
> over this simple approach.

I tend to agree with Ming here.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx
  2021-06-30 18:59           ` Sagi Grimberg
@ 2021-06-30 19:46             ` Hannes Reinecke
  -1 siblings, 0 replies; 50+ messages in thread
From: Hannes Reinecke @ 2021-06-30 19:46 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei
  Cc: Jens Axboe, linux-block, linux-nvme, Christoph Hellwig,
	Daniel Wagner, Wen Xiong, John Garry

On 6/30/21 8:59 PM, Sagi Grimberg wrote:
> 
>>>>> Shouldn't we rather modify the tagset to refer to the currently 
>>>>> online
>>>>> CPUs _only_, thereby never submitting a connect request for an hctx 
>>>>> with only offline CPUs?
>>>>
>>>> Then you may set up very few io queues, and performance may suffer even
>>>> though lots of CPUs come online later.
>>>>
>>> Only if we stay with the reduced number of I/O queues. Which is not 
>>> what I'm
>>> proposing; I'd rather prefer to connect and disconnect queues from 
>>> the cpu
>>> hotplug handler. For starters we could even trigger a reset once the 
>>> first
>>> cpu within a hctx is onlined.
>>
>> Yeah, that needs one big/complicated patchset, but I don't see any advantages
>> over this simple approach.
> 
> I tend to agree with Ming here.

Actually, Daniel and I came up with a slightly different idea: use a cpu 
hotplug notifier.
Thing is, blk-mq already has a cpu hotplug notifier, which should ensure 
that no I/O is pending during cpu hotplug.
If we now add an nvme cpu hotplug notifier which essentially kicks off a 
reset once all cpus in a hctx are offline, the reset logic will rearrange 
the queues to match the current cpu layout.
And when the cpus are getting onlined we'll do another reset.

Daniel is currently preparing a patch; let's see how it goes.
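
In sketch form (callback name invented; not Daniel's actual patch):

	static int nvme_cpu_up(unsigned int cpu)
	{
		/* walk the controllers and schedule a reset for any
		 * controller whose queue layout no longer matches the
		 * online CPU mask (controller iteration elided) */
		return 0;
	}

	/* registration, e.g. from nvme core module init: */
	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "nvme:online",
				nvme_cpu_up,	/* cpu onlined */
				NULL);		/* or a teardown callback */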

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] blk-mq: not deactivate hctx if the device doesn't use managed irq
  2021-06-30 18:58       ` Sagi Grimberg
@ 2021-06-30 21:57         ` Damien Le Moal
  -1 siblings, 0 replies; 50+ messages in thread
From: Damien Le Moal @ 2021-06-30 21:57 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei, Jens Axboe, linux-block, linux-nvme,
	Christoph Hellwig
  Cc: Daniel Wagner, Wen Xiong, John Garry

On 2021/07/01 3:58, Sagi Grimberg wrote:
> 
>>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
>>> index 21140132a30d..600c5dd1a069 100644
>>> --- a/include/linux/blk-mq.h
>>> +++ b/include/linux/blk-mq.h
>>> @@ -403,6 +403,7 @@ enum {
>>>   	 */
>>>   	BLK_MQ_F_STACKING	= 1 << 2,
>>>   	BLK_MQ_F_TAG_HCTX_SHARED = 1 << 3,
>>> +	BLK_MQ_F_NOT_USE_MANAGED_IRQ = 1 << 4,
>>
>> My 2 cents: BLK_MQ_F_NO_MANAGED_IRQ may be a better/shorter name.
> 
> Maybe BLK_MQ_F_UNMANAGED_IRQ?

Even better :)


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx
  2021-06-30 19:46             ` Hannes Reinecke
@ 2021-06-30 23:59               ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2021-06-30 23:59 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, Jens Axboe, linux-block, linux-nvme,
	Christoph Hellwig, Daniel Wagner, Wen Xiong, John Garry

On Wed, Jun 30, 2021 at 09:46:35PM +0200, Hannes Reinecke wrote:
> On 6/30/21 8:59 PM, Sagi Grimberg wrote:
> > 
> > > > > > Shouldn't we rather modify the tagset to refer to
> > > > > > the currently online
> > > > > > CPUs _only_, thereby never submitting a connect request for an hctx
> > > > > > with only offline CPUs?
> > > > > 
> > > > > Then you may set up very few io queues, and performance may suffer even
> > > > > though lots of CPUs come online later.
> > > > > 
> > > > Only if we stay with the reduced number of I/O queues. Which is
> > > > not what I'm
> > > > proposing; I'd rather prefer to connect and disconnect queues
> > > > from the cpu
> > > > hotplug handler. For starters we could even trigger a reset once
> > > > the first
> > > > cpu within a hctx is onlined.
> > > 
> > > Yeah, that needs one big/complicated patchset, but I don't see any advantages
> > > over this simple approach.
> > 
> > I tend to agree with Ming here.
> 
> Actually, Daniel and I came up with a slightly different idea: use a cpu
> hotplug notifier.
> Thing is, blk-mq already has a cpu hotplug notifier, which should ensure
> that no I/O is pending during cpu hotplug.

Why should we ensure that for non-managed irq?

> If we now add an nvme cpu hotplug notifier which essentially kicks off a
> reset once all cpus in a hctx are offline, the reset logic will rearrange
> the queues to match the current cpu layout.
> And when the cpus are getting onlined we'll do another reset.
> 
> Daniel is currently preparing a patch; let's see how it goes.

What is the advantage of that big change over this simple way?

Thanks, 
Ming


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx
  2021-06-30 23:59               ` Ming Lei
@ 2021-07-01  8:00                 ` Hannes Reinecke
  -1 siblings, 0 replies; 50+ messages in thread
From: Hannes Reinecke @ 2021-07-01  8:00 UTC (permalink / raw)
  To: Ming Lei
  Cc: Sagi Grimberg, Jens Axboe, linux-block, linux-nvme,
	Christoph Hellwig, Daniel Wagner, Wen Xiong, John Garry

On 7/1/21 1:59 AM, Ming Lei wrote:
> On Wed, Jun 30, 2021 at 09:46:35PM +0200, Hannes Reinecke wrote:
>> On 6/30/21 8:59 PM, Sagi Grimberg wrote:
>>>
>>>>>>> Shouldn't we rather modify the tagset to refer to
>>>>>>> the currently online
>>>>>>> CPUs _only_, thereby never submitting a connect request for an hctx
>>>>>>> with only offline CPUs?
>>>>>>
>>>>>> Then you may set up very few io queues, and performance may suffer even
>>>>>> though lots of CPUs come online later.
>>>>>>
>>>>> Only if we stay with the reduced number of I/O queues. Which is
>>>>> not what I'm
>>>>> proposing; I'd rather prefer to connect and disconnect queues
>>>>> from the cpu
>>>>> hotplug handler. For starters we could even trigger a reset once
>>>>> the first
>>>>> cpu within a hctx is onlined.
>>>>
>>>> Yeah, that needs one big/complicated patchset, but I don't see any advantages
>>>> over this simple approach.
>>>
>>> I tend to agree with Ming here.
>>
>> Actually, Daniel and I came up with a slightly different idea: use a cpu
>> hotplug notifier.
>> Thing is, blk-mq already has a cpu hotplug notifier, which should ensure
>> that no I/O is pending during cpu hotplug.
> 
> Why should we ensure that for non-managed irq?
> 

While not strictly necessary, it does align the hctx layout with the 
current CPU topology.
As such we won't have any 'stale' CPUs or queues in the hctx layout, and 
with that we avoid any issues we would otherwise have due to inactive 
CPUs in the cpumask.

>> If we now add an nvme cpu hotplug notifier which essentially kicks off a
>> reset once all cpus in a hctx are offline, the reset logic will rearrange
>> the queues to match the current cpu layout.
>> And when the cpus are getting onlined we'll do another reset.
>>
>> Daniel is currently preparing a patch; let's see how it goes.
> 
> What is the advantage of that big change over this simple way?
> 

With the simple way we might (and, as the results show, do) run 
nvme_reset_ctrl() in parallel to CPU hotplug.
This leads to quite some complexity, and as we've seen it triggers 
quite some race conditions.

Hence I do think we need to synchronize nvme_reset_ctrl() with CPU 
hotplug, to ensure that the reset handler has completed before the cpu is 
completely torn down.

But synchronizing with nvme_reset_ctrl() means issuing a flush_work(); 
and if we do that we might as well go full circle and call 
nvme_reset_ctrl_sync(). And as this will rearrange the hctx to match the 
current CPU topology we don't need to fiddle with it in the hotplug 
handler, and can leave everything to nvme_reset_ctrl().
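
For reference, nvme_reset_ctrl_sync() is essentially (from memory,
simplified):

	int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl)
	{
		int ret;

		ret = nvme_reset_ctrl(ctrl);	/* queue ctrl->reset_work */
		if (!ret)
			flush_work(&ctrl->reset_work);
		return ret;
	}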

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx
  2021-07-01  8:00                 ` Hannes Reinecke
@ 2021-07-01  9:13                   ` Ming Lei
  -1 siblings, 0 replies; 50+ messages in thread
From: Ming Lei @ 2021-07-01  9:13 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, Jens Axboe, linux-block, linux-nvme,
	Christoph Hellwig, Daniel Wagner, Wen Xiong, John Garry

On Thu, Jul 01, 2021 at 10:00:27AM +0200, Hannes Reinecke wrote:
> On 7/1/21 1:59 AM, Ming Lei wrote:
> > On Wed, Jun 30, 2021 at 09:46:35PM +0200, Hannes Reinecke wrote:
> > > On 6/30/21 8:59 PM, Sagi Grimberg wrote:
> > > > 
> > > > > > > > Shouldn't we rather modify the tagset to refer to
> > > > > > > > the currently online
> > > > > > > > CPUs _only_, thereby never submitting a connect request for an hctx
> > > > > > > > with only offline CPUs?
> > > > > > > 
> > > > > > > Then you may set up very few io queues, and performance may suffer even
> > > > > > > though lots of CPUs come online later.
> > > > > > > 
> > > > > > Only if we stay with the reduced number of I/O queues, which is
> > > > > > not what I'm proposing; I'd prefer to connect and disconnect
> > > > > > queues from the cpu hotplug handler. For starters we could even
> > > > > > trigger a reset once the first cpu within a hctx is onlined.
> > > > > 
> > > > > Yeah, that needs one big/complicated patchset, but I don't see any
> > > > > advantages over this simple approach.
> > > > 
> > > > I tend to agree with Ming here.
> > > 
> > > Actually, Daniel and I came up with a slightly different idea: use a cpu
> > > hotplug notifier.
> > > Thing is, blk-mq already has a cpu hotplug notifier, which should ensure
> > > that no I/O is pending during cpu hotplug.
> > 
> > Why should we ensure that for non-managed irq?
> > 
> 
> While not strictly necessary, it does align the hctx layout with the
> current CPU topology.
> As such we won't have any 'stale' CPUs or queues in the hctx layout, and
> thereby avoid any issues arising from inactive CPUs in the cpumask.

We know exactly how non-managed interrupts behave; can you share any
concrete issue you want to avoid?

> 
> > > If we now add an nvme cpu hotplug notifier which essentially kicks off a
> > > reset once all cpus in a hctx are offline, the reset logic will rearrange
> > > the queues to match the current cpu layout.
> > > And when the cpus are getting onlined we'll do another reset.
> > > 
> > > Daniel is currently preparing a patch; let's see how it goes.
> > 
> > What is the advantage of that big change over this simple way?
> > 
> 
> With the simple way we might (and, as the results show, do) run
> nvme_reset_ctrl() in parallel with CPU hotplug.
> This adds quite some complexity and, as we've seen, triggers quite a few
> race conditions.

CPU hotplug isn't related to nvme reset at all; reset is just for
recovering the nvme controller and making it work again. How can a cpu
going offline affect the nvme controller?

For NVMe PCI, blk-mq has drained any inflight requests before the hctx
becomes inactive.

For other NVMe controllers which don't use managed irqs, inflight
requests can still be completed via the original interrupt or the polling
task, both of which can be migrated to a new online cpu.
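
(That is also all patch 1 needs to do; a hedged sketch of the check it
adds to blk-mq's offline handler, with the flag name from the cover
letter and the surrounding details elided, so read it as an illustration
rather than the literal patch:)

	static int blk_mq_hctx_notify_offline(unsigned int cpu,
					      struct hlist_node *node)
	{
		struct blk_mq_hw_ctx *hctx =
			hlist_entry_safe(node, struct blk_mq_hw_ctx, cpuhp_online);

		/*
		 * Without managed irqs there is nothing to drain: the queue's
		 * interrupt (or polling task) simply migrates to an online cpu.
		 */
		if (hctx->flags & BLK_MQ_F_NOT_USE_MANAGED_IRQ)
			return 0;

		/* ... existing code: mark the hctx inactive and drain it ... */
		return 0;
	}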

I don't know what exact race conditions you are talking about. Please
explain them in detail, otherwise it just wastes our time.

> 
> Hence I do think we need to synchronize nvme_reset_ctrl() with CPU hotplug,
> to ensure that the reset handler has completed before the cpu is completely
> torn down.

No, you haven't provided any proof of the necessity of adding that
synchronization.



Thanks, 
Ming


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] blk-mq: not deactivate hctx if the device doesn't use managed irq
  2021-06-30  0:32       ` Ming Lei
@ 2021-07-01  9:52         ` Christoph Hellwig
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Hellwig @ 2021-07-01  9:52 UTC (permalink / raw)
  To: Ming Lei
  Cc: John Garry, Jens Axboe, linux-block, linux-nvme,
	Christoph Hellwig, Sagi Grimberg, Daniel Wagner, Wen Xiong

On Wed, Jun 30, 2021 at 08:32:46AM +0800, Ming Lei wrote:
> Indeed, we may replace BLK_MQ_F_STACKING with BLK_MQ_F_NOT_USE_MANAGED_IRQ
> (BLK_MQ_F_NO_MANAGED_IRQ) too, and the latter is a general solution.

And now turn the polarity around and make it a BLK_MQ_F_MANAGED_IRQ flag,
set by the map_queues helpers that rely on managed irqs.
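
(A hypothetical sketch of that inversion; none of this is upstream code,
and the foo_* names are made up. The point is that only a queue mapping
derived from managed irq affinity carries the flag, so unmanaged drivers
need no annotation at all:)

	/* Inverted polarity: only managed-irq users opt in. */
	#define BLK_MQ_F_MANAGED_IRQ	(1 << 4)

	static int foo_map_queues(struct blk_mq_tag_set *set)
	{
		struct foo_dev *fdev = set->driver_data;	/* hypothetical driver */

		/* The mapping follows the managed irq affinity, so advertise it. */
		set->flags |= BLK_MQ_F_MANAGED_IRQ;
		return blk_mq_pci_map_queues(&set->map[HCTX_TYPE_DEFAULT],
					     fdev->pdev, 0);
	}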

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 1/2] blk-mq: not deactivate hctx if the device doesn't use managed irq
  2021-06-30 21:57         ` Damien Le Moal
@ 2021-07-01 14:20           ` Keith Busch
  -1 siblings, 0 replies; 50+ messages in thread
From: Keith Busch @ 2021-07-01 14:20 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Sagi Grimberg, Ming Lei, Jens Axboe, linux-block, linux-nvme,
	Christoph Hellwig, Daniel Wagner, Wen Xiong, John Garry

On Wed, Jun 30, 2021 at 09:57:05PM +0000, Damien Le Moal wrote:
> On 2021/07/01 3:58, Sagi Grimberg wrote:
> > 
> >>> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> >>> index 21140132a30d..600c5dd1a069 100644
> >>> --- a/include/linux/blk-mq.h
> >>> +++ b/include/linux/blk-mq.h
> >>> @@ -403,6 +403,7 @@ enum {
> >>>   	 */
> >>>   	BLK_MQ_F_STACKING	= 1 << 2,
> >>>   	BLK_MQ_F_TAG_HCTX_SHARED = 1 << 3,
> >>> +	BLK_MQ_F_NOT_USE_MANAGED_IRQ = 1 << 4,
> >>
> >> My 2 cents: BLK_MQ_F_NO_MANAGED_IRQ may be a better/shorter name.
> > 
> > Maybe BLK_MQ_F_UNMANAGED_IRQ?
> 
> Even better :)

Since most drivers are unmanaged, shouldn't that be the default? The
managed irq drivers should get to set a flag instead.
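
(Roughly, a hypothetical before/after of the tag_set setup under that
default flip; the flag names are illustrative only:)

	/* today (this series): every unmanaged driver must annotate itself */
	ctrl->tag_set.flags = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_NOT_USE_MANAGED_IRQ;

	/* proposed: unmanaged is the default, managed-irq drivers opt in */
	dev->tagset.flags   = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_MANAGED_IRQ;	/* nvme-pci */
	ctrl->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;				/* nvme-tcp etc. */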

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx
  2021-06-30 19:46             ` Hannes Reinecke
@ 2021-07-02  9:47               ` Daniel Wagner
  -1 siblings, 0 replies; 50+ messages in thread
From: Daniel Wagner @ 2021-07-02  9:47 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Sagi Grimberg, Ming Lei, Jens Axboe, linux-block, linux-nvme,
	Christoph Hellwig, Wen Xiong, John Garry

On Wed, Jun 30, 2021 at 09:46:35PM +0200, Hannes Reinecke wrote:
> Actually, Daniel and I came up with a slightly different idea: use a cpu
> hotplug notifier.
> Thing is, blk-mq already has a cpu hotplug notifier, which should ensure
> that no I/O is pending during cpu hotplug.
> If we now add an nvme cpu hotplug notifier which essentially kicks off a
> reset once all cpus in a hctx are offline, the reset logic will rearrange
> the queues to match the current cpu layout.
> And when the cpus are getting onlined we'll do another reset.
> 
> Daniel is currently preparing a patch; let's see how it goes.

FTR, I can't say it was a huge success. I observed a real problem with
this idea and a 'maybe' problem. First, we can't use nvme_reset_ctrl_sync()
in the cpuhp path, as it blocks in various places on percpu_rwsem_wait
operations [1]; presumably the flushed reset work itself ends up needing
the CPU hotplug lock. Not really a surprise, I'd say.

The 'maybe' problem: I retried the test with the asynchronous
nvme_reset_ctrl(), and this worked for a while (a couple of hours of FC
port toggling with stress cpu hotplugging). Eventually, though, the
system stopped responding and died. Unfortunately, I don't have any core
dump, so it's not totally clear what happened.

BTW, I used the same test setup with Ming's patches on top of the two of
mine. The system was still working fine after 4 hours of the stress
testing.

[1] https://paste.opensuse.org/view/raw/62287389

For the fun of it, here is the cpuhp notifier patch:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 66973bb56305..444756ba8dbe 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -20,6 +20,7 @@
 #include <linux/ptrace.h>
 #include <linux/nvme_ioctl.h>
 #include <linux/pm_qos.h>
+#include <linux/cpuhotplug.h>
 #include <asm/unaligned.h>
 
 #include "nvme.h"
@@ -4245,6 +4246,8 @@ EXPORT_SYMBOL_GPL(nvme_start_ctrl);
 
 void nvme_uninit_ctrl(struct nvme_ctrl *ctrl)
 {
+	cpuhp_state_remove_instance_nocalls(CPUHP_AP_NVME_QUEUE_ONLINE,
+					    &ctrl->cpuhp_node);
 	nvme_hwmon_exit(ctrl);
 	nvme_fault_inject_fini(&ctrl->fault_inject);
 	dev_pm_qos_hide_latency_tolerance(ctrl->device);
@@ -4357,6 +4360,13 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	if (ret)
 		goto out_free_name;
 
+	ret = cpuhp_state_add_instance_nocalls(CPUHP_AP_NVME_QUEUE_ONLINE,
+					       &ctrl->cpuhp_node);
+	if (ret) {
+		pr_debug("failed to add cpuhp notifier: %d\n", ret);
+		goto out_del_device;
+	}
+
 	/*
 	 * Initialize latency tolerance controls.  The sysfs files won't
 	 * be visible to userspace unless the device actually supports APST.
@@ -4369,6 +4379,8 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	nvme_mpath_init_ctrl(ctrl);
 
 	return 0;
+out_del_device:
+	cdev_device_del(&ctrl->cdev, ctrl->device);
 out_free_name:
 	nvme_put_ctrl(ctrl);
 	kfree_const(ctrl->device->kobj.name);
@@ -4529,6 +4541,80 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_directive_cmd) != 64);
 }
 
+static int nvme_notify_online(unsigned int cpu, struct hlist_node *node)
+{
+	struct nvme_ctrl *ctrl = hlist_entry_safe(node,
+			struct nvme_ctrl, cpuhp_node);
+	struct nvme_ns *ns;
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+	bool present = false;
+
+	pr_warn("NQN %s CPU %u online\n", ctrl->opts->subsysnqn, cpu);
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		queue_for_each_hw_ctx(ns->disk->queue, hctx, i) {
+			if (cpumask_test_cpu(cpu, hctx->cpumask)) {
+				present = true;
+				break;
+			}
+		}
+		if (present)
+			break;
+	}
+	up_read(&ctrl->namespaces_rwsem);
+	if (present) {
+		pr_warn("trigger reset...\n");
+		nvme_reset_ctrl_sync(ctrl);
+		pr_warn("executed\n");
+	}
+
+	return 0;
+}
+
+static inline bool nvme_last_cpu_in_hctx(unsigned int cpu,
+					struct blk_mq_hw_ctx *hctx)
+{
+	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
+		return false;
+	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
+		return false;
+	return true;
+}
+
+static int nvme_notify_offline(unsigned int cpu, struct hlist_node *node)
+{
+	struct nvme_ctrl *ctrl = hlist_entry_safe(node,
+			struct nvme_ctrl, cpuhp_node);
+	struct nvme_ns *ns;
+	struct blk_mq_hw_ctx *hctx;
+	unsigned int i;
+	bool offline = false;
+
+	pr_warn("NQN %s CPU %u offline\n", ctrl->opts->subsysnqn, cpu);
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list) {
+		queue_for_each_hw_ctx(ns->disk->queue, hctx, i) {
+			pr_warn("hctx %p cpumask %*pbl state 0x%lx\n",
+				hctx, cpumask_pr_args(hctx->cpumask), hctx->state);
+			if (nvme_last_cpu_in_hctx(cpu, hctx)) {
+				offline = true;
+				break;
+			}
+		}
+		if (offline)
+			break;
+	}
+	up_read(&ctrl->namespaces_rwsem);
+	if (offline) {
+		pr_warn("trigger reset...\n");
+		nvme_reset_ctrl_sync(ctrl);
+		pr_warn("executed\n");
+	}
+	return 0;
+}
 
 static int __init nvme_core_init(void)
 {
@@ -4580,8 +4666,17 @@ static int __init nvme_core_init(void)
 		goto unregister_generic_ns;
 	}
 
+	result = cpuhp_setup_state_multi(CPUHP_AP_NVME_QUEUE_ONLINE,
+					 "nvme:online",
+					 nvme_notify_online,
+					 nvme_notify_offline);
+	if (result < 0)
+		goto destroy_chr_class;
+
 	return 0;
 
+destroy_chr_class:
+	class_destroy(nvme_ns_chr_class);
 unregister_generic_ns:
 	unregister_chrdev_region(nvme_ns_chr_devt, NVME_MINORS);
 destroy_subsys_class:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0015860ec12b..34676ff79d40 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -352,6 +352,8 @@ struct nvme_ctrl {
 	unsigned long discard_page_busy;
 
 	struct nvme_fault_inject fault_inject;
+
+	struct hlist_node cpuhp_node;
 };
 
 enum nvme_iopolicy {
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 4a62b3980642..5f6cf60bee1f 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -159,6 +159,7 @@ enum cpuhp_state {
 	CPUHP_AP_X86_VDSO_VMA_ONLINE,
 	CPUHP_AP_IRQ_AFFINITY_ONLINE,
 	CPUHP_AP_BLK_MQ_ONLINE,
+	CPUHP_AP_NVME_QUEUE_ONLINE,
 	CPUHP_AP_ARM_MVEBU_SYNC_CLOCKS,
 	CPUHP_AP_X86_INTEL_EPB_ONLINE,
 	CPUHP_AP_PERF_ONLINE,
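
(Side note on the cpumask idiom in nvme_last_cpu_in_hctx() above, since
it is easy to misread: cpumask_next_and(-1, ...) is the "first set bit"
form. An equivalent, arguably clearer, spelling would be:)

	/* True iff 'cpu' is the last cpu of hctx->cpumask that is still online. */
	return cpumask_first_and(hctx->cpumask, cpu_online_mask) == cpu &&
	       cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) >= nr_cpu_ids;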

^ permalink raw reply related	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2021-07-02  9:48 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-29  7:49 [PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx Ming Lei
2021-06-29  7:49 ` Ming Lei
2021-06-29  7:49 ` [PATCH 1/2] blk-mq: not deactivate hctx if the device doesn't use managed irq Ming Lei
2021-06-29  7:49   ` Ming Lei
2021-06-29 12:39   ` Hannes Reinecke
2021-06-29 12:39     ` Hannes Reinecke
2021-06-29 14:17     ` Ming Lei
2021-06-29 14:17       ` Ming Lei
2021-06-29 15:49   ` John Garry
2021-06-29 15:49     ` John Garry
2021-06-30  0:32     ` Ming Lei
2021-06-30  0:32       ` Ming Lei
2021-06-30  9:25       ` John Garry
2021-06-30  9:25         ` John Garry
2021-07-01  9:52       ` Christoph Hellwig
2021-07-01  9:52         ` Christoph Hellwig
2021-06-29 23:30   ` Damien Le Moal
2021-06-29 23:30     ` Damien Le Moal
2021-06-30 18:58     ` Sagi Grimberg
2021-06-30 18:58       ` Sagi Grimberg
2021-06-30 21:57       ` Damien Le Moal
2021-06-30 21:57         ` Damien Le Moal
2021-07-01 14:20         ` Keith Busch
2021-07-01 14:20           ` Keith Busch
2021-06-29  7:49 ` [PATCH 2/2] nvme: pass BLK_MQ_F_NOT_USE_MANAGED_IRQ for fc/rdma/tcp/loop Ming Lei
2021-06-29  7:49   ` Ming Lei
2021-06-30  8:15   ` Hannes Reinecke
2021-06-30  8:15     ` Hannes Reinecke
2021-06-30  8:47     ` Ming Lei
2021-06-30  8:47       ` Ming Lei
2021-06-30  8:18 ` [PATCH 0/2] blk-mq: fix blk_mq_alloc_request_hctx Hannes Reinecke
2021-06-30  8:18   ` Hannes Reinecke
2021-06-30  8:42   ` Ming Lei
2021-06-30  8:42     ` Ming Lei
2021-06-30  9:43     ` Hannes Reinecke
2021-06-30  9:43       ` Hannes Reinecke
2021-06-30  9:53       ` Ming Lei
2021-06-30  9:53         ` Ming Lei
2021-06-30 18:59         ` Sagi Grimberg
2021-06-30 18:59           ` Sagi Grimberg
2021-06-30 19:46           ` Hannes Reinecke
2021-06-30 19:46             ` Hannes Reinecke
2021-06-30 23:59             ` Ming Lei
2021-06-30 23:59               ` Ming Lei
2021-07-01  8:00               ` Hannes Reinecke
2021-07-01  8:00                 ` Hannes Reinecke
2021-07-01  9:13                 ` Ming Lei
2021-07-01  9:13                   ` Ming Lei
2021-07-02  9:47             ` Daniel Wagner
2021-07-02  9:47               ` Daniel Wagner
