[PATCH 0/2] Fix crash when rescan ns after set queue count cmd timeout

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] Fix crash when rescan ns after set queue count cmd timeout
@ 2021-08-03  9:06 Ruozhu Li
  2021-08-03  9:06 ` [PATCH 1/2] nvme-rdma: always try to configure io queue when user wants it Ruozhu Li
  2021-08-03  9:06 ` [PATCH 2/2] nvme: don't do scan work if io queue count is zero Ruozhu Li
  0 siblings, 2 replies; 9+ messages in thread
From: Ruozhu Li @ 2021-08-03  9:06 UTC (permalink / raw)
  To: linux-nvme

Hi,

We got a BUG_ON when rescan ns after set queue count cmd timeout: 
--
BUG_ON(hctx_idx >= ctrl->ctrl.queue_count); //nvme_rdma_init_hctx
--
Call trace:
nvme_rdma_init_hctx+0x58/0x60 [nvme_rdma]
blk_mq_realloc_hw_ctxs+0x140/0x4c0
blk_mq_init_allocated_queue+0x130/0x410
blk_mq_init_queue+0x40/0x88
nvme_validate_ns+0xb8/0x740
nvme_scan_work+0x29c/0x460
process_one_work+0x1f8/0x490
worker_thread+0x50/0x4b8
kthread+0x134/0x138
ret_from_fork+0x10/0x18
--
This happened because: 
1) Host set queue count feature timeout in reconnection, set ctrl->
queue_count to 1, and schedule another reconnect. 
2) Next reconnection succeed but not create any io queues, because
ctrl->queue_count set to 1, host won't configure io queue again.
3) Del/add ns on ctrl causes host rescan ns, kernel BUG_ON when detect
hctx_idx greater than ctrl->queue_count.

Try to fix it with following patches.Any comments and reviews are welcome.

Thanks,
Ruozhu

Ruozhu Li (2):
  nvme-rdma: always try to configure io queue when user wants it
  nvme: don't do scan work if io queue count is zero

 drivers/nvme/host/core.c | 6 ++++--
 drivers/nvme/host/rdma.c | 4 +++-
 2 files changed, 7 insertions(+), 3 deletions(-)

-- 
2.16.4


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/2] nvme-rdma: always try to configure io queue when user wants it
  2021-08-03  9:06 [PATCH 0/2] Fix crash when rescan ns after set queue count cmd timeout Ruozhu Li
@ 2021-08-03  9:06 ` Ruozhu Li
  2021-08-11  1:30   ` Sagi Grimberg
  2021-08-03  9:06 ` [PATCH 2/2] nvme: don't do scan work if io queue count is zero Ruozhu Li
  1 sibling, 1 reply; 9+ messages in thread
From: Ruozhu Li @ 2021-08-03  9:06 UTC (permalink / raw)
  To: linux-nvme

During reconnection, if set queue count feature failed, we will
set ctrl->queue_count to 1, and schedule another reconnection.But when
ctrl->queue_count is 1, we won't try to configure io queue again.

We should always try to configure io queues if user wants it.

Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
---
 drivers/nvme/host/rdma.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 7f6b3a991501..df4f872c05f0 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1089,6 +1089,7 @@ static int nvme_rdma_setup_ctrl(struct nvme_rdma_ctrl *ctrl, bool new)
 {
 	int ret;
 	bool changed;
+	struct nvmf_ctrl_options *opts = ctrl->ctrl.opts;
 
 	ret = nvme_rdma_configure_admin_queue(ctrl, new);
 	if (ret)
@@ -1121,7 +1122,8 @@ static int nvme_rdma_setup_ctrl(struct nvme_rdma_ctrl *ctrl, bool new)
 	if (ctrl->ctrl.sgls & (1 << 20))
 		ctrl->use_inline_data = true;
 
-	if (ctrl->ctrl.queue_count > 1) {
+	if (opts->nr_io_queues || opts->nr_write_queues ||
+	    opts->nr_poll_queues) {
 		ret = nvme_rdma_configure_io_queues(ctrl, new);
 		if (ret)
 			goto destroy_admin;
-- 
2.16.4


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/2] nvme: don't do scan work if io queue count is zero
  2021-08-03  9:06 [PATCH 0/2] Fix crash when rescan ns after set queue count cmd timeout Ruozhu Li
  2021-08-03  9:06 ` [PATCH 1/2] nvme-rdma: always try to configure io queue when user wants it Ruozhu Li
@ 2021-08-03  9:06 ` Ruozhu Li
  2021-08-03 15:55   ` Keith Busch
  2021-08-11  1:31   ` Sagi Grimberg
  1 sibling, 2 replies; 9+ messages in thread
From: Ruozhu Li @ 2021-08-03  9:06 UTC (permalink / raw)
  To: linux-nvme

kernel panic when try to rescan ns when io queue count is zero.Because
kernel BUG_ON when hctx_idx is greater than ctrl->queue_count.
--
nvme_rdma_init_hctx+0x58/0x60 [nvme_rdma]
blk_mq_realloc_hw_ctxs+0x140/0x4c0
blk_mq_init_allocated_queue+0x130/0x410
blk_mq_init_queue+0x40/0x88
nvme_validate_ns+0xb8/0x740
nvme_scan_work+0x29c/0x460
process_one_work+0x1f8/0x490
worker_thread+0x50/0x4b8
kthread+0x134/0x138
ret_from_fork+0x10/0x18
--
Defence it by not allowing rescan ns when io queue count is zero.

Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
---
 drivers/nvme/host/core.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index dfd9dec0c1f6..d9f837eb3e26 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -141,7 +141,8 @@ void nvme_queue_scan(struct nvme_ctrl *ctrl)
 	/*
 	 * Only new queue scan work when admin and IO queues are both alive
 	 */
-	if (ctrl->state == NVME_CTRL_LIVE && ctrl->tagset)
+	if (ctrl->state == NVME_CTRL_LIVE &&
+	    ctrl->tagset && ctrl->queue_count > 1)
 		queue_work(nvme_wq, &ctrl->scan_work);
 }
 
@@ -4047,7 +4048,8 @@ static void nvme_scan_work(struct work_struct *work)
 		container_of(work, struct nvme_ctrl, scan_work);
 
 	/* No tagset on a live ctrl means IO queues could not created */
-	if (ctrl->state != NVME_CTRL_LIVE || !ctrl->tagset)
+	if (ctrl->state != NVME_CTRL_LIVE ||
+	    !ctrl->tagset || ctrl->queue_count < 2)
 		return;
 
 	if (test_and_clear_bit(NVME_AER_NOTICE_NS_CHANGED, &ctrl->events)) {
-- 
2.16.4


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] nvme: don't do scan work if io queue count is zero
  2021-08-03  9:06 ` [PATCH 2/2] nvme: don't do scan work if io queue count is zero Ruozhu Li
@ 2021-08-03 15:55   ` Keith Busch
  2021-08-04  3:12     ` liruozhu
  2021-08-11  1:32     ` Sagi Grimberg
  2021-08-11  1:31   ` Sagi Grimberg
  1 sibling, 2 replies; 9+ messages in thread
From: Keith Busch @ 2021-08-03 15:55 UTC (permalink / raw)
  To: Ruozhu Li; +Cc: linux-nvme

On Tue, Aug 03, 2021 at 05:06:30PM +0800, Ruozhu Li wrote:
> kernel panic when try to rescan ns when io queue count is zero.Because
> kernel BUG_ON when hctx_idx is greater than ctrl->queue_count.
> --
> nvme_rdma_init_hctx+0x58/0x60 [nvme_rdma]
> blk_mq_realloc_hw_ctxs+0x140/0x4c0
> blk_mq_init_allocated_queue+0x130/0x410
> blk_mq_init_queue+0x40/0x88
> nvme_validate_ns+0xb8/0x740
> nvme_scan_work+0x29c/0x460
> process_one_work+0x1f8/0x490
> worker_thread+0x50/0x4b8
> kthread+0x134/0x138
> ret_from_fork+0x10/0x18
> --
> Defence it by not allowing rescan ns when io queue count is zero.
> 
> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
> ---
>  drivers/nvme/host/core.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index dfd9dec0c1f6..d9f837eb3e26 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -141,7 +141,8 @@ void nvme_queue_scan(struct nvme_ctrl *ctrl)
>  	/*
>  	 * Only new queue scan work when admin and IO queues are both alive
>  	 */
> -	if (ctrl->state == NVME_CTRL_LIVE && ctrl->tagset)
> +	if (ctrl->state == NVME_CTRL_LIVE &&
> +	    ctrl->tagset && ctrl->queue_count > 1)

Why is the transport driver leaving an unusable tagset allocated when
there are no IO queues to support it? I think it should free the tagset
in that case. At least that's what pci does.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] nvme: don't do scan work if io queue count is zero
  2021-08-03 15:55   ` Keith Busch
@ 2021-08-04  3:12     ` liruozhu
  2021-08-11  1:32     ` Sagi Grimberg
  1 sibling, 0 replies; 9+ messages in thread
From: liruozhu @ 2021-08-04  3:12 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme

It seems that transport driver will only release the tagset when the 
connection is completely deleted.
In reconnection, if IO tagset is created, transport driver is not 
allowed to reconnect without any IO queues.
See patch("nvme-rdma: fix possible hang when failing to set io queues").
But it didn’t work as expected, the first patch try to fix it.

This patch try to prevent similar logical problems in the transport driver.

Thanks,
Ruozhu
在 2021/8/3 23:55, Keith Busch 写道:
> On Tue, Aug 03, 2021 at 05:06:30PM +0800, Ruozhu Li wrote:
>> kernel panic when try to rescan ns when io queue count is zero.Because
>> kernel BUG_ON when hctx_idx is greater than ctrl->queue_count.
>> --
>> nvme_rdma_init_hctx+0x58/0x60 [nvme_rdma]
>> blk_mq_realloc_hw_ctxs+0x140/0x4c0
>> blk_mq_init_allocated_queue+0x130/0x410
>> blk_mq_init_queue+0x40/0x88
>> nvme_validate_ns+0xb8/0x740
>> nvme_scan_work+0x29c/0x460
>> process_one_work+0x1f8/0x490
>> worker_thread+0x50/0x4b8
>> kthread+0x134/0x138
>> ret_from_fork+0x10/0x18
>> --
>> Defence it by not allowing rescan ns when io queue count is zero.
>>
>> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
>> ---
>>   drivers/nvme/host/core.c | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index dfd9dec0c1f6..d9f837eb3e26 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -141,7 +141,8 @@ void nvme_queue_scan(struct nvme_ctrl *ctrl)
>>   	/*
>>   	 * Only new queue scan work when admin and IO queues are both alive
>>   	 */
>> -	if (ctrl->state == NVME_CTRL_LIVE && ctrl->tagset)
>> +	if (ctrl->state == NVME_CTRL_LIVE &&
>> +	    ctrl->tagset && ctrl->queue_count > 1)
> Why is the transport driver leaving an unusable tagset allocated when
> there are no IO queues to support it? I think it should free the tagset
> in that case. At least that's what pci does.
> .

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] nvme-rdma: always try to configure io queue when user wants it
  2021-08-03  9:06 ` [PATCH 1/2] nvme-rdma: always try to configure io queue when user wants it Ruozhu Li
@ 2021-08-11  1:30   ` Sagi Grimberg
  2021-08-11  1:47     ` liruozhu
  0 siblings, 1 reply; 9+ messages in thread
From: Sagi Grimberg @ 2021-08-11  1:30 UTC (permalink / raw)
  To: Ruozhu Li, linux-nvme


> During reconnection, if set queue count feature failed, we will
> set ctrl->queue_count to 1, and schedule another reconnection.But when
> ctrl->queue_count is 1, we won't try to configure io queue again.
> 
> We should always try to configure io queues if user wants it.

I'm assuming this is not relevant anymore?

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] nvme: don't do scan work if io queue count is zero
  2021-08-03  9:06 ` [PATCH 2/2] nvme: don't do scan work if io queue count is zero Ruozhu Li
  2021-08-03 15:55   ` Keith Busch
@ 2021-08-11  1:31   ` Sagi Grimberg
  1 sibling, 0 replies; 9+ messages in thread
From: Sagi Grimberg @ 2021-08-11  1:31 UTC (permalink / raw)
  To: Ruozhu Li, linux-nvme


> kernel panic when try to rescan ns when io queue count is zero.Because
> kernel BUG_ON when hctx_idx is greater than ctrl->queue_count.
> --
> nvme_rdma_init_hctx+0x58/0x60 [nvme_rdma]
> blk_mq_realloc_hw_ctxs+0x140/0x4c0
> blk_mq_init_allocated_queue+0x130/0x410
> blk_mq_init_queue+0x40/0x88
> nvme_validate_ns+0xb8/0x740
> nvme_scan_work+0x29c/0x460
> process_one_work+0x1f8/0x490
> worker_thread+0x50/0x4b8
> kthread+0x134/0x138
> ret_from_fork+0x10/0x18
> --
> Defence it by not allowing rescan ns when io queue count is zero.
> 
> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
> ---
>   drivers/nvme/host/core.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index dfd9dec0c1f6..d9f837eb3e26 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -141,7 +141,8 @@ void nvme_queue_scan(struct nvme_ctrl *ctrl)
>   	/*
>   	 * Only new queue scan work when admin and IO queues are both alive
>   	 */
> -	if (ctrl->state == NVME_CTRL_LIVE && ctrl->tagset)
> +	if (ctrl->state == NVME_CTRL_LIVE &&
> +	    ctrl->tagset && ctrl->queue_count > 1)
>   		queue_work(nvme_wq, &ctrl->scan_work);
>   }
>   
> @@ -4047,7 +4048,8 @@ static void nvme_scan_work(struct work_struct *work)
>   		container_of(work, struct nvme_ctrl, scan_work);
>   
>   	/* No tagset on a live ctrl means IO queues could not created */
> -	if (ctrl->state != NVME_CTRL_LIVE || !ctrl->tagset)
> +	if (ctrl->state != NVME_CTRL_LIVE ||
> +	    !ctrl->tagset || ctrl->queue_count < 2)
>   		return;

We can't keep growing sporadic conditionals here, its becoming a mess...

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/2] nvme: don't do scan work if io queue count is zero
  2021-08-03 15:55   ` Keith Busch
  2021-08-04  3:12     ` liruozhu
@ 2021-08-11  1:32     ` Sagi Grimberg
  1 sibling, 0 replies; 9+ messages in thread
From: Sagi Grimberg @ 2021-08-11  1:32 UTC (permalink / raw)
  To: Keith Busch, Ruozhu Li; +Cc: linux-nvme


> On Tue, Aug 03, 2021 at 05:06:30PM +0800, Ruozhu Li wrote:
>> kernel panic when try to rescan ns when io queue count is zero.Because
>> kernel BUG_ON when hctx_idx is greater than ctrl->queue_count.
>> --
>> nvme_rdma_init_hctx+0x58/0x60 [nvme_rdma]
>> blk_mq_realloc_hw_ctxs+0x140/0x4c0
>> blk_mq_init_allocated_queue+0x130/0x410
>> blk_mq_init_queue+0x40/0x88
>> nvme_validate_ns+0xb8/0x740
>> nvme_scan_work+0x29c/0x460
>> process_one_work+0x1f8/0x490
>> worker_thread+0x50/0x4b8
>> kthread+0x134/0x138
>> ret_from_fork+0x10/0x18
>> --
>> Defence it by not allowing rescan ns when io queue count is zero.
>>
>> Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
>> ---
>>   drivers/nvme/host/core.c | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index dfd9dec0c1f6..d9f837eb3e26 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -141,7 +141,8 @@ void nvme_queue_scan(struct nvme_ctrl *ctrl)
>>   	/*
>>   	 * Only new queue scan work when admin and IO queues are both alive
>>   	 */
>> -	if (ctrl->state == NVME_CTRL_LIVE && ctrl->tagset)
>> +	if (ctrl->state == NVME_CTRL_LIVE &&
>> +	    ctrl->tagset && ctrl->queue_count > 1)
> 
> Why is the transport driver leaving an unusable tagset allocated when
> there are no IO queues to support it? I think it should free the tagset
> in that case. At least that's what pci does.

We can't release the tagset as that will fail I/O (which is probably
fine for mpath) but for non-mpath we want I/O to block.

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/2] nvme-rdma: always try to configure io queue when user wants it
  2021-08-11  1:30   ` Sagi Grimberg
@ 2021-08-11  1:47     ` liruozhu
  0 siblings, 0 replies; 9+ messages in thread
From: liruozhu @ 2021-08-11  1:47 UTC (permalink / raw)
  To: Sagi Grimberg, linux-nvme

Because another solution has been applied, this problem has been solved. 
You can safely ignore this patch set now.

Thanks,
Ruozhu

On 2021/8/11 9:30, Sagi Grimberg wrote:
>
>> During reconnection, if set queue count feature failed, we will
>> set ctrl->queue_count to 1, and schedule another reconnection.But when
>> ctrl->queue_count is 1, we won't try to configure io queue again.
>>
>> We should always try to configure io queues if user wants it.
>
> I'm assuming this is not relevant anymore?
> .

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-08-11  1:47 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-03  9:06 [PATCH 0/2] Fix crash when rescan ns after set queue count cmd timeout Ruozhu Li
2021-08-03  9:06 ` [PATCH 1/2] nvme-rdma: always try to configure io queue when user wants it Ruozhu Li
2021-08-11  1:30   ` Sagi Grimberg
2021-08-11  1:47     ` liruozhu
2021-08-03  9:06 ` [PATCH 2/2] nvme: don't do scan work if io queue count is zero Ruozhu Li
2021-08-03 15:55   ` Keith Busch
2021-08-04  3:12     ` liruozhu
2021-08-11  1:32     ` Sagi Grimberg
2021-08-11  1:31   ` Sagi Grimberg

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.