* [PATCH v4] nvme: rdma/tcp: fix list corruption with anatt timer
@ 2021-04-27  9:31 mwilck
  2021-04-29 12:24 ` Christoph Hellwig
  0 siblings, 1 reply; 4+ messages in thread
From: mwilck @ 2021-04-27  9:31 UTC (permalink / raw)
  To: Keith Busch, Sagi Grimberg, Christoph Hellwig, Chao Leng
  Cc: Hannes Reinecke, Daniel Wagner, linux-nvme, Martin Wilck, stable

From: Martin Wilck <mwilck@suse.com>

We have observed a few crashes in run_timer_softirq(), where a broken
timer_list struct belonging to an anatt_timer was encountered. The broken
structures look like this, and we actually see multiple of them attached
to the same timer base:

crash> struct timer_list 0xffff92471bcfdc90
struct timer_list {
  entry = {
    next = 0xdead000000000122,  // LIST_POISON2
    pprev = 0x0
  },
  expires = 4296022933,
  function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
  flags = 20
}

If such a timer is encountered in run_timer_softirq(), the kernel
crashes. The test scenario was an I/O load test with lots of NVMe
controllers, some of which were removed and re-added on the storage side.

I think this may happen when the RDMA error recovery work starts, via
this call chain:

nvme_rdma_error_recovery_work()
  /* this stops all sorts of activity for the controller, but not
     the multipath-related work queue and timer */
  nvme_rdma_reconnect_or_remove(ctrl)
    => kicks reconnect_work

work queue: reconnect_work

nvme_rdma_reconnect_ctrl_work()
  nvme_rdma_setup_ctrl()
    nvme_rdma_configure_admin_queue()
      nvme_init_identify()
        nvme_mpath_init()
          /* this sets some fields of the timer_list without taking a lock */
          timer_setup()
          nvme_read_ana_log()
            mod_timer() or del_timer_sync()

The same applies to TCP. The idea for the patch is based on the observation
that nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(),
whereas nvme_rdma_error_recovery_work() stops only the keep-alive timer, but
not the anatt timer. Also, nvme_mpath_init() is the only place where
the anatt_timer structure is accessed without locking.

[The following paragraph was contributed by Chao Leng <lengchao@huawei.com>]

The sequence may be: (1) ana_work adds the timer; (2) error recovery
occurs and, while reconnecting, the timer is reinitialized and
nvme_read_ana_log() is called, which may add the timer again. The same
timer is thus added twice, and a crash happens later.

This situation has actually been observed in a crash dump, where we found
a pending anatt timer that had been started ~80 s earlier, even though a
log message showed that the anatt timer for the same controller had timed
out only a few seconds before. This can only be explained by the same
timer having been attached to the timer base multiple times.
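
To illustrate the failure mode, here is a minimal userspace sketch (a
simplified hlist-style node and list head, not the actual kernel timer
code) of what happens when a node that is already queued gets enqueued a
second time, which is effectively what timer_setup() + mod_timer() do to
a still-pending timer:

#include <stdio.h>
#include <stddef.h>

#define POISON2 ((struct hnode *)0xdead000000000122UL)	/* LIST_POISON2 */

/* simplified stand-in for the hlist_node embedded in timer_list.entry */
struct hnode {
	struct hnode *next;
	struct hnode **pprev;
};

struct hhead {
	struct hnode *first;
};

/* analogue of hlist_add_head(): assumes @n is NOT already on a list */
static void hadd(struct hhead *h, struct hnode *n)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* analogue of detaching and poisoning a node, as timer expiry does */
static void hdel_poison(struct hnode *n)
{
	struct hnode *next = n->next;
	struct hnode **pprev = n->pprev;

	*pprev = next;
	if (next)
		next->pprev = pprev;
	n->next = POISON2;
	n->pprev = NULL;
}

int main(void)
{
	struct hhead base = { 0 };
	struct hnode timer = { 0 }, other = { 0 };

	hadd(&base, &timer);	/* ana_work arms the timer                 */
	hadd(&base, &timer);	/* re-init path arms the SAME timer again  */
	hadd(&base, &other);	/* some unrelated timer on the same base   */

	hdel_poison(&timer);	/* the timer expires once: detach + poison */

	/*
	 * The stale second enqueue is still reachable from the list head,
	 * so walking the bucket finds a node with next == LIST_POISON2 and
	 * pprev == NULL -- the same picture as in the crash dump above.
	 */
	for (struct hnode *n = base.first; n; n = n->next) {
		printf("node %p next %p pprev %p\n",
		       (void *)n, (void *)n->next, (void *)n->pprev);
		if (n->next == POISON2)
			break;	/* the kernel would crash dereferencing this */
	}
	return 0;
}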

Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Cc: stable@vger.kernel.org

---
Changes in v4: Updated commit message with Chao Leng's analysis, as
suggested by Daniel Wagner.

Changes in v3: Changed the subject line, as suggested by Sagi Grimberg.

Changes in v2: Moved the call to nvme_mpath_stop() further down, directly
before the call to nvme_rdma_reconnect_or_remove() (Chao Leng).
---
 drivers/nvme/host/multipath.c | 1 +
 drivers/nvme/host/rdma.c      | 1 +
 drivers/nvme/host/tcp.c       | 1 +
 3 files changed, 3 insertions(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac02..c63dd5dfa7ff 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -586,6 +586,7 @@ void nvme_mpath_stop(struct nvme_ctrl *ctrl)
 	del_timer_sync(&ctrl->anatt_timer);
 	cancel_work_sync(&ctrl->ana_work);
 }
+EXPORT_SYMBOL_GPL(nvme_mpath_stop);
 
 #define SUBSYS_ATTR_RW(_name, _mode, _show, _store)  \
 	struct device_attribute subsys_attr_##_name =	\
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index be905d4fdb47..fc07a7b0dc1d 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1202,6 +1202,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 		return;
 	}
 
+	nvme_mpath_stop(&ctrl->ctrl);
 	nvme_rdma_reconnect_or_remove(ctrl);
 }
 
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a0f00cb8f9f3..46287b4f4d10 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2068,6 +2068,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
 		return;
 	}
 
+	nvme_mpath_stop(ctrl);
 	nvme_tcp_reconnect_or_remove(ctrl);
 }
 
-- 
2.31.1



* Re: [PATCH v4] nvme: rdma/tcp: fix list corruption with anatt timer
  2021-04-27  9:31 [PATCH v4] nvme: rdma/tcp: fix list corruption with anatt timer mwilck
@ 2021-04-29 12:24 ` Christoph Hellwig
  2021-05-04  7:52   ` Christoph Hellwig
  0 siblings, 1 reply; 4+ messages in thread
From: Christoph Hellwig @ 2021-04-29 12:24 UTC (permalink / raw)
  To: mwilck
  Cc: Keith Busch, Sagi Grimberg, Christoph Hellwig, Chao Leng,
	Hannes Reinecke, Daniel Wagner, linux-nvme, stable

Martin,

can you give this patch a spin and check if this solves your issue?

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 0d0de3433f37..68f4d9d0ce58 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -780,6 +780,8 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 
 int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 {
+	size_t max_transfer_size = ctrl->max_hw_sectors << SECTOR_SHIFT;
+	size_t ana_log_size;
 	int error;
 
 	/* check if multipath is enabled and we have the capability */
@@ -787,47 +789,45 @@ int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	    !(ctrl->subsys->cmic & NVME_CTRL_CMIC_ANA))
 		return 0;
 
+	if (!ctrl->identified) {
+		mutex_init(&ctrl->ana_lock);
+		timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
+		INIT_WORK(&ctrl->ana_work, nvme_ana_work);
+	}
+
 	ctrl->anacap = id->anacap;
 	ctrl->anatt = id->anatt;
 	ctrl->nanagrpid = le32_to_cpu(id->nanagrpid);
 	ctrl->anagrpmax = le32_to_cpu(id->anagrpmax);
 
-	mutex_init(&ctrl->ana_lock);
-	timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
-	ctrl->ana_log_size = sizeof(struct nvme_ana_rsp_hdr) +
-		ctrl->nanagrpid * sizeof(struct nvme_ana_group_desc);
-	ctrl->ana_log_size += ctrl->max_namespaces * sizeof(__le32);
-
-	if (ctrl->ana_log_size > ctrl->max_hw_sectors << SECTOR_SHIFT) {
+	ana_log_size = sizeof(struct nvme_ana_rsp_hdr) +
+		ctrl->nanagrpid * sizeof(struct nvme_ana_group_desc) +
+		ctrl->max_namespaces * sizeof(__le32);
+	if (ana_log_size > max_transfer_size) {
 		dev_err(ctrl->device,
-			"ANA log page size (%zd) larger than MDTS (%d).\n",
-			ctrl->ana_log_size,
-			ctrl->max_hw_sectors << SECTOR_SHIFT);
+			"ANA log page size (%zd) larger than MDTS (%zd).\n",
+			ana_log_size, max_transfer_size);
 		dev_err(ctrl->device, "disabling ANA support.\n");
 		return 0;
 	}
 
-	INIT_WORK(&ctrl->ana_work, nvme_ana_work);
-	kfree(ctrl->ana_log_buf);
-	ctrl->ana_log_buf = kmalloc(ctrl->ana_log_size, GFP_KERNEL);
-	if (!ctrl->ana_log_buf) {
-		error = -ENOMEM;
-		goto out;
+	if (ana_log_size > ctrl->ana_log_size) {
+		nvme_mpath_uninit(ctrl);
+		ctrl->ana_log_buf = kmalloc(ana_log_size, GFP_KERNEL);
+		if (!ctrl->ana_log_buf)
+			return -ENOMEM;
+		ctrl->ana_log_size = ana_log_size;
 	}
 
 	error = nvme_read_ana_log(ctrl);
 	if (error)
-		goto out_free_ana_log_buf;
-	return 0;
-out_free_ana_log_buf:
-	kfree(ctrl->ana_log_buf);
-	ctrl->ana_log_buf = NULL;
-out:
+		nvme_mpath_uninit(ctrl);
 	return error;
 }
 
 void nvme_mpath_uninit(struct nvme_ctrl *ctrl)
 {
+	nvme_mpath_stop(ctrl);
 	kfree(ctrl->ana_log_buf);
 	ctrl->ana_log_buf = NULL;
 }


* Re: [PATCH v4] nvme: rdma/tcp: fix list corruption with anatt timer
  2021-04-29 12:24 ` Christoph Hellwig
@ 2021-05-04  7:52   ` Christoph Hellwig
  2021-05-04  8:08     ` Martin Wilck
  0 siblings, 1 reply; 4+ messages in thread
From: Christoph Hellwig @ 2021-05-04  7:52 UTC (permalink / raw)
  To: mwilck
  Cc: Keith Busch, Sagi Grimberg, Christoph Hellwig, Chao Leng,
	Hannes Reinecke, Daniel Wagner, linux-nvme, stable

On Thu, Apr 29, 2021 at 02:24:33PM +0200, Christoph Hellwig wrote:
> Martin,
> 
> can you give this patch a spin and check if this solves your issue?

ping?



* Re: [PATCH v4] nvme: rdma/tcp: fix list corruption with anatt timer
  2021-05-04  7:52   ` Christoph Hellwig
@ 2021-05-04  8:08     ` Martin Wilck
  0 siblings, 0 replies; 4+ messages in thread
From: Martin Wilck @ 2021-05-04  8:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Keith Busch, Sagi Grimberg, Chao Leng, Hannes Reinecke,
	Daniel Wagner, linux-nvme, stable

Hello Christoph,

On Tue, 2021-05-04 at 09:52 +0200, Christoph Hellwig wrote:
> On Thu, Apr 29, 2021 at 02:24:33PM +0200, Christoph Hellwig wrote:
> > Martin,
> > 
> > can you give this patch a spin and check if this solves your issue?
> 
> ping?

I've provided a test kernel with your patch to the SUSE partner in
question, but the issue is hard to reproduce, so the test will take some
time.

Regards
Martin



