Re: [PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue

From: Chao Leng <lengchao@huawei.com>
To: <mwilck@suse.com>, Keith Busch <kbusch@kernel.org>,
	Sagi Grimberg <sagi@grimberg.me>, Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>, Daniel Wagner <dwagner@suse.de>,
	<linux-nvme@lists.infradead.org>
Subject: Re: [PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
Date: Fri, 23 Apr 2021 16:35:18 +0800	[thread overview]
Message-ID: <1c178648-8740-401b-86cb-ce65ffc7d7dc@huawei.com> (raw)
In-Reply-To: <20210422152219.7067-1-mwilck@suse.com>


On 2021/4/22 23:22, mwilck@suse.com wrote:
> From: Martin Wilck <mwilck@suse.com>
> 
> We have observed a few crashes run_timer_softirq(), where a broken
> timer_list struct belonging to an anatt_timer was encountered. The broken
> structures look like this, and we see actually multiple ones attached to
> the same timer base:
> 
> crash> struct timer_list 0xffff92471bcfdc90
> struct timer_list {
>    entry = {
>      next = 0xdead000000000122,  // LIST_POISON2
>      pprev = 0x0
>    },
>    expires = 4296022933,
>    function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>    flags = 20
> }
> 
> If such a timer is encountered in run_timer_softirq(), the kernel
> crashes. The test scenario was an I/O load test with lots of NVMe
> controllers, some of which were removed and re-added on the storage side.
> 
> I think this may happen if the rdma recovery_work starts, in this call
> chain:
> 
> nvme_rdma_error_recovery_work()
>    /* this stops all sorts of activity for the controller, but not
>       the multipath-related work queue and timer */
>    nvme_rdma_reconnect_or_remove(ctrl)
>      => kicks reconnect_work
> 
> work queue: reconnect_work
> 
> nvme_rdma_reconnect_ctrl_work()
>    nvme_rdma_setup_ctrl()
>      nvme_rdma_configure_admin_queue()
>         nvme_init_identify()
>            nvme_mpath_init()
> 	     # this sets some fields of the timer_list without taking a lock
>               timer_setup()
>               nvme_read_ana_log()
> 	         mod_timer() or del_timer_sync()
> 
> Similar for TCP. The idea for the patch is based on the observation that
> nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(),
> whereas nvme_rdma_error_recovery_work() stops only the keepalive timer, but
> not the anatt timer.
> 
> I admit that the root cause analysis isn't rock solid yet. In particular, I
> can't explain why we see LIST_POISON2 in the "next" pointer, which would
> indicate that the timer has been detached before; yet we find it linked to
> the timer base when the crash occurs.
> 
> OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above) and
> nvme_mpath_stop(), so the hypothesis that modifying active timers may cause
> the issue isn't totally out of sight. I suspect that the LIST_POISON2 may
> come to pass in multiple steps.
> 
> If anyone has better ideas, please advise. The issue occurs very
> sporadically; verifying this by testing will be difficult.
> 
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>   drivers/nvme/host/multipath.c | 1 +
>   drivers/nvme/host/rdma.c      | 1 +
>   drivers/nvme/host/tcp.c       | 1 +
>   3 files changed, 3 insertions(+)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index a1d476e1ac02..c63dd5dfa7ff 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -586,6 +586,7 @@ void nvme_mpath_stop(struct nvme_ctrl *ctrl)
>   	del_timer_sync(&ctrl->anatt_timer);
>   	cancel_work_sync(&ctrl->ana_work);
>   }
> +EXPORT_SYMBOL_GPL(nvme_mpath_stop);
>   
>   #define SUBSYS_ATTR_RW(_name, _mode, _show, _store)  \
>   	struct device_attribute subsys_attr_##_name =	\
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index be905d4fdb47..062f3be0bb4f 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1189,6 +1189,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
>   	struct nvme_rdma_ctrl *ctrl = container_of(work,
>   			struct nvme_rdma_ctrl, err_work);
>   
> +	nvme_mpath_stop(&ctrl->ctrl);
If ana_work is running, this may cause wait for long time, because
nvme_get_log may time out(default 60s). If work with multipathing,
it will cause fail over delay, service will pause long time.
It is not we expected.
We just need to do this before reconnecting, so move it until
calling nvme_rdma_reconnect_or_remove.
Like this:
+	nvme_mpath_stop(ctrl);
	nvme_rdma_reconnect_or_remove(ctrl);

>   	nvme_stop_keep_alive(&ctrl->ctrl);
>   	nvme_rdma_teardown_io_queues(ctrl, false);
>   	nvme_start_queues(&ctrl->ctrl);
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index a0f00cb8f9f3..ac9212a2de59 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2054,6 +2054,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
>   				struct nvme_tcp_ctrl, err_work);
>   	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>   
> +	nvme_mpath_stop(ctrl);
>   	nvme_stop_keep_alive(ctrl);
>   	nvme_tcp_teardown_io_queues(ctrl, false);
>   	/* unquiesce to fail fast pending requests */
> 

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme