* [PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
@ 2021-04-22 15:22 mwilck
2021-04-23 8:35 ` Chao Leng
2021-04-23 14:07 ` Ewan D. Milne
0 siblings, 2 replies; 4+ messages in thread
From: mwilck @ 2021-04-22 15:22 UTC (permalink / raw)
To: Keith Busch, Sagi Grimberg, Christoph Hellwig
Cc: Hannes Reinecke, Daniel Wagner, linux-nvme, Martin Wilck
From: Martin Wilck <mwilck@suse.com>
We have observed a few crashes in run_timer_softirq(), where a broken
timer_list struct belonging to an anatt_timer was encountered. The broken
structures look like this, and we actually see multiple of them attached
to the same timer base:
crash> struct timer_list 0xffff92471bcfdc90
struct timer_list {
  entry = {
    next = 0xdead000000000122, // LIST_POISON2
    pprev = 0x0
  },
  expires = 4296022933,
  function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
  flags = 20
}
If such a timer is encountered in run_timer_softirq(), the kernel
crashes. The test scenario was an I/O load test with lots of NVMe
controllers, some of which were removed and re-added on the storage side.
I think this may happen if the rdma recovery_work starts, in this call
chain:
nvme_rdma_error_recovery_work()
    /* this stops all sorts of activity for the controller, but not
       the multipath-related work queue and timer */
    nvme_rdma_reconnect_or_remove(ctrl)
        => kicks reconnect_work

work queue: reconnect_work

nvme_rdma_reconnect_ctrl_work()
    nvme_rdma_setup_ctrl()
        nvme_rdma_configure_admin_queue()
            nvme_init_identify()
                nvme_mpath_init()
                    # this sets some fields of the timer_list without taking a lock
                    timer_setup()
                    nvme_read_ana_log()
                        mod_timer() or del_timer_sync()
Similar for TCP. The idea for the patch is based on the observation that
nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(),
whereas nvme_rdma_error_recovery_work() stops only the keepalive timer, but
not the anatt timer.
I admit that the root cause analysis isn't rock solid yet. In particular, I
can't explain why we see LIST_POISON2 in the "next" pointer, which would
indicate that the timer has been detached before; yet we find it linked to
the timer base when the crash occurs.
OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above) and
nvme_mpath_stop(), so the hypothesis that modifying an active timer causes
the issue isn't far-fetched. I suspect the LIST_POISON2 state may come
about in multiple steps.
If anyone has better ideas, please advise. The issue occurs very
sporadically; verifying this by testing will be difficult.
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
drivers/nvme/host/multipath.c | 1 +
drivers/nvme/host/rdma.c | 1 +
drivers/nvme/host/tcp.c | 1 +
3 files changed, 3 insertions(+)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac02..c63dd5dfa7ff 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -586,6 +586,7 @@ void nvme_mpath_stop(struct nvme_ctrl *ctrl)
del_timer_sync(&ctrl->anatt_timer);
cancel_work_sync(&ctrl->ana_work);
}
+EXPORT_SYMBOL_GPL(nvme_mpath_stop);
#define SUBSYS_ATTR_RW(_name, _mode, _show, _store) \
struct device_attribute subsys_attr_##_name = \
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index be905d4fdb47..062f3be0bb4f 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1189,6 +1189,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
struct nvme_rdma_ctrl *ctrl = container_of(work,
struct nvme_rdma_ctrl, err_work);
+ nvme_mpath_stop(&ctrl->ctrl);
nvme_stop_keep_alive(&ctrl->ctrl);
nvme_rdma_teardown_io_queues(ctrl, false);
nvme_start_queues(&ctrl->ctrl);
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a0f00cb8f9f3..ac9212a2de59 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2054,6 +2054,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
struct nvme_tcp_ctrl, err_work);
struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
+ nvme_mpath_stop(ctrl);
nvme_stop_keep_alive(ctrl);
nvme_tcp_teardown_io_queues(ctrl, false);
/* unquiesce to fail fast pending requests */
--
2.31.1
_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme
* Re: [PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
2021-04-22 15:22 [PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue mwilck
@ 2021-04-23 8:35 ` Chao Leng
2021-04-23 9:33 ` Martin Wilck
2021-04-23 14:07 ` Ewan D. Milne
1 sibling, 1 reply; 4+ messages in thread
From: Chao Leng @ 2021-04-23 8:35 UTC (permalink / raw)
To: mwilck, Keith Busch, Sagi Grimberg, Christoph Hellwig
Cc: Hannes Reinecke, Daniel Wagner, linux-nvme
On 2021/4/22 23:22, mwilck@suse.com wrote:
> From: Martin Wilck <mwilck@suse.com>
>
> We have observed a few crashes in run_timer_softirq(), where a broken
> timer_list struct belonging to an anatt_timer was encountered. The broken
> structures look like this, and we actually see multiple of them attached
> to the same timer base:
>
> crash> struct timer_list 0xffff92471bcfdc90
> struct timer_list {
>   entry = {
>     next = 0xdead000000000122, // LIST_POISON2
>     pprev = 0x0
>   },
>   expires = 4296022933,
>   function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>   flags = 20
> }
>
> If such a timer is encountered in run_timer_softirq(), the kernel
> crashes. The test scenario was an I/O load test with lots of NVMe
> controllers, some of which were removed and re-added on the storage side.
>
> I think this may happen if the rdma recovery_work starts, in this call
> chain:
>
> nvme_rdma_error_recovery_work()
>     /* this stops all sorts of activity for the controller, but not
>        the multipath-related work queue and timer */
>     nvme_rdma_reconnect_or_remove(ctrl)
>         => kicks reconnect_work
>
> work queue: reconnect_work
>
> nvme_rdma_reconnect_ctrl_work()
>     nvme_rdma_setup_ctrl()
>         nvme_rdma_configure_admin_queue()
>             nvme_init_identify()
>                 nvme_mpath_init()
>                     # this sets some fields of the timer_list without taking a lock
>                     timer_setup()
>                     nvme_read_ana_log()
>                         mod_timer() or del_timer_sync()
>
> Similar for TCP. The idea for the patch is based on the observation that
> nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(),
> whereas nvme_rdma_error_recovery_work() stops only the keepalive timer, but
> not the anatt timer.
>
> I admit that the root cause analysis isn't rock solid yet. In particular, I
> can't explain why we see LIST_POISON2 in the "next" pointer, which would
> indicate that the timer has been detached before; yet we find it linked to
> the timer base when the crash occurs.
>
> OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above) and
> nvme_mpath_stop(), so the hypothesis that modifying an active timer causes
> the issue isn't far-fetched. I suspect the LIST_POISON2 state may come
> about in multiple steps.
>
> If anyone has better ideas, please advise. The issue occurs very
> sporadically; verifying this by testing will be difficult.
>
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
> drivers/nvme/host/multipath.c | 1 +
> drivers/nvme/host/rdma.c | 1 +
> drivers/nvme/host/tcp.c | 1 +
> 3 files changed, 3 insertions(+)
>
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index a1d476e1ac02..c63dd5dfa7ff 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -586,6 +586,7 @@ void nvme_mpath_stop(struct nvme_ctrl *ctrl)
> del_timer_sync(&ctrl->anatt_timer);
> cancel_work_sync(&ctrl->ana_work);
> }
> +EXPORT_SYMBOL_GPL(nvme_mpath_stop);
>
> #define SUBSYS_ATTR_RW(_name, _mode, _show, _store) \
> struct device_attribute subsys_attr_##_name = \
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index be905d4fdb47..062f3be0bb4f 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1189,6 +1189,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
> struct nvme_rdma_ctrl *ctrl = container_of(work,
> struct nvme_rdma_ctrl, err_work);
>
> + nvme_mpath_stop(&ctrl->ctrl);
If ana_work is running, this may cause a long wait, because
nvme_get_log() can time out (default 60s). With multipathing, that
delays failover, and service will pause for a long time, which is not
what we expect.
We only need to do this before reconnecting, so move the call to just
before nvme_rdma_reconnect_or_remove().
Like this:
+	nvme_mpath_stop(ctrl);
 	nvme_rdma_reconnect_or_remove(ctrl);
> nvme_stop_keep_alive(&ctrl->ctrl);
> nvme_rdma_teardown_io_queues(ctrl, false);
> nvme_start_queues(&ctrl->ctrl);
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index a0f00cb8f9f3..ac9212a2de59 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2054,6 +2054,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
> struct nvme_tcp_ctrl, err_work);
> struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>
> + nvme_mpath_stop(ctrl);
> nvme_stop_keep_alive(ctrl);
> nvme_tcp_teardown_io_queues(ctrl, false);
> /* unquiesce to fail fast pending requests */
>
* Re: [PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
2021-04-23 8:35 ` Chao Leng
@ 2021-04-23 9:33 ` Martin Wilck
0 siblings, 0 replies; 4+ messages in thread
From: Martin Wilck @ 2021-04-23 9:33 UTC (permalink / raw)
To: Chao Leng, Keith Busch, Sagi Grimberg, Christoph Hellwig
Cc: Hannes Reinecke, Daniel Wagner, linux-nvme
On Fri, 2021-04-23 at 16:35 +0800, Chao Leng wrote:
>
> If ana_work is running, this may cause a long wait, because
> nvme_get_log() can time out (default 60s). With multipathing, that
> delays failover, and service will pause for a long time, which is not
> what we expect.
> We only need to do this before reconnecting, so move the call to just
> before nvme_rdma_reconnect_or_remove().
> Like this:
> ...
Thanks, that makes sense. I'll look into it.
Martin
* Re: [PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
2021-04-22 15:22 [PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue mwilck
2021-04-23 8:35 ` Chao Leng
@ 2021-04-23 14:07 ` Ewan D. Milne
1 sibling, 0 replies; 4+ messages in thread
From: Ewan D. Milne @ 2021-04-23 14:07 UTC (permalink / raw)
To: linux-nvme
On Thu, 2021-04-22 at 17:22 +0200, mwilck@suse.com wrote:
> From: Martin Wilck <mwilck@suse.com>
>
> We have observed a few crashes in run_timer_softirq(), where a broken
> timer_list struct belonging to an anatt_timer was encountered. The broken
> structures look like this, and we actually see multiple of them attached
> to the same timer base:
>
> crash> struct timer_list 0xffff92471bcfdc90
> struct timer_list {
>   entry = {
>     next = 0xdead000000000122, // LIST_POISON2
>     pprev = 0x0
>   },
>   expires = 4296022933,
>   function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>   flags = 20
> }
>
> If such a timer is encountered in run_timer_softirq(), the kernel
> crashes. The test scenario was an I/O load test with lots of NVMe
> controllers, some of which were removed and re-added on the storage side.
>
> I think this may happen if the rdma recovery_work starts, in this call
> chain:
>
> nvme_rdma_error_recovery_work()
>     /* this stops all sorts of activity for the controller, but not
>        the multipath-related work queue and timer */
>     nvme_rdma_reconnect_or_remove(ctrl)
>         => kicks reconnect_work
>
> work queue: reconnect_work
>
> nvme_rdma_reconnect_ctrl_work()
>     nvme_rdma_setup_ctrl()
>         nvme_rdma_configure_admin_queue()
>             nvme_init_identify()
>                 nvme_mpath_init()
>                     # this sets some fields of the timer_list without taking a lock
>                     timer_setup()
>                     nvme_read_ana_log()
>                         mod_timer() or del_timer_sync()
>
> Similar for TCP. The idea for the patch is based on the observation that
> nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(),
> whereas nvme_rdma_error_recovery_work() stops only the keepalive timer,
> but not the anatt timer.
>
> I admit that the root cause analysis isn't rock solid yet. In particular, I
> can't explain why we see LIST_POISON2 in the "next" pointer, which would
> indicate that the timer has been detached before; yet we find it linked to
> the timer base when the crash occurs.
>
> OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above) and
> nvme_mpath_stop(), so the hypothesis that modifying an active timer causes
> the issue isn't far-fetched. I suspect the LIST_POISON2 state may come
> about in multiple steps.
>
> If anyone has better ideas, please advise. The issue occurs very
> sporadically; verifying this by testing will be difficult.
>
We have a couple of reports of this same crash: one when traversing the
timer list and one when removing the timer. Note LIST_POISON2 in RAX.
I was also thinking that stopping the timer would resolve it.
It's somewhat hard to reproduce, so it may take a while to verify,
but we'll test this patch out.
-Ewan
crash> bt
PID: 0 TASK: ffffffffbae12780 CPU: 0 COMMAND: "swapper/0"
#0 [ffff922477a03cd8] machine_kexec at ffffffffb9c5bf3e
#1 [ffff922477a03d30] __crash_kexec at ffffffffb9d6072d
#2 [ffff922477a03df8] crash_kexec at ffffffffb9d6160d
#3 [ffff922477a03e10] oops_end at ffffffffb9c22d4d
#4 [ffff922477a03e30] general_protection at ffffffffba6011fe
[exception RIP: run_timer_softirq+294]
RIP: ffffffffb9d3ea36 RSP: ffff922477a03ee0 RFLAGS: 00010086
RAX: dead000000000200 RBX: ffff922477a1aa80 RCX: 0000000000000100
RDX: ffff922477a03ef0 RSI: 0000000000000002 RDI: ffff922477a1aa80
RBP: 0000000000000001 R8: ffffffffbae12780 R9: 0000000000000000
R10: 0000000000000310 R11: 000000000b5ad84c R12: 000000011dd60000
R13: ffff922477a03ef0 R14: ffff922472de0f80 R15: ffffffffbae05100
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffff922477a03f68] __softirqentry_text_start at ffffffffba8000e4
#6 [ffff922477a03fc8] irq_exit at ffffffffb9cbc1d7
#7 [ffff922477a03fd8] smp_apic_timer_interrupt at ffffffffba6027e4
#8 [ffff922477a03ff0] apic_timer_interrupt at ffffffffba601d6f
--- <IRQ stack> ---
#9 [ffffffffbae03dd8] apic_timer_interrupt at ffffffffba601d6f
[exception RIP: native_safe_halt+14]
RIP: ffffffffba4d667e RSP: ffffffffbae03e80 RFLAGS: 00000246
RAX: ffffffffba4d62d0 RBX: 0000000000000000 RCX: 00000000000f4240
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000082
RBP: 0000000000000000 R8: 000266f562c6c886 R9: 0000000000000001
R10: 000000000000031f R11: ffff9224722cb400 R12: ffffffffffffffff
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: ffffffffffffff13 CS: 0010 SS: 0018
#10 [ffffffffbae03e80] default_idle at ffffffffba4d62ec
#11 [ffffffffbae03ea8] do_idle at ffffffffb9cebbb7
#12 [ffffffffbae03ef0] cpu_startup_entry at ffffffffb9cebe0f
#13 [ffffffffbae03f10] start_kernel at ffffffffbb6b21e5
#14 [ffffffffbae03f50] secondary_startup_64 at ffffffffb9c000e7
(different crash)
PID: 69334 TASK: ffff8d02a652dd00 CPU: 0 COMMAND: "kworker/u2:2"
#0 [ffffa658418cfb28] machine_kexec at ffffffffb565bf3e
#1 [ffffa658418cfb80] __crash_kexec at ffffffffb576072d
#2 [ffffa658418cfc48] crash_kexec at ffffffffb576160d
#3 [ffffa658418cfc60] oops_end at ffffffffb5622d4d
#4 [ffffa658418cfc80] general_protection at ffffffffb60011fe
[exception RIP: detach_if_pending+58]
RIP: ffffffffb573cdfa RSP: ffffa658418cfd38 RFLAGS: 00010086
RAX: dead000000000200 RBX: ffff8d02aac74f80 RCX: 0000000000000000
RDX: ffffa65841d5bb68 RSI: ffff8d02afa1aa80 RDI: ffff8d02aac74f80
RBP: ffff8d02aac74f80 R8: ffff8d02a652dd00 R9: 0000000000000000
R10: 0000000002f41000 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000000 R14: ffff8d02a7621e80 R15: ffff8d02aac750a0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#5 [ffffa658418cfd50] try_to_del_timer_sync at ffffffffb573cf4f
#6 [ffffa658418cfd78] del_timer_sync at ffffffffb573cfb5
#7 [ffffa658418cfd88] nvme_read_ana_log at ffffffffc01df657 [nvme_core]
#8 [ffffa658418cfdb8] nvme_mpath_init at ffffffffc01e0642 [nvme_core]
#9 [ffffa658418cfdd0] nvme_init_identify at ffffffffc01ddba8 [nvme_core]
#10 [ffffa658418cfe38] nvme_tcp_setup_ctrl at ffffffffc040e2e9 [nvme_tcp]
#11 [ffffa658418cfe80] nvme_tcp_reconnect_ctrl_work at ffffffffc040e4bf [nvme_tcp]
#12 [ffffa658418cfe98] process_one_work at ffffffffb56d3477
#13 [ffffa658418cfed8] worker_thread at ffffffffb56d3b40
#14 [ffffa658418cff10] kthread at ffffffffb56d9502
#15 [ffffa658418cff50] ret_from_fork at ffffffffb6000255