* [PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
@ 2021-04-22 15:22 mwilck
  2021-04-23  8:35 ` Chao Leng
  2021-04-23 14:07 ` Ewan D. Milne
  0 siblings, 2 replies; 4+ messages in thread
From: mwilck @ 2021-04-22 15:22 UTC (permalink / raw)
  To: Keith Busch, Sagi Grimberg, Christoph Hellwig
  Cc: Hannes Reinecke, Daniel Wagner, linux-nvme, Martin Wilck

From: Martin Wilck <mwilck@suse.com>

We have observed a few crashes in run_timer_softirq(), where a broken
timer_list struct belonging to an anatt_timer was encountered. The broken
structures look like this, and we actually see multiple of them attached to
the same timer base:

crash> struct timer_list 0xffff92471bcfdc90
struct timer_list {
  entry = {
    next = 0xdead000000000122,  // LIST_POISON2
    pprev = 0x0
  },
  expires = 4296022933,
  function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
  flags = 20
}

If such a timer is encountered in run_timer_softirq(), the kernel
crashes. The test scenario was an I/O load test with lots of NVMe
controllers, some of which were removed and re-added on the storage side.

I think this may happen if the rdma recovery_work starts, in this call
chain:

nvme_rdma_error_recovery_work()
  /* this stops all sorts of activity for the controller, but not
     the multipath-related work queue and timer */
  nvme_rdma_reconnect_or_remove(ctrl)
    => kicks reconnect_work

work queue: reconnect_work

nvme_rdma_reconnect_ctrl_work()
  nvme_rdma_setup_ctrl()
    nvme_rdma_configure_admin_queue()
       nvme_init_identify()
          nvme_mpath_init()
	     # this sets some fields of the timer_list without taking a lock
             timer_setup()
             nvme_read_ana_log()
	         mod_timer() or del_timer_sync()
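
A minimal sketch of the pattern at the end of this chain (paraphrased
and simplified, not the verbatim kernel code): the timer_list is
embedded in the long-lived nvme_ctrl and is re-initialized on every
(re)connect, so a timer that is still pending from the previous
connection can have its fields rewritten without any synchronization
against the timer base:

  int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
  {
          ctrl->anatt = id->anatt;

          /* re-runs on every reconnect; rewrites entry/flags of a
           * timer that may still be pending on a timer base */
          timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);

          /* nvme_read_ana_log() -> mod_timer()/del_timer_sync() then
           * operates on the possibly corrupted timer_list */
          return nvme_read_ana_log(ctrl);
  }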

Similar for TCP. The idea for the patch is based on the observation that
nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(),
whereas nvme_rdma_error_recovery_work() stops only the keepalive timer, but
not the anatt timer.
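
Side by side (call chains abbreviated to the parts relevant for the
anatt timer):

  nvme_rdma_reset_ctrl_work()
    nvme_stop_ctrl(&ctrl->ctrl)
      nvme_mpath_stop()              # del_timer_sync(&ctrl->anatt_timer)
                                     # cancel_work_sync(&ctrl->ana_work)
      nvme_stop_keep_alive()

  nvme_rdma_error_recovery_work()
    nvme_stop_keep_alive(&ctrl->ctrl)
    ...                              # anatt_timer / ana_work keep running
    nvme_rdma_reconnect_or_remove(ctrl)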

I admit that the root cause analysis isn't rock solid yet. In particular, I
can't explain why we see LIST_POISON2 in the "next" pointer, which would
indicate that the timer has been detached before; yet we find it linked to
the timer base when the crash occurs.
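
For reference, that poison pattern is what detach_timer() leaves behind
when a timer is removed with clear_pending == true (from
kernel/time/timer.c, lightly annotated; quoted only to explain the
observed values):

  static inline void detach_timer(struct timer_list *timer, bool clear_pending)
  {
          struct hlist_node *entry = &timer->entry;

          debug_deactivate(timer);

          __hlist_del(entry);
          if (clear_pending)
                  entry->pprev = NULL;
          entry->next = LIST_POISON2;     /* 0xdead000000000122 */
  }

So next == LIST_POISON2 together with pprev == NULL means the entry was
detached at least once; the open question is how it becomes reachable
from a timer base again afterwards.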

OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above) and
nvme_mpath_stop(), so the hypothesis that modifying active timers may cause
the issue isn't too far-fetched. I suspect that the LIST_POISON2 may
come to pass in multiple steps.

If anyone has better ideas, please advise. The issue occurs very
sporadically; verifying this by testing will be difficult.

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 drivers/nvme/host/multipath.c | 1 +
 drivers/nvme/host/rdma.c      | 1 +
 drivers/nvme/host/tcp.c       | 1 +
 3 files changed, 3 insertions(+)
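
Note on the EXPORT_SYMBOL_GPL hunk: nvme_mpath_stop() lives in nvme-core
and has so far only been called from within that module; nvme-rdma and
nvme-tcp are separate modules, so the symbol needs to be exported for
them. For CONFIG_NVME_MULTIPATH=n builds the new callers should compile
unchanged, assuming the usual static inline stub in nvme.h, i.e.
something like:

  static inline void nvme_mpath_stop(struct nvme_ctrl *ctrl)
  {
  }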

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac02..c63dd5dfa7ff 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -586,6 +586,7 @@ void nvme_mpath_stop(struct nvme_ctrl *ctrl)
 	del_timer_sync(&ctrl->anatt_timer);
 	cancel_work_sync(&ctrl->ana_work);
 }
+EXPORT_SYMBOL_GPL(nvme_mpath_stop);
 
 #define SUBSYS_ATTR_RW(_name, _mode, _show, _store)  \
 	struct device_attribute subsys_attr_##_name =	\
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index be905d4fdb47..062f3be0bb4f 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1189,6 +1189,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 	struct nvme_rdma_ctrl *ctrl = container_of(work,
 			struct nvme_rdma_ctrl, err_work);
 
+	nvme_mpath_stop(&ctrl->ctrl);
 	nvme_stop_keep_alive(&ctrl->ctrl);
 	nvme_rdma_teardown_io_queues(ctrl, false);
 	nvme_start_queues(&ctrl->ctrl);
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a0f00cb8f9f3..ac9212a2de59 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2054,6 +2054,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
 				struct nvme_tcp_ctrl, err_work);
 	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
 
+	nvme_mpath_stop(ctrl);
 	nvme_stop_keep_alive(ctrl);
 	nvme_tcp_teardown_io_queues(ctrl, false);
 	/* unquiesce to fail fast pending requests */
-- 
2.31.1



* Re: [PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-22 15:22 [PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue mwilck
@ 2021-04-23  8:35 ` Chao Leng
  2021-04-23  9:33   ` Martin Wilck
  2021-04-23 14:07 ` Ewan D. Milne
  1 sibling, 1 reply; 4+ messages in thread
From: Chao Leng @ 2021-04-23  8:35 UTC (permalink / raw)
  To: mwilck, Keith Busch, Sagi Grimberg, Christoph Hellwig
  Cc: Hannes Reinecke, Daniel Wagner, linux-nvme



On 2021/4/22 23:22, mwilck@suse.com wrote:
> From: Martin Wilck <mwilck@suse.com>
> 
> We have observed a few crashes in run_timer_softirq(), where a broken
> timer_list struct belonging to an anatt_timer was encountered. The broken
> structures look like this, and we actually see multiple of them attached to
> the same timer base:
> 
> crash> struct timer_list 0xffff92471bcfdc90
> struct timer_list {
>    entry = {
>      next = 0xdead000000000122,  // LIST_POISON2
>      pprev = 0x0
>    },
>    expires = 4296022933,
>    function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>    flags = 20
> }
> 
> If such a timer is encountered in run_timer_softirq(), the kernel
> crashes. The test scenario was an I/O load test with lots of NVMe
> controllers, some of which were removed and re-added on the storage side.
> 
> I think this may happen if the rdma recovery_work starts, in this call
> chain:
> 
> nvme_rdma_error_recovery_work()
>    /* this stops all sorts of activity for the controller, but not
>       the multipath-related work queue and timer */
>    nvme_rdma_reconnect_or_remove(ctrl)
>      => kicks reconnect_work
> 
> work queue: reconnect_work
> 
> nvme_rdma_reconnect_ctrl_work()
>    nvme_rdma_setup_ctrl()
>      nvme_rdma_configure_admin_queue()
>         nvme_init_identify()
>            nvme_mpath_init()
> 	     # this sets some fields of the timer_list without taking a lock
>               timer_setup()
>               nvme_read_ana_log()
> 	         mod_timer() or del_timer_sync()
> 
> Similar for TCP. The idea for the patch is based on the observation that
> nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(),
> whereas nvme_rdma_error_recovery_work() stops only the keepalive timer, but
> not the anatt timer.
> 
> I admit that the root cause analysis isn't rock solid yet. In particular, I
> can't explain why we see LIST_POISON2 in the "next" pointer, which would
> indicate that the timer has been detached before; yet we find it linked to
> the timer base when the crash occurs.
> 
> OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above) and
> nvme_mpath_stop(), so the hypothesis that modifying active timers may cause
> the issue isn't too far-fetched. I suspect that the LIST_POISON2 may
> come to pass in multiple steps.
> 
> If anyone has better ideas, please advise. The issue occurs very
> sporadically; verifying this by testing will be difficult.
> 
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>   drivers/nvme/host/multipath.c | 1 +
>   drivers/nvme/host/rdma.c      | 1 +
>   drivers/nvme/host/tcp.c       | 1 +
>   3 files changed, 3 insertions(+)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index a1d476e1ac02..c63dd5dfa7ff 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -586,6 +586,7 @@ void nvme_mpath_stop(struct nvme_ctrl *ctrl)
>   	del_timer_sync(&ctrl->anatt_timer);
>   	cancel_work_sync(&ctrl->ana_work);
>   }
> +EXPORT_SYMBOL_GPL(nvme_mpath_stop);
>   
>   #define SUBSYS_ATTR_RW(_name, _mode, _show, _store)  \
>   	struct device_attribute subsys_attr_##_name =	\
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index be905d4fdb47..062f3be0bb4f 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1189,6 +1189,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
>   	struct nvme_rdma_ctrl *ctrl = container_of(work,
>   			struct nvme_rdma_ctrl, err_work);
>   
> +	nvme_mpath_stop(&ctrl->ctrl);
If ana_work is running, this may cause a long wait, because
nvme_get_log() may time out (default 60s). With multipathing, that
would delay failover and stall service for a long time, which is not
what we expect.
We only need to do this before reconnecting, so move it down to just
before the call to nvme_rdma_reconnect_or_remove().
Like this:
+	nvme_mpath_stop(&ctrl->ctrl);
	nvme_rdma_reconnect_or_remove(ctrl);
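
i.e., roughly (sketch; the surrounding context is paraphrased from the
tail of nvme_rdma_error_recovery_work(), not a verbatim diff):

	if (!nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_CONNECTING)) {
		/* state change failure is ok if we started ctrl delete */
		return;
	}

+	nvme_mpath_stop(&ctrl->ctrl);
	nvme_rdma_reconnect_or_remove(ctrl);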

>   	nvme_stop_keep_alive(&ctrl->ctrl);
>   	nvme_rdma_teardown_io_queues(ctrl, false);
>   	nvme_start_queues(&ctrl->ctrl);
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index a0f00cb8f9f3..ac9212a2de59 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2054,6 +2054,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
>   				struct nvme_tcp_ctrl, err_work);
>   	struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
>   
> +	nvme_mpath_stop(ctrl);
>   	nvme_stop_keep_alive(ctrl);
>   	nvme_tcp_teardown_io_queues(ctrl, false);
>   	/* unquiesce to fail fast pending requests */
> 


* Re: [PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-23  8:35 ` Chao Leng
@ 2021-04-23  9:33   ` Martin Wilck
  0 siblings, 0 replies; 4+ messages in thread
From: Martin Wilck @ 2021-04-23  9:33 UTC (permalink / raw)
  To: Chao Leng, Keith Busch, Sagi Grimberg, Christoph Hellwig
  Cc: Hannes Reinecke, Daniel Wagner, linux-nvme

On Fri, 2021-04-23 at 16:35 +0800, Chao Leng wrote:
> 
> If ana_work is running, this may cause a long wait, because
> nvme_get_log() may time out (default 60s). With multipathing, that
> would delay failover and stall service for a long time, which is not
> what we expect.
> We only need to do this before reconnecting, so move it down to just
> before the call to nvme_rdma_reconnect_or_remove().
> Like this:
> ...

Thanks, that makes sense. I'll look into it.

Martin




* Re: [PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-22 15:22 [PATCH] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue mwilck
  2021-04-23  8:35 ` Chao Leng
@ 2021-04-23 14:07 ` Ewan D. Milne
  1 sibling, 0 replies; 4+ messages in thread
From: Ewan D. Milne @ 2021-04-23 14:07 UTC (permalink / raw)
  To: linux-nvme

On Thu, 2021-04-22 at 17:22 +0200, mwilck@suse.com wrote:
> From: Martin Wilck <mwilck@suse.com>
> 
> We have observed a few crashes in run_timer_softirq(), where a broken
> timer_list struct belonging to an anatt_timer was encountered. The broken
> structures look like this, and we actually see multiple of them attached to
> the same timer base:
> 
> crash> struct timer_list 0xffff92471bcfdc90
> struct timer_list {
>   entry = {
>     next = 0xdead000000000122,  // LIST_POISON2
>     pprev = 0x0
>   },
>   expires = 4296022933,
>   function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>   flags = 20
> }
> 
> If such a timer is encountered in run_timer_softirq(), the kernel
> crashes. The test scenario was an I/O load test with lots of NVMe
> controllers, some of which were removed and re-added on the storage side.
> 
> I think this may happen if the rdma recovery_work starts, in this call
> chain:
> 
> nvme_rdma_error_recovery_work()
>   /* this stops all sorts of activity for the controller, but not
>      the multipath-related work queue and timer */
>   nvme_rdma_reconnect_or_remove(ctrl)
>     => kicks reconnect_work
> 
> work queue: reconnect_work
> 
> nvme_rdma_reconnect_ctrl_work()
>   nvme_rdma_setup_ctrl()
>     nvme_rdma_configure_admin_queue()
>        nvme_init_identify()
>           nvme_mpath_init()
>              # this sets some fields of the timer_list without taking a lock
>              timer_setup()
>              nvme_read_ana_log()
>                  mod_timer() or del_timer_sync()
> 
> Similar for TCP. The idea for the patch is based on the observation that
> nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(),
> whereas nvme_rdma_error_recovery_work() stops only the keepalive timer, but
> not the anatt timer.
> 
> I admit that the root cause analysis isn't rock solid yet. In particular, I
> can't explain why we see LIST_POISON2 in the "next" pointer, which would
> indicate that the timer has been detached before; yet we find it linked to
> the timer base when the crash occurs.
> 
> OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above) and
> nvme_mpath_stop(), so the hypothesis that modifying active timers may cause
> the issue isn't too far-fetched. I suspect that the LIST_POISON2 may
> come to pass in multiple steps.
> 
> If anyone has better ideas, please advise. The issue occurs very
> sporadically; verifying this by testing will be difficult.
> 

We have a couple of reports of this same crash, one when traversing the
timer list and one when removing the timer. Note LIST_POISON2 in RAX.
I was also thinking that stopping the anatt timer would resolve it.

It's somewhat hard to reproduce, so it may take a while to verify,
but we'll test this patch out.

-Ewan

crash> bt
PID: 0      TASK: ffffffffbae12780  CPU: 0   COMMAND: "swapper/0"
 #0 [ffff922477a03cd8] machine_kexec at ffffffffb9c5bf3e
 #1 [ffff922477a03d30] __crash_kexec at ffffffffb9d6072d
 #2 [ffff922477a03df8] crash_kexec at ffffffffb9d6160d
 #3 [ffff922477a03e10] oops_end at ffffffffb9c22d4d
 #4 [ffff922477a03e30] general_protection at ffffffffba6011fe
    [exception RIP: run_timer_softirq+294]
    RIP: ffffffffb9d3ea36  RSP: ffff922477a03ee0  RFLAGS: 00010086
    RAX: dead000000000200  RBX: ffff922477a1aa80  RCX: 0000000000000100
    RDX: ffff922477a03ef0  RSI: 0000000000000002  RDI: ffff922477a1aa80
    RBP: 0000000000000001   R8: ffffffffbae12780   R9: 0000000000000000
    R10: 0000000000000310  R11: 000000000b5ad84c  R12: 000000011dd60000
    R13: ffff922477a03ef0  R14: ffff922472de0f80  R15: ffffffffbae05100
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffff922477a03f68] __softirqentry_text_start at ffffffffba8000e4
 #6 [ffff922477a03fc8] irq_exit at ffffffffb9cbc1d7
 #7 [ffff922477a03fd8] smp_apic_timer_interrupt at ffffffffba6027e4
 #8 [ffff922477a03ff0] apic_timer_interrupt at ffffffffba601d6f
--- <IRQ stack> ---
 #9 [ffffffffbae03dd8] apic_timer_interrupt at ffffffffba601d6f
    [exception RIP: native_safe_halt+14]
    RIP: ffffffffba4d667e  RSP: ffffffffbae03e80  RFLAGS: 00000246
    RAX: ffffffffba4d62d0  RBX: 0000000000000000  RCX: 00000000000f4240
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 0000000000000082
    RBP: 0000000000000000   R8: 000266f562c6c886   R9: 0000000000000001
    R10: 000000000000031f  R11: ffff9224722cb400  R12: ffffffffffffffff
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: ffffffffffffff13  CS: 0010  SS: 0018
#10 [ffffffffbae03e80] default_idle at ffffffffba4d62ec
#11 [ffffffffbae03ea8] do_idle at ffffffffb9cebbb7
#12 [ffffffffbae03ef0] cpu_startup_entry at ffffffffb9cebe0f
#13 [ffffffffbae03f10] start_kernel at ffffffffbb6b21e5
#14 [ffffffffbae03f50] secondary_startup_64 at ffffffffb9c000e7

(different crash)


PID: 69334  TASK: ffff8d02a652dd00  CPU: 0   COMMAND: "kworker/u2:2"
 #0 [ffffa658418cfb28] machine_kexec at ffffffffb565bf3e
 #1 [ffffa658418cfb80] __crash_kexec at ffffffffb576072d
 #2 [ffffa658418cfc48] crash_kexec at ffffffffb576160d
 #3 [ffffa658418cfc60] oops_end at ffffffffb5622d4d
 #4 [ffffa658418cfc80] general_protection at ffffffffb60011fe
    [exception RIP: detach_if_pending+58]
    RIP: ffffffffb573cdfa  RSP: ffffa658418cfd38  RFLAGS: 00010086
    RAX: dead000000000200  RBX: ffff8d02aac74f80  RCX: 0000000000000000
    RDX: ffffa65841d5bb68  RSI: ffff8d02afa1aa80  RDI: ffff8d02aac74f80
    RBP: ffff8d02aac74f80   R8: ffff8d02a652dd00   R9: 0000000000000000
    R10: 0000000002f41000  R11: 0000000000000000  R12: 0000000000000001
    R13: 0000000000000000  R14: ffff8d02a7621e80  R15: ffff8d02aac750a0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffa658418cfd50] try_to_del_timer_sync at ffffffffb573cf4f
 #6 [ffffa658418cfd78] del_timer_sync at ffffffffb573cfb5
 #7 [ffffa658418cfd88] nvme_read_ana_log at ffffffffc01df657 [nvme_core]
 #8 [ffffa658418cfdb8] nvme_mpath_init at ffffffffc01e0642 [nvme_core]
 #9 [ffffa658418cfdd0] nvme_init_identify at ffffffffc01ddba8 [nvme_core]
#10 [ffffa658418cfe38] nvme_tcp_setup_ctrl at ffffffffc040e2e9 [nvme_tcp]
#11 [ffffa658418cfe80] nvme_tcp_reconnect_ctrl_work at ffffffffc040e4bf [nvme_tcp]
#12 [ffffa658418cfe98] process_one_work at ffffffffb56d3477
#13 [ffffa658418cfed8] worker_thread at ffffffffb56d3b40
#14 [ffffa658418cff10] kthread at ffffffffb56d9502
#15 [ffffa658418cff50] ret_from_fork at ffffffffb6000255



