* [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
@ 2021-04-23 13:38 mwilck
       [not found] ` <CAFL455k3aBLcZrZPq=Q-9aws4UesstA5gSOr_E7mEFrLT+KbKw@mail.gmail.com>
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: mwilck @ 2021-04-23 13:38 UTC (permalink / raw)
  To: Keith Busch, Sagi Grimberg, Christoph Hellwig, Chao Leng
  Cc: Hannes Reinecke, Daniel Wagner, linux-nvme, Martin Wilck

From: Martin Wilck <mwilck@suse.com>

We have observed a few crashes in run_timer_softirq(), where a broken
timer_list struct belonging to an anatt_timer was encountered. The broken
structures look like this, and we actually see multiple of them attached to
the same timer base:

crash> struct timer_list 0xffff92471bcfdc90
struct timer_list {
  entry = {
    next = 0xdead000000000122,  // LIST_POISON2
    pprev = 0x0
  },
  expires = 4296022933,
  function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
  flags = 20
}

If such a timer is encountered in run_timer_softirq(), the kernel
crashes. The test scenario was an I/O load test with lots of NVMe
controllers, some of which were removed and re-added on the storage side.

I think this may happen if the rdma recovery_work starts, in this call
chain:

nvme_rdma_error_recovery_work()
  /* this stops all sorts of activity for the controller, but not
     the multipath-related work queue and timer */
  nvme_rdma_reconnect_or_remove(ctrl)
    => kicks reconnect_work

work queue: reconnect_work

nvme_rdma_reconnect_ctrl_work()
  nvme_rdma_setup_ctrl()
    nvme_rdma_configure_admin_queue()
       nvme_init_identify()
          nvme_mpath_init()
	     # this sets some fields of the timer_list without taking a lock
             timer_setup()
             nvme_read_ana_log()
	         mod_timer() or del_timer_sync()

Similar for TCP. The idea for the patch is based on the observation that
nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(),
whereas nvme_rdma_error_recovery_work() stops only the keepalive timer, but
not the anatt timer.

I admit that the root cause analysis isn't rock solid yet. In particular, I
can't explain why we see LIST_POISON2 in the "next" pointer, which would
indicate that the timer has been detached before; yet we find it linked to
the timer base when the crash occurs.

OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above) and
nvme_mpath_stop(), so the hypothesis that modifying active timers may cause
the issue isn't totally far-fetched. I suspect that the LIST_POISON2 may
come to pass in multiple steps.
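
For illustration, here is a minimal sketch of the suspected sequence (not
the actual driver code; expressions and arguments are simplified):

    /* ana_work of the previous connection has armed the timer */
    mod_timer(&ctrl->anatt_timer, jiffies + anatt * HZ);

    /* error recovery -> reconnect re-runs nvme_mpath_init() */
    timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
    /*  ^ overwrites fields of a possibly still pending timer,
     *    without a prior del_timer_sync() */
    nvme_read_ana_log(ctrl);
    /*  ^ may call mod_timer() on the re-initialized timer, so the same
     *    timer_list can end up enqueued on a timer base a second time */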

If anyone has better ideas, please advise. The issue occurs very
sporadically; verifying this by testing will be difficult.

Signed-off-by: Martin Wilck <mwilck@suse.com>

---
Changes in v2: Moved call to nvme_mpath_stop() further down, directly before
the call of nvme_rdma_reconnect_or_remove() (Chao Leng)
---
 drivers/nvme/host/multipath.c | 1 +
 drivers/nvme/host/rdma.c      | 1 +
 drivers/nvme/host/tcp.c       | 1 +
 3 files changed, 3 insertions(+)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac02..c63dd5dfa7ff 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -586,6 +586,7 @@ void nvme_mpath_stop(struct nvme_ctrl *ctrl)
 	del_timer_sync(&ctrl->anatt_timer);
 	cancel_work_sync(&ctrl->ana_work);
 }
+EXPORT_SYMBOL_GPL(nvme_mpath_stop);
 
 #define SUBSYS_ATTR_RW(_name, _mode, _show, _store)  \
 	struct device_attribute subsys_attr_##_name =	\
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index be905d4fdb47..fc07a7b0dc1d 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1202,6 +1202,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 		return;
 	}
 
+	nvme_mpath_stop(&ctrl->ctrl);
 	nvme_rdma_reconnect_or_remove(ctrl);
 }
 
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a0f00cb8f9f3..46287b4f4d10 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2068,6 +2068,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
 		return;
 	}
 
+	nvme_mpath_stop(ctrl);
 	nvme_tcp_reconnect_or_remove(ctrl);
 }
 
-- 
2.31.1



* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
       [not found] ` <CAFL455k3aBLcZrZPq=Q-9aws4UesstA5gSOr_E7mEFrLT+KbKw@mail.gmail.com>
@ 2021-04-23 16:43   ` Martin Wilck
  2021-04-23 17:09     ` Martin Wilck
  0 siblings, 1 reply; 16+ messages in thread
From: Martin Wilck @ 2021-04-23 16:43 UTC (permalink / raw)
  To: Maurizio Lombardi
  Cc: Keith Busch, Sagi Grimberg, Christoph Hellwig, Chao Leng,
	Hannes Reinecke, Daniel Wagner, linux-nvme

On Fri, 2021-04-23 at 17:50 +0200, Maurizio Lombardi wrote:
> Hello,
> 
> On Fri, 23 Apr 2021 at 15:49, <mwilck@suse.com> wrote:
> > From: Martin Wilck <mwilck@suse.com>
> > 
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index a0f00cb8f9f3..46287b4f4d10 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -2068,6 +2068,7 @@ static void
> > nvme_tcp_error_recovery_work(struct
> > work_struct *work)
> >                 return;
> >         }
> > 
> > +       nvme_mpath_stop(ctrl);
> >         nvme_tcp_reconnect_or_remove(ctrl);
> >  }
> > 
> 
> 
> Do you think it would make sense to call nvme_mpath_stop()
> inside nvme_tcp_reconnect_ctrl_work() too? 
> Right before it requeues itself by calling
> nvme_tcp_reconnect_or_remove(ctrl)?

Good point, probably yes. I'll have another look.

Thanks,
Martin




* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-23 16:43   ` Martin Wilck
@ 2021-04-23 17:09     ` Martin Wilck
  0 siblings, 0 replies; 16+ messages in thread
From: Martin Wilck @ 2021-04-23 17:09 UTC (permalink / raw)
  To: Maurizio Lombardi
  Cc: Keith Busch, Sagi Grimberg, Christoph Hellwig, Chao Leng,
	Hannes Reinecke, Daniel Wagner, linux-nvme

On Fri, 2021-04-23 at 18:43 +0200, Martin Wilck wrote:
> On Fri, 2021-04-23 at 17:50 +0200, Maurizio Lombardi wrote:
> > Hello,
> > 
> > On Fri, 23 Apr 2021 at 15:49, <mwilck@suse.com> wrote:
> > > From: Martin Wilck <mwilck@suse.com>
> > > 
> > > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > > index a0f00cb8f9f3..46287b4f4d10 100644
> > > --- a/drivers/nvme/host/tcp.c
> > > +++ b/drivers/nvme/host/tcp.c
> > > @@ -2068,6 +2068,7 @@ static void
> > > nvme_tcp_error_recovery_work(struct
> > > work_struct *work)
> > >                 return;
> > >         }
> > > 
> > > +       nvme_mpath_stop(ctrl);
> > >         nvme_tcp_reconnect_or_remove(ctrl);
> > >  }
> > > 
> > 
> > 
> > Do you think it would make sense to call nvme_mpath_stop()
> > inside nvme_tcp_reconnect_ctrl_work() too? 
> > Right before it requeues itself by calling
> > nvme_tcp_reconnect_or_remove(ctrl)?
> 
> Good point, probably yes. I'll have another look.

I did that now, and I don't think it's necessary. With my patch,
nvme_mpath_stop() will be called before the first invocation of
nvme_tcp_reconnect_ctrl_work(). The anatt timer and mpath work queue
will only be started if nvme_tcp_setup_ctrl() succeeds, so we don't
need to stop them again when it fails.

Regards
Martin




* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-23 13:38 [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue mwilck
       [not found] ` <CAFL455k3aBLcZrZPq=Q-9aws4UesstA5gSOr_E7mEFrLT+KbKw@mail.gmail.com>
@ 2021-04-24  0:21 ` Sagi Grimberg
  2021-04-26 14:51   ` Christoph Hellwig
  2021-04-27  9:04   ` Martin Wilck
  2021-04-25  1:07 ` Chao Leng
  2021-04-25 11:34 ` Hannes Reinecke
  3 siblings, 2 replies; 16+ messages in thread
From: Sagi Grimberg @ 2021-04-24  0:21 UTC (permalink / raw)
  To: mwilck, Keith Busch, Christoph Hellwig, Chao Leng
  Cc: Hannes Reinecke, Daniel Wagner, linux-nvme


> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index be905d4fdb47..fc07a7b0dc1d 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1202,6 +1202,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
>   		return;
>   	}
>   
> +	nvme_mpath_stop(&ctrl->ctrl);
>   	nvme_rdma_reconnect_or_remove(ctrl);

It's pretty annoying to have to needlessly wait for the ana log page
request to time out... But this is also needed because we init ana_lock
in nvme_mpath_init while it can potentially be taken in ana_work, which
is a precipice for bad things...
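
Roughly (illustrative only):

    /* in nvme_mpath_init(), also run on the reconnect path */
    mutex_init(&ctrl->ana_lock);
    /* re-initializes a lock that ana_work from the previous connection
     * may still be holding or about to take */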

So...

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

And this needs to go to stable. I don't see a definitive offending
commit here, so maybe just CC stable and have the title be
more descriptive, saying that it fixes a possible list-poison?

Something like:
nvme: rdma/tcp: fix a possible list corruption when reconnect work is 
racing with anatt timer



* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-23 13:38 [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue mwilck
       [not found] ` <CAFL455k3aBLcZrZPq=Q-9aws4UesstA5gSOr_E7mEFrLT+KbKw@mail.gmail.com>
  2021-04-24  0:21 ` Sagi Grimberg
@ 2021-04-25  1:07 ` Chao Leng
  2021-04-25 11:34 ` Hannes Reinecke
  3 siblings, 0 replies; 16+ messages in thread
From: Chao Leng @ 2021-04-25  1:07 UTC (permalink / raw)
  To: mwilck, Keith Busch, Sagi Grimberg, Christoph Hellwig
  Cc: Hannes Reinecke, Daniel Wagner, linux-nvme

Looks good.

Reviewed-by: Chao Leng <lengchao@huawei.com>

On 2021/4/23 21:38, mwilck@suse.com wrote:
> From: Martin Wilck <mwilck@suse.com>
> 
> We have observed a few crashes run_timer_softirq(), where a broken
> timer_list struct belonging to an anatt_timer was encountered. The broken
> structures look like this, and we see actually multiple ones attached to
> the same timer base:
> 
> crash> struct timer_list 0xffff92471bcfdc90
> struct timer_list {
>    entry = {
>      next = 0xdead000000000122,  // LIST_POISON2
>      pprev = 0x0
>    },
>    expires = 4296022933,
>    function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>    flags = 20
> }
> 
> If such a timer is encountered in run_timer_softirq(), the kernel
> crashes. The test scenario was an I/O load test with lots of NVMe
> controllers, some of which were removed and re-added on the storage side.
> 
> I think this may happen if the rdma recovery_work starts, in this call
> chain:
> 
> nvme_rdma_error_recovery_work()
>    /* this stops all sorts of activity for the controller, but not
>       the multipath-related work queue and timer */
>    nvme_rdma_reconnect_or_remove(ctrl)
>      => kicks reconnect_work
> 
> work queue: reconnect_work
> 
> nvme_rdma_reconnect_ctrl_work()
>    nvme_rdma_setup_ctrl()
>      nvme_rdma_configure_admin_queue()
>         nvme_init_identify()
>            nvme_mpath_init()
> 	     # this sets some fields of the timer_list without taking a lock
>               timer_setup()
>               nvme_read_ana_log()
> 	         mod_timer() or del_timer_sync()
> 
> Similar for TCP. The idea for the patch is based on the observation that
> nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(),
> whereas nvme_rdma_error_recovery_work() stops only the keepalive timer, but
> not the anatt timer.
> 
> I admit that the root cause analysis isn't rock solid yet. In particular, I
> can't explain why we see LIST_POISON2 in the "next" pointer, which would
> indicate that the timer has been detached before; yet we find it linked to
> the timer base when the crash occurs.
> 
> OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above) and
> nvme_mpath_stop(), so the hypothesis that modifying active timers may cause
> the issue isn't totally out of sight. I suspect that the LIST_POISON2 may
> come to pass in multiple steps.
> 
> If anyone has better ideas, please advise. The issue occurs very
> sporadically; verifying this by testing will be difficult.
> 
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> 
> ----
> Changes in v2: Moved call to nvme_mpath_stop() further down, directly before
> the call of nvme_rdma_reconnect_or_remove() (Chao Leng)
> ---
>   drivers/nvme/host/multipath.c | 1 +
>   drivers/nvme/host/rdma.c      | 1 +
>   drivers/nvme/host/tcp.c       | 1 +
>   3 files changed, 3 insertions(+)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index a1d476e1ac02..c63dd5dfa7ff 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -586,6 +586,7 @@ void nvme_mpath_stop(struct nvme_ctrl *ctrl)
>   	del_timer_sync(&ctrl->anatt_timer);
>   	cancel_work_sync(&ctrl->ana_work);
>   }
> +EXPORT_SYMBOL_GPL(nvme_mpath_stop);
>   
>   #define SUBSYS_ATTR_RW(_name, _mode, _show, _store)  \
>   	struct device_attribute subsys_attr_##_name =	\
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index be905d4fdb47..fc07a7b0dc1d 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -1202,6 +1202,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
>   		return;
>   	}
>   
> +	nvme_mpath_stop(&ctrl->ctrl);
>   	nvme_rdma_reconnect_or_remove(ctrl);
>   }
>   
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index a0f00cb8f9f3..46287b4f4d10 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2068,6 +2068,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
>   		return;
>   	}
>   
> +	nvme_mpath_stop(ctrl);
>   	nvme_tcp_reconnect_or_remove(ctrl);
>   }
>   
> 


* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-23 13:38 [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue mwilck
                   ` (2 preceding siblings ...)
  2021-04-25  1:07 ` Chao Leng
@ 2021-04-25 11:34 ` Hannes Reinecke
  2021-04-26  2:31   ` Chao Leng
  2021-04-26  9:34   ` Martin Wilck
  3 siblings, 2 replies; 16+ messages in thread
From: Hannes Reinecke @ 2021-04-25 11:34 UTC (permalink / raw)
  To: mwilck, Keith Busch, Sagi Grimberg, Christoph Hellwig, Chao Leng
  Cc: Daniel Wagner, linux-nvme

On 4/23/21 3:38 PM, mwilck@suse.com wrote:
> From: Martin Wilck <mwilck@suse.com>
> 
> We have observed a few crashes run_timer_softirq(), where a broken
> timer_list struct belonging to an anatt_timer was encountered. The broken
> structures look like this, and we see actually multiple ones attached to
> the same timer base:
> 
> crash> struct timer_list 0xffff92471bcfdc90
> struct timer_list {
>    entry = {
>      next = 0xdead000000000122,  // LIST_POISON2
>      pprev = 0x0
>    },
>    expires = 4296022933,
>    function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>    flags = 20
> }
> 
> If such a timer is encountered in run_timer_softirq(), the kernel
> crashes. The test scenario was an I/O load test with lots of NVMe
> controllers, some of which were removed and re-added on the storage side.
> 
...

But isn't this the result of detach_timer()? I.e. this suspiciously looks 
like perfectly normal operation; if you look at expire_timers(), we're 
first calling 'detach_timer()' before calling the timer function, i.e. 
every crash in the timer function would have this signature.
And, incidentally, so would any timer function which does not crash.

Sorry to kill your analysis ...

This doesn't mean that the patch isn't valid (in the sense that it 
resolves the issue), but we definitely will need to work on root cause 
analysis.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-25 11:34 ` Hannes Reinecke
@ 2021-04-26  2:31   ` Chao Leng
  2021-04-26 15:18     ` Martin Wilck
  2021-04-26  9:34   ` Martin Wilck
  1 sibling, 1 reply; 16+ messages in thread
From: Chao Leng @ 2021-04-26  2:31 UTC (permalink / raw)
  To: Hannes Reinecke, mwilck, Keith Busch, Sagi Grimberg, Christoph Hellwig
  Cc: Daniel Wagner, linux-nvme



On 2021/4/25 19:34, Hannes Reinecke wrote:
> On 4/23/21 3:38 PM, mwilck@suse.com wrote:
>> From: Martin Wilck <mwilck@suse.com>
>>
>> We have observed a few crashes run_timer_softirq(), where a broken
>> timer_list struct belonging to an anatt_timer was encountered. The broken
>> structures look like this, and we see actually multiple ones attached to
>> the same timer base:
>>
>> crash> struct timer_list 0xffff92471bcfdc90
>> struct timer_list {
>>    entry = {
>>      next = 0xdead000000000122,  // LIST_POISON2
>>      pprev = 0x0
>>    },
>>    expires = 4296022933,
>>    function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>>    flags = 20
>> }
>>
>> If such a timer is encountered in run_timer_softirq(), the kernel
>> crashes. The test scenario was an I/O load test with lots of NVMe
>> controllers, some of which were removed and re-added on the storage side.
>>
> ...
> 
> But isn't this the result of detach_timer()? IE this suspiciously looks like perfectly normal operation; is you look at expire_timers() we're first calling 'detach_timer()' before calling the timer function, ie every crash in the timer function would have this signature.
> And, incidentally, so would any timer function which does not crash.
> 
> Sorry to kill your analysis ...
> 
> This doesn't mean that the patch isn't valid (in the sense that it resolve the issue), but we definitely will need to work on root cause analysis.
The process may be: 1. ana_work adds the timer; 2. error recovery occurs;
during reconnect, the timer is reinitialized and nvme_read_ana_log() is
called, which may add the timer again.
The same timer is added twice, and the crash happens later.

Indeed, ana_log_buf has a similar bug; we encountered it in our testing.
To fix that bug, I made the same patch and tested it for more than 2 weeks.

This patch can fix both bugs.
> 
> Cheera,
> 
> Hannes


* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-25 11:34 ` Hannes Reinecke
  2021-04-26  2:31   ` Chao Leng
@ 2021-04-26  9:34   ` Martin Wilck
  2021-04-26 10:06     ` Hannes Reinecke
  1 sibling, 1 reply; 16+ messages in thread
From: Martin Wilck @ 2021-04-26  9:34 UTC (permalink / raw)
  To: Hannes Reinecke, Keith Busch, Sagi Grimberg, Christoph Hellwig,
	Chao Leng
  Cc: Daniel Wagner, linux-nvme

On Sun, 2021-04-25 at 13:34 +0200, Hannes Reinecke wrote:
> On 4/23/21 3:38 PM, mwilck@suse.com wrote:
> > From: Martin Wilck <mwilck@suse.com>
> > 
> > We have observed a few crashes run_timer_softirq(), where a broken
> > timer_list struct belonging to an anatt_timer was encountered. The
> > broken
> > structures look like this, and we see actually multiple ones attached
> > to
> > the same timer base:
> > 
> > crash> struct timer_list 0xffff92471bcfdc90
> > struct timer_list {
> >    entry = {
> >      next = 0xdead000000000122,  // LIST_POISON2
> >      pprev = 0x0
> >    },
> >    expires = 4296022933,
> >    function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
> >    flags = 20
> > }
> > 
> > If such a timer is encountered in run_timer_softirq(), the kernel
> > crashes. The test scenario was an I/O load test with lots of NVMe
> > controllers, some of which were removed and re-added on the storage
> > side.
> > 
> ...
> 
> But isn't this the result of detach_timer()? IE this suspiciously looks
> like perfectly normal operation; is you look at expire_timers() we're
> first calling 'detach_timer()' before calling the timer function, ie 
> every crash in the timer function would have this signature.
> And, incidentally, so would any timer function which does not crash.
> 
> Sorry to kill your analysis ...

No problem, I realized this myself, and actually mentioned it in the
commit description. OTOH, this timer is only modified in very few
places, and all but nvme_mpath_init() use the proper APIs for modifying
or deleting timers, so the initialization of a (possibly still) running
timer is the only suspect, afaics.

My personal theory is that the corruption might happen in several
steps, the first step being timer_setup() modifying fields of a pending
timer. But I couldn't figure it out completely, and found it too hand-
waving to mention in the commit description.

> This doesn't mean that the patch isn't valid (in the sense that it 
> resolve the issue), but we definitely will need to work on root cause
> analysis.

I'd be grateful for any help figuring out the missing bits.

Thanks,
Martin




* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-26  9:34   ` Martin Wilck
@ 2021-04-26 10:06     ` Hannes Reinecke
  0 siblings, 0 replies; 16+ messages in thread
From: Hannes Reinecke @ 2021-04-26 10:06 UTC (permalink / raw)
  To: Martin Wilck, Keith Busch, Sagi Grimberg, Christoph Hellwig, Chao Leng
  Cc: Daniel Wagner, linux-nvme

On 4/26/21 11:34 AM, Martin Wilck wrote:
> On Sun, 2021-04-25 at 13:34 +0200, Hannes Reinecke wrote:
>> On 4/23/21 3:38 PM, mwilck@suse.com wrote:
>>> From: Martin Wilck <mwilck@suse.com>
>>>
>>> We have observed a few crashes run_timer_softirq(), where a broken
>>> timer_list struct belonging to an anatt_timer was encountered. The
>>> broken
>>> structures look like this, and we see actually multiple ones attached
>>> to
>>> the same timer base:
>>>
>>> crash> struct timer_list 0xffff92471bcfdc90
>>> struct timer_list {
>>>    entry = {
>>>      next = 0xdead000000000122,  // LIST_POISON2
>>>      pprev = 0x0
>>>    },
>>>    expires = 4296022933,
>>>    function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
>>>    flags = 20
>>> }
>>>
>>> If such a timer is encountered in run_timer_softirq(), the kernel
>>> crashes. The test scenario was an I/O load test with lots of NVMe
>>> controllers, some of which were removed and re-added on the storage
>>> side.
>>>
>> ...
>>
>> But isn't this the result of detach_timer()? IE this suspiciously looks
>> like perfectly normal operation; is you look at expire_timers() we're
>> first calling 'detach_timer()' before calling the timer function, ie 
>> every crash in the timer function would have this signature.
>> And, incidentally, so would any timer function which does not crash.
>>
>> Sorry to kill your analysis ...
> 
> No problem, I realized this myself, and actually mentioned it in the
> commit description. OTOH, this timer is only modified in very few
> places, and all but nvme_mpath_init() use the proper APIs for modifying
> or deleting timers, so the initialization of a (possibly still) running
> timer is the only suspect, afaics.
> 
> My personal theory is that the corruption might happen in several
> steps, the first step being timer_setup() mofiying fields of a pending
> timer. But I couldn't figure it out completely, and found it too hand-
> waving to mention in the commit description.
> 
>> This doesn't mean that the patch isn't valid (in the sense that it 
>> resolve the issue), but we definitely will need to work on root cause
>> analysis.
> 
> I'd be grateful for any help figuring out the missing bits.
> 
No, the problem lies in the peculiarities of NVMe error recovery.

The crucial point here is that during error recovery all I/O is aborted,
but after that the queue is _NOT_ stopped.
So it's perfectly legal to send I/O even if the controller is not fully
LIVE. And if _this_ I/O fails then we'll find ourselves in lots of
interesting situations.

I guess the problem lies with nvme_init_identify().

We're calling nvme_mpath_init() (which will start the ANATT timer), but
issue some more I/O afterwards.
If any of _that_ I/O fails, an error is registered with the caller (e.g.
in nvme_tcp_configure_admin_queue()), _but_ the ANATT timer continues to
run. Then we'll be calling nvme_tcp_reconnect_or_remove() and delete the
controller, completely oblivious to the running ANATT timer, which will
get quite unhappy once it triggers, as the controller structure from
which the timer was started is done for.

So with this analysis, an alternative fix would be to move
nvme_mpath_init() right to the end of nvme_init_identify(), such that
we'll never get into the situation where nvme_init_identify() fails yet
the ANATT timer is running.
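
Roughly like this (just a sketch; the surrounding calls are placeholders,
not the actual code):

    static int nvme_init_identify(struct nvme_ctrl *ctrl)
    {
            ...
            /* issue all remaining identify / log page I/O first */
            ret = nvme_configure_something(ctrl);    /* placeholder */
            if (ret < 0)
                    return ret;
            ...
            /* only now set up ANA state and possibly arm the ANATT timer,
             * once nothing further in nvme_init_identify() can fail */
            return nvme_mpath_init(ctrl, id);
    }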

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		        Kernel Storage Architect
hare@suse.de			               +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)


* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-24  0:21 ` Sagi Grimberg
@ 2021-04-26 14:51   ` Christoph Hellwig
  2021-04-26 16:27     ` Martin Wilck
  2021-04-27  9:04   ` Martin Wilck
  1 sibling, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2021-04-26 14:51 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: mwilck, Keith Busch, Christoph Hellwig, Chao Leng,
	Hannes Reinecke, Daniel Wagner, linux-nvme

On Fri, Apr 23, 2021 at 05:21:03PM -0700, Sagi Grimberg wrote:
>
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>> index be905d4fdb47..fc07a7b0dc1d 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -1202,6 +1202,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
>>   		return;
>>   	}
>>   +	nvme_mpath_stop(&ctrl->ctrl);
>>   	nvme_rdma_reconnect_or_remove(ctrl);
>
> Its pretty annoying to have to needlessly wait for the ana log page
> request to timeout... But this is also needed because we init ana_lock
> in nvme_mpath_init while it can be potentially taken in ana_work, which
> is a precipice for bad things...

I also really hate open coding this mpath detail in the transport
drivers.  Didn't you at some point have a series to factor out more
common code from the whole reset and reconnect path?


* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-26  2:31   ` Chao Leng
@ 2021-04-26 15:18     ` Martin Wilck
  0 siblings, 0 replies; 16+ messages in thread
From: Martin Wilck @ 2021-04-26 15:18 UTC (permalink / raw)
  To: Chao Leng, Hannes Reinecke, Keith Busch, Sagi Grimberg,
	Christoph Hellwig
  Cc: Daniel Wagner, linux-nvme

On Mon, 2021-04-26 at 10:31 +0800, Chao Leng wrote:
> 
> 
> On 2021/4/25 19:34, Hannes Reinecke wrote:
> > On 4/23/21 3:38 PM, mwilck@suse.com wrote:
> > > 
> > > If such a timer is encountered in run_timer_softirq(), the kernel
> > > crashes. The test scenario was an I/O load test with lots of NVMe
> > > controllers, some of which were removed and re-added on the storage
> > > side.
> > > 
> > ...
> > 
> > But isn't this the result of detach_timer()? IE this suspiciously
> > looks like perfectly normal operation; is you look at expire_timers()
> > we're first calling 'detach_timer()' before calling the timer
> > function, ie every crash in the timer function would have this
> > signature.
> > And, incidentally, so would any timer function which does not crash.
> > 
> > Sorry to kill your analysis ...
> > 
> > This doesn't mean that the patch isn't valid (in the sense that it
> > resolve the issue), but we definitely will need to work on root cause
> > analysis.
> The process maybe:1.ana_work add the timer;2.error recovery occurs,
> in reconnecting, reinitialize the timer and call nvme_read_ana_log,
> nvme_read_ana_log may add the timer again.
> The same timer is added twice, crash will happens later.

Thanks a lot! I think you nailed it.

I had another look at my dump with this in mind, and indeed I was able
to prove that the timer of the same NVMe controller had been started
(at least) twice. This explains why it looks as if it had been cleanly
detached, and is still running nonetheless.

In particular, I saw this:

[ 4389.939928] nvme nvme36: queue_size 128 > ctrl sqsize 32, clamping down
[ 4389.962709] nvme nvme36: creating 4 I/O queues.
[ 4389.978462] nvme nvme36: starting error recovery
[ 4389.978509] nvme nvme36: failed to send request -32
[ 4389.978724] nvme nvme36: Failed to configure AEN (cfg 900)
[ 4389.978732] nvme nvme36: Successfully reconnected (1 attempt)
[ 4389.978739] nvme nvme36: failed to send request -32
[ 4389.991420] nvme nvme36: Reconnecting in 10 seconds...
...
[ 4441.399435] nvme nvme36: queue_size 128 > ctrl sqsize 32, clamping down
[ 4441.407647] nvme nvme36: creating 4 I/O queues.
[ 4441.445655] nvme nvme36: Successfully reconnected (1 attempt)
[ 4441.699665] nvme nvme36: starting error recovery
[ 4441.719493] nvme nvme36: Identify Descriptors failed (881)
[ 4441.735409] nvme nvme36: Reconnecting in 10 seconds...
...
[ 4510.891400] nvme nvme36: ANATT timeout, resetting controller.
...
[ 4517.035411] general protection fault: 0000 [#1] SMP PTI

The protection fault was caused by the anatt timer of another nvme
controller. But I could see the anatt_timer of nvme36 still linked to
an active timer_base, about to expire 44s in the future. Calculating
back, that timer had been started around 4441s above, whereas the timer
that expired a few seconds earlier must have been started at 4389.9.

Regards,
Martin




* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-26 14:51   ` Christoph Hellwig
@ 2021-04-26 16:27     ` Martin Wilck
  2021-04-27  1:45       ` Chao Leng
  0 siblings, 1 reply; 16+ messages in thread
From: Martin Wilck @ 2021-04-26 16:27 UTC (permalink / raw)
  To: Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Chao Leng, Hannes Reinecke, Daniel Wagner, linux-nvme

On Mon, 2021-04-26 at 16:51 +0200, Christoph Hellwig wrote:
> On Fri, Apr 23, 2021 at 05:21:03PM -0700, Sagi Grimberg wrote:
> > 
> > > diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> > > index be905d4fdb47..fc07a7b0dc1d 100644
> > > --- a/drivers/nvme/host/rdma.c
> > > +++ b/drivers/nvme/host/rdma.c
> > > @@ -1202,6 +1202,7 @@ static void
> > > nvme_rdma_error_recovery_work(struct work_struct *work)
> > >                 return;
> > >         }
> > >   +     nvme_mpath_stop(&ctrl->ctrl);
> > >         nvme_rdma_reconnect_or_remove(ctrl);
> > 
> > Its pretty annoying to have to needlessly wait for the ana log page
> > request to timeout... But this is also needed because we init
> > ana_lock
> > in nvme_mpath_init while it can be potentially taken in ana_work,
> > which
> > is a precipice for bad things...
> 
> I also really hate open coding this mpath detail in the transport
> drivers.  Didn't you at some point have a series to factor out more
> common code from the whole reset and reconnect path?
> 

So ... resubmit and cc stable as Sagi suggested, or work out something
else?

Martin




* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-26 16:27     ` Martin Wilck
@ 2021-04-27  1:45       ` Chao Leng
  2021-04-27  7:30         ` Martin Wilck
  0 siblings, 1 reply; 16+ messages in thread
From: Chao Leng @ 2021-04-27  1:45 UTC (permalink / raw)
  To: Martin Wilck, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Hannes Reinecke, Daniel Wagner, linux-nvme



On 2021/4/27 0:27, Martin Wilck wrote:
> On Mon, 2021-04-26 at 16:51 +0200, Christoph Hellwig wrote:
>> On Fri, Apr 23, 2021 at 05:21:03PM -0700, Sagi Grimberg wrote:
>>>
>>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>>> index be905d4fdb47..fc07a7b0dc1d 100644
>>>> --- a/drivers/nvme/host/rdma.c
>>>> +++ b/drivers/nvme/host/rdma.c
>>>> @@ -1202,6 +1202,7 @@ static void
>>>> nvme_rdma_error_recovery_work(struct work_struct *work)
>>>>                  return;
>>>>          }
>>>>    +     nvme_mpath_stop(&ctrl->ctrl);
>>>>          nvme_rdma_reconnect_or_remove(ctrl);
>>>
>>> Its pretty annoying to have to needlessly wait for the ana log page
>>> request to timeout... But this is also needed because we init
>>> ana_lock
>>> in nvme_mpath_init while it can be potentially taken in ana_work,
>>> which
>>> is a precipice for bad things...
>>
>> I also really hate open coding this mpath detail in the transport
>> drivers.  Didn't you at some point have a series to factor out more
>> common code from the whole reset and reconnect path?
>>
> 
> So ... resubmit and cc stable as Sagi suggested, or work out something
> else?
Another option: call nvme_mpath_stop() in nvme_mpath_init().
That way, the transport drivers would not need to care about mpath details.
> 
> Martin
> 
> 
> .
> 


* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-27  1:45       ` Chao Leng
@ 2021-04-27  7:30         ` Martin Wilck
  2021-04-27  8:56           ` Martin Wilck
  0 siblings, 1 reply; 16+ messages in thread
From: Martin Wilck @ 2021-04-27  7:30 UTC (permalink / raw)
  To: Chao Leng, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Hannes Reinecke, Daniel Wagner, linux-nvme

On Tue, 2021-04-27 at 09:45 +0800, Chao Leng wrote:
> 
> 
> On 2021/4/27 0:27, Martin Wilck wrote:
> > On Mon, 2021-04-26 at 16:51 +0200, Christoph Hellwig wrote:
> > > On Fri, Apr 23, 2021 at 05:21:03PM -0700, Sagi Grimberg wrote:
> > > > 
> > > > > diff --git a/drivers/nvme/host/rdma.c
> > > > > b/drivers/nvme/host/rdma.c
> > > > > index be905d4fdb47..fc07a7b0dc1d 100644
> > > > > --- a/drivers/nvme/host/rdma.c
> > > > > +++ b/drivers/nvme/host/rdma.c
> > > > > @@ -1202,6 +1202,7 @@ static void
> > > > > nvme_rdma_error_recovery_work(struct work_struct *work)
> > > > >                  return;
> > > > >          }
> > > > >    +     nvme_mpath_stop(&ctrl->ctrl);
> > > > >          nvme_rdma_reconnect_or_remove(ctrl);
> > > > 
> > > > Its pretty annoying to have to needlessly wait for the ana log
> > > > page
> > > > request to timeout... But this is also needed because we init
> > > > ana_lock
> > > > in nvme_mpath_init while it can be potentially taken in
> > > > ana_work,
> > > > which
> > > > is a precipice for bad things...
> > > 
> > > I also really hate open coding this mpath detail in the transport
> > > drivers.  Didn't you at some point have a series to factor out
> > > more
> > > common code from the whole reset and reconnect path?
> > > 
> > 
> > So ... resubmit and cc stable as Sagi suggested, or work out
> > something
> > else?
> Another option:call nvme_mpath_stop in nvme_mpath_init.
> In this way, transport drivers do not care mpath detail.

That would require that we determine in nvme_mpath_init() whether we're
called for the first time for the given controller, or whether the
controller is re-initializing. It's not obvious to me how to do that
reliably without introducing a new state flag.

Regards,
Martin




* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-27  7:30         ` Martin Wilck
@ 2021-04-27  8:56           ` Martin Wilck
  0 siblings, 0 replies; 16+ messages in thread
From: Martin Wilck @ 2021-04-27  8:56 UTC (permalink / raw)
  To: Chao Leng, Christoph Hellwig, Sagi Grimberg
  Cc: Keith Busch, Hannes Reinecke, Daniel Wagner, linux-nvme

On Tue, 2021-04-27 at 09:30 +0200, Martin Wilck wrote:
> On Tue, 2021-04-27 at 09:45 +0800, Chao Leng wrote:
> > > 
> > > So ... resubmit and cc stable as Sagi suggested, or work out
> > > something
> > > else?
> > Another option:call nvme_mpath_stop in nvme_mpath_init.
> > In this way, transport drivers do not care mpath detail.
> 
> That would require that we determine in nvme_mpath_init() whether
> we're
> called for the first time for the given controller, or whether the
> controller is re-initializing. It's not obvious to me how to do that
> reliably without introducing a new state flag.
> 

I'll submit a v3 with a changed subject, as suggested by Sagi. That will
fix the list corruption; we can figure out a better solution later.

Martin




* Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
  2021-04-24  0:21 ` Sagi Grimberg
  2021-04-26 14:51   ` Christoph Hellwig
@ 2021-04-27  9:04   ` Martin Wilck
  1 sibling, 0 replies; 16+ messages in thread
From: Martin Wilck @ 2021-04-27  9:04 UTC (permalink / raw)
  To: Sagi Grimberg, Keith Busch, Christoph Hellwig, Chao Leng
  Cc: Hannes Reinecke, Daniel Wagner, linux-nvme

On Fri, 2021-04-23 at 17:21 -0700, Sagi Grimberg wrote:
> 
> > diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> > index be905d4fdb47..fc07a7b0dc1d 100644
> > --- a/drivers/nvme/host/rdma.c
> > +++ b/drivers/nvme/host/rdma.c
> > @@ -1202,6 +1202,7 @@ static void
> > nvme_rdma_error_recovery_work(struct work_struct *work)
> >                 return;
> >         }
> >   
> > +       nvme_mpath_stop(&ctrl->ctrl);
> >         nvme_rdma_reconnect_or_remove(ctrl);
> 
> Its pretty annoying to have to needlessly wait for the ana log page
> request to timeout..
>  But this is also needed because we init ana_lock
> in nvme_mpath_init while it can be potentially taken in ana_work,
> which
> is a precipice for bad things...

What if we move the ana-related fields into a separately kmalloc'd
structure, rather than embedding it in nvme_ctrl? That way we could
just detach it when the controller is re-initialized, without needing
to wait. When the timeouts eventually occur, the worker func would
realize that it's detached and just do nothing.
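
Roughly along these lines (a hypothetical layout just to illustrate the
idea, ignoring allocation and lifetime details; the names are made up):

    struct nvme_ana_state {
            struct nvme_ctrl *ctrl;         /* NULLed when detached */
            struct mutex ana_lock;
            struct timer_list anatt_timer;
            struct work_struct ana_work;
            void *ana_log_buf;
    };

    static void nvme_anatt_timeout(struct timer_list *t)
    {
            struct nvme_ana_state *ana = from_timer(ana, t, anatt_timer);

            if (!READ_ONCE(ana->ctrl))      /* detached, nothing to do */
                    return;
            /* ... reset the controller as we do today ... */
    }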

Regards
Martin



