From: Martin Wilck <mwilck@suse.com>
To: Hannes Reinecke <hare@suse.de>, Keith Busch <kbusch@kernel.org>,
	Sagi Grimberg <sagi@grimberg.me>, Christoph Hellwig <hch@lst.de>,
	Chao Leng <lengchao@huawei.com>
Cc: Daniel Wagner <dwagner@suse.de>, linux-nvme@lists.infradead.org
Subject: Re: [PATCH v2] nvme: rdma/tcp: call nvme_mpath_stop() from reconnect workqueue
Date: Mon, 26 Apr 2021 11:34:14 +0200
Message-ID: <181b5a7c61e31abff2f4d0102281c0e0f96a7832.camel@suse.com>
In-Reply-To: <65167282-84e7-d08b-f97d-edb0d1372a49@suse.de>

On Sun, 2021-04-25 at 13:34 +0200, Hannes Reinecke wrote:
> On 4/23/21 3:38 PM, mwilck@suse.com wrote:
> > From: Martin Wilck <mwilck@suse.com>
> > 
> > We have observed a few crashes in run_timer_softirq(), where a broken
> > timer_list struct belonging to an anatt_timer was encountered. The
> > broken structures look like this, and we actually see multiple ones
> > attached to the same timer base:
> > 
> > crash> struct timer_list 0xffff92471bcfdc90
> > struct timer_list {
> >    entry = {
> >      next = 0xdead000000000122,  // LIST_POISON2
> >      pprev = 0x0
> >    },
> >    expires = 4296022933,
> >    function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
> >    flags = 20
> > }
> > 
> > If such a timer is encountered in run_timer_softirq(), the kernel
> > crashes. The test scenario was an I/O load test with lots of NVMe
> > controllers, some of which were removed and re-added on the storage
> > side.
> > 
> ...
> 
> But isn't this the result of detach_timer()? I.e. this suspiciously looks
> like perfectly normal operation; if you look at expire_timers() we're
> first calling 'detach_timer()' before calling the timer function, i.e.
> every crash in the timer function would have this signature.
> And, incidentally, so would any timer function which does not crash.
> 
> Sorry to kill your analysis ...

No problem, I realized this myself, and actually mentioned it in the
commit description. OTOH, this timer is only modified in very few
places, and all but nvme_mpath_init() use the proper APIs for modifying
or deleting timers, so the initialization of a (possibly still) running
timer is the only suspect, afaics.

My personal theory is that the corruption might happen in several
steps, the first step being timer_setup() modifying fields of a pending
timer. But I couldn't figure it out completely, and found it too
hand-waving to mention in the commit description.
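
To make the two effects discussed here concrete, below is a small userspace
model, not kernel code. It sketches (1) why a detached timer shows
next = LIST_POISON2 and pprev = NULL, and (2) what could plausibly happen if
a still-pending timer is initialized again. All model_* names are
hypothetical, and the sketch assumes that timer initialization clears
entry.pprev and that the "pending" check only tests entry.pprev. It does
not claim to reproduce the corruption seen in the crash dumps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Poison value as shown in the crash dump (assumes a 64-bit build). */
#define LIST_POISON2 ((struct hnode *)(uintptr_t)0xdead000000000122ULL)

/* Minimal hlist-style node and head, modeled after the kernel's layout. */
struct hnode { struct hnode *next, **pprev; };
struct hhead { struct hnode *first; };

/* Hypothetical stand-in for struct timer_list. */
struct model_timer {
	struct hnode entry;
	void (*function)(struct model_timer *);
};

static void model_noop(struct model_timer *t) { (void)t; }

static void model_hlist_add_head(struct hnode *n, struct hhead *h)
{
	n->next = h->first;
	if (h->first)
		h->first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* Assumption: initialization clears pprev without unlinking anything. */
static void model_timer_setup(struct model_timer *t,
			      void (*fn)(struct model_timer *))
{
	t->entry.pprev = NULL;
	t->function = fn;
}

/* Assumption: "pending" is decided by pprev alone. */
static bool model_timer_pending(const struct model_timer *t)
{
	return t->entry.pprev != NULL;
}

/* Unlink the node, then leave the signature seen in the dump. */
static void model_detach_timer(struct model_timer *t)
{
	*t->entry.pprev = t->entry.next;
	if (t->entry.next)
		t->entry.next->pprev = t->entry.pprev;
	t->entry.pprev = NULL;
	t->entry.next = LIST_POISON2;
}

/* Deletion is skipped when the timer does not look pending. */
static bool model_del_timer(struct model_timer *t)
{
	if (!model_timer_pending(t))
		return false;
	model_detach_timer(t);
	return true;
}

/* Case 1: a normally detached timer carries the "broken" signature. */
static bool detached_signature(void)
{
	struct hhead base = { NULL };
	struct model_timer t = { { NULL, NULL }, NULL };

	model_timer_setup(&t, model_noop);
	model_hlist_add_head(&t.entry, &base);
	model_detach_timer(&t);
	return t.entry.next == LIST_POISON2 && t.entry.pprev == NULL &&
	       base.first == NULL;
}

/* Case 2: re-init while pending; deletion is skipped, node stays linked. */
static bool reinit_leaves_timer_linked(void)
{
	struct hhead base = { NULL };
	struct model_timer t = { { NULL, NULL }, NULL };

	model_timer_setup(&t, model_noop);
	model_hlist_add_head(&t.entry, &base);
	model_timer_setup(&t, model_noop);	/* re-init while still queued */
	bool deleted = model_del_timer(&t);
	return !deleted && base.first == &t.entry;
}

static void run_model_checks(void)
{
	assert(detached_signature());
	assert(reinit_leaves_timer_linked());
}
```

Calling run_model_checks() exercises both cases. In the second case the
base still points at the timer even though deletion reported nothing to
do, which is the kind of first step the theory above has in mind; how
that would lead to the final state in the dump remains open.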

> This doesn't mean that the patch isn't valid (in the sense that it
> resolves the issue), but we definitely will need to work on root cause
> analysis.

I'd be grateful for any help figuring out the missing bits.

Thanks,
Martin


