From: Greg KH <gregkh@linuxfoundation.org>
To: "Pali Rohár" <pali@kernel.org>
Cc: linux-usb@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, "Marek Behún" <kabel@kernel.org>
Subject: Re: xhci_pci & PCIe hotplug crash
Date: Wed, 5 May 2021 14:40:55 +0200
Message-ID: <YJKSV9nvf0ipq7CJ@kroah.com>
In-Reply-To: <20210505123346.kxfpumww5i4qmhnk@pali>

On Wed, May 05, 2021 at 02:33:46PM +0200, Pali Rohár wrote:
> On Wednesday 05 May 2021 14:09:17 Greg KH wrote:
> > On Wed, May 05, 2021 at 02:01:17PM +0200, Pali Rohár wrote:
> > > Hello!
> > > 
> > > While debugging the pci-aardvark.c driver I got the following synchronous
> > > external abort 96000210, which I can reproduce with a VIA XHCI controller
> > > when PCIe hot plug support is enabled in the kernel and the PCIe Root Bridge
> > > triggers a link down event via the PCIe hot plug interrupt.
> > > 
> > > [   71.773033] pcieport 0000:00:00.0: pciehp: Slot(0): Link Down
> > > [   71.779120] xhci_hcd 0000:01:00.0: remove, state 4
> > > [   71.784113] usb usb5: USB disconnect, device number 1
> > > [   71.790398] xhci_hcd 0000:01:00.0: USB bus 5 deregistered
> > > [   72.511899] Internal error: synchronous external abort: 96000210 [#1] SMP
> > > [   72.518918] Modules linked in:
> > > [   72.522074] CPU: 1 PID: 988 Comm: irq/53-pciehp Not tainted 5.12.0-dirty #949
> > > [   72.536983] pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
> > > [   72.543182] pc : xhci_irq+0x70/0x17b8
> > > [   72.546972] lr : xhci_irq+0x28/0x17b8
> > > [   72.550752] sp : ffffffc012b8bab0
> > > [   72.554167] x29: ffffffc012b8bab0 x28: 00000000000000a0 
> > > [   72.559652] x27: 0000000000000060 x26: ffffff8000af2250 
> > > [   72.565135] x25: ffffffc0100b0d48 x24: ffffffc0100b0be0 
> > > [   72.570620] x23: ffffff80003be028 x22: ffffff8000af229c 
> > > [   72.576104] x21: 0000000000000080 x20: ffffff8000af2000 
> > > [   72.581587] x19: ffffff8000af2000 x18: 0000000000000004 
> > > [   72.587071] x17: 0000000000000000 x16: 0000000000000000 
> > > [   72.592553] x15: ffffffc01154cc70 x14: ffffff8001751df8 
> > > [   72.598037] x13: 0000000000000000 x12: 0000000000000000 
> > > [   72.603519] x11: ffffff8001751da8 x10: ffffffc01154cc78 
> > > [   72.609001] x9 : ffffffc01087c238 x8 : 0000000000000000 
> > > [   72.614485] x7 : ffffffc01162c4e0 x6 : 0000000000000000 
> > > [   72.619967] x5 : fffffffe00085000 x4 : fffffffe00085000 
> > > [   72.625451] x3 : 0000000000000000 x2 : 0000000000000001 
> > > [   72.630933] x1 : ffffffc0118bd024 x0 : 0000000000000000 
> > > [   72.636415] Call trace:
> > > [   72.638936]  xhci_irq+0x70/0x17b8
> > > [   72.642360]  usb_hcd_irq+0x34/0x50
> > > [   72.645876]  usb_hcd_pci_remove+0x78/0x138
> > > [   72.650106]  xhci_pci_remove+0x6c/0xa8
> > > [   72.653978]  pci_device_remove+0x44/0x108
> > > [   72.658122]  device_release_driver_internal+0x110/0x1e0
> > > [   72.663521]  device_release_driver+0x1c/0x28
> > > [   72.667931]  pci_stop_bus_device+0x84/0xc0
> > > [   72.672162]  pci_stop_and_remove_bus_device+0x1c/0x30
> > > [   72.677373]  pciehp_unconfigure_device+0x98/0xf8
> > > [   72.682138]  pciehp_disable_slot+0x60/0x118
> > > [   72.686457]  pciehp_handle_presence_or_link_change+0xec/0x3b0
> > > [   72.692386]  pciehp_ist+0x170/0x1a0
> > > [   72.695984]  irq_thread_fn+0x30/0x90
> > > [   72.699674]  irq_thread+0x13c/0x200
> > > [   72.703271]  kthread+0x12c/0x130
> > > [   72.706603]  ret_from_fork+0x10/0x1c
> > > [   72.710299] Code: 35ffff83 35002741 f9400f41 91001021 (b9400021) 
> > > [   72.716586] ---[ end trace 20ce3e30ff292c93 ]---
> > > [   72.721453] genirq: exiting task "irq/53-pciehp" (988) is an active IRQ thread (irq 53)
> > > [   72.730068] sched: RT throttling activated
> > > 
> > > After that the kernel is in a semi-broken state. Some functionality
> > > works, but some (like reboot) does not.
> > > 
> > > I can also reproduce it when I manually inject/fake this link down PCIe
> > > hot plug interrupt by setting the corresponding bits in the PCIe Root Status
> > > registers, so that the pciehp driver thinks that a link down event occurred.
> > > 
> > > I suspect that the issue is in the usb_hcd_pci_remove() function, which calls
> > > local_irq_disable()+usb_hcd_irq()+local_irq_enable() but does not take
> > > into account that the whole usb_hcd_pci_remove() function may be
> > > called from interrupt context.
> > 
> > usb_hcd_pci_remove() should NOT be called from interrupt context.
> > 
> > What is causing that to happen?
> 
> PCIe Hot Plug interrupt with PCI_EXP_SLTSTA_DLLSC status bit set.
> 
> I can reproduce it by issuing PCIe Hot Reset to PCIe controller (via
> setpci from userspace) which resulted in link down event (which is
> obvious) and PCIe controller then triggered link down interrupt.
> 
> > No PCI driver can handle that, especially USB ones.
> > 
> > > Can you look at this issue if it is really safe to call usb_hcd_irq()
> > > from interrupt context? Or rather if it is safe to call functions like
> > > pciehp_disable_slot() or device_release_driver() from interrupt context
> > > like it can be seen in call trace?
> > 
> > What is removing devices from an irq?
> 
> It can be seen in above call trace. It is pciehp_disable_slot() followed
> by pciehp_unconfigure_device().

But pciehp_disable_slot() is called under protection of a mutex, so we
"know" it can't be called from an irq.  The trace might be wrong there,
or someone moved to using a threaded irq handler somehow?

I would focus on the "synchronous external abort", are you sure that is
not just a platform error being hit somehow that is independent of the
xhci driver?
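
For reference, the abort code in the log can be split into its ESR_ELx
fields. The following is a minimal userspace sketch, not kernel code: the
names esr_fields, esr_decode and dfsc_name are hypothetical, and only the
bit positions are taken from the Arm architecture's ESR_ELx layout for
data aborts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Subset of ESR_ELx fields that matter for a data abort. */
struct esr_fields {
	unsigned ec;	/* bits [31:26]: exception class */
	bool il;	/* bit 25: trapped instruction was 32 bits long */
	bool ea;	/* bit 9: external abort type, implementation defined */
	bool wnr;	/* bit 6: 1 = write, 0 = read */
	unsigned dfsc;	/* bits [5:0]: data fault status code */
};

/* Split a raw syndrome value (e.g. 0x96000210 from the oops) into fields. */
struct esr_fields esr_decode(uint32_t esr)
{
	struct esr_fields f;

	f.ec = (esr >> 26) & 0x3f;
	f.il = (esr >> 25) & 1;
	f.ea = (esr >> 9) & 1;
	f.wnr = (esr >> 6) & 1;
	f.dfsc = esr & 0x3f;
	return f;
}

/* Name only the fault status seen in this report; others are left generic. */
const char *dfsc_name(unsigned dfsc)
{
	if (dfsc == 0x10)
		return "synchronous external abort, not on translation table walk";
	return "other (see the Arm ARM)";
}
```

For 0x96000210 this gives EC 0x25 (data abort taken without a change in
exception level), WnR 0 (a read) and DFSC 0x10. That is consistent with a
load from device memory failing after the link went down, but it does not
by itself say whether the platform or the driver is at fault.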

> > That is wrong, pci hotplug never used to do that, what recently changed?
> 
> I really do not know what was changed recently. I hope that other people
> in linux-pci ML would know history details better.
> 
> I just spotted this crash during debugging PCIe controller driver
> pci-aardvark.c with trying to expose its link down events via "hot plug"
> interrupt and corresponding link layer state flags.
> 
> And because in the whole call trace I see only generic PCIe and USB code
> paths without any driver-specific parts, I suspect that this is not a PCIe
> controller-specific issue but rather something "wrong" in the generic PCIe
> (or USB) code. That is why I sent this email, so maybe somebody else
> will find something suspicious here.
> 
> But still there is a chance that issue can be also in pci-aardvark.c
> driver and somehow it masked its issue and propagated it into generic
> PCIe hot plug code path.

Any chance you can use 'git bisect' to track down where this showed up?

thanks,

greg k-h
