* Having problems resetting a PCI device
@ 2017-03-29 20:03 Zytaruk, Kelly
  2017-03-29 20:54 ` Bjorn Helgaas
  0 siblings, 1 reply; 7+ messages in thread
From: Zytaruk, Kelly @ 2017-03-29 20:03 UTC
  To: linux-pci, Alex Williamson; +Cc: Bjorn Helgaas

I have a PCI device that is sitting behind a bridge.

Under certain reproducible circumstances the PCI device will become inactive.
Reading the PCI config space returns all 0xFFFFFFFF.

The bridge appears to still be functional. Reading the status from the bridge
I see a Fatal Error due to a Surprise Down event.

I am trying to figure out how to bring the device back online.

I tried toggling the secondary bus reset bit of the Bridge Control Register
but it doesn't appear to make any difference. I still see 0xFFFFFFFF in the
device config space.

I provided a pci_error_handler but the error_detected() function is not
getting called.

Given that these two methods are not helping me out, what other choices do I
have to either reset the PCI device or hot-plug the device from a kernel
driver? Or is there some other method of bringing the device back to life?

Note that I am running Linux 4.8 in dom0 on Xen (if that makes a difference).

Thanks,
Kelly


* Re: Having problems resetting a PCI device
  2017-03-29 20:03 Having problems resetting a PCI device Zytaruk, Kelly
@ 2017-03-29 20:54 ` Bjorn Helgaas
  2017-03-29 21:23   ` Zytaruk, Kelly
  2017-03-29 21:41   ` Zytaruk, Kelly
  0 siblings, 2 replies; 7+ messages in thread
From: Bjorn Helgaas @ 2017-03-29 20:54 UTC
  To: Zytaruk, Kelly; +Cc: linux-pci, Alex Williamson

Hi Kelly,

On Wed, Mar 29, 2017 at 08:03:33PM +0000, Zytaruk, Kelly wrote:
> I have a PCI device that is sitting behind a bridge. 
> 
> Under certain reproducible circumstances the PCI device will become
> inactive. Reading the PCI config space returns all 0xFFFFFFFF.
> 
> The bridge appears to still be functional. Reading the status from
> the bridge I see a Fatal Error due to a Surprise Down event.

Just to be specific, is this the "Surprise Down Error" in the AER
uncorrectable error status register?  "lspci -vv" probably decodes all
that for you.
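
For reference, a minimal sketch of decoding a raw AER Uncorrectable Error
Status value by hand, in case lspci's decoding is not available. The bit masks
are the ones I believe match the PCI_ERR_UNC_* definitions in the kernel's
pci_regs.h; the function names are hypothetical and only for illustration.

```python
# Masks for the AER Uncorrectable Error Status register. These are assumed to
# match the PCI_ERR_UNC_* values in include/uapi/linux/pci_regs.h.
AER_UNCOR_BITS = {
    0x00000010: "Data Link Protocol Error",
    0x00000020: "Surprise Down Error",
    0x00001000: "Poisoned TLP",
    0x00002000: "Flow Control Protocol Error",
    0x00004000: "Completion Timeout",
    0x00008000: "Completer Abort",
    0x00010000: "Unexpected Completion",
    0x00020000: "Receiver Overflow",
    0x00040000: "Malformed TLP",
    0x00080000: "ECRC Error",
    0x00100000: "Unsupported Request",
    0x00200000: "ACS Violation",
}


def decode_aer_uncor_status(status):
    """Return the names of all known error bits set in a 32-bit status value,
    ordered from the lowest bit to the highest."""
    return [name for mask, name in sorted(AER_UNCOR_BITS.items())
            if status & mask]


def has_surprise_down(status):
    """True if the Surprise Down Error bit (bit 5) is set."""
    return bool(status & 0x00000020)
```

A status value with only bit 5 set would decode to a single "Surprise Down
Error" entry, which is the case being asked about here.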

> I am trying to figure out how to bring the device back online.
> 
> I tried toggling the secondary bus reset bit of the Bridge Control
> Register but it doesn't appear to make any difference. I still see
> 0xFFFFFFFF in the device config space.

Are you calling pci_reset_function() or doing this by hand?
pci_reset_function() tries several different strategies, one of which
is toggling the secondary bus reset bit.
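
To make the "by hand" variant concrete, here is a rough sketch of the
read-modify-write sequence on the bridge's Bridge Control register (offset
0x3E, Secondary Bus Reset is bit 6). The read_word/write_word callbacks are
hypothetical stand-ins for whatever config accessors you use, and the two
delays are assumptions on my part (a short assertion time, then a longer
settle time), not values taken from this thread.

```python
import time

PCI_BRIDGE_CONTROL = 0x3E          # 16-bit Bridge Control register offset
PCI_BRIDGE_CTL_BUS_RESET = 0x0040  # Secondary Bus Reset bit (bit 6)


def secondary_bus_reset(read_word, write_word, sleep=time.sleep,
                        assert_s=0.002, settle_s=1.0):
    """Pulse Secondary Bus Reset on a bridge.

    read_word(offset) and write_word(offset, value) are hypothetical
    callbacks that access the *bridge's* config space, not the device's.
    """
    ctrl = read_word(PCI_BRIDGE_CONTROL)

    # Assert reset, keeping every other Bridge Control bit unchanged.
    write_word(PCI_BRIDGE_CONTROL, ctrl | PCI_BRIDGE_CTL_BUS_RESET)
    sleep(assert_s)

    # Deassert reset, then give the device time to come back before
    # touching its config space again.
    write_word(PCI_BRIDGE_CONTROL, ctrl & ~PCI_BRIDGE_CTL_BUS_RESET & 0xFFFF)
    sleep(settle_s)
```

The point of the sketch is the ordering: both writes go to the bridge, so the
sequence does not depend on the device behind it responding.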

> I provided a pci_error_handler but the error_detected() function is
> not getting called.

Do you have CONFIG_PCIEAER turned on?  I would naively expect AER to
log something and call your error_detected() function if this error
occurs (but I haven't looked at the code for a long time).

> Given that these two methods are not helping me out what other
> choices do I have to either reset the PCI device or hot-plug the
> device from a kernel driver. Or some other method of bring the
> device back to life.
> 
> Note that I am running Linux 4.8 in dom0 on Xen (if that makes a
> difference).
> 
> Thanks, Kelly


* RE: Having problems resetting a PCI device
  2017-03-29 20:54 ` Bjorn Helgaas
@ 2017-03-29 21:23   ` Zytaruk, Kelly
  2017-03-29 21:41   ` Zytaruk, Kelly
  1 sibling, 0 replies; 7+ messages in thread
From: Zytaruk, Kelly @ 2017-03-29 21:23 UTC
  To: Bjorn Helgaas; +Cc: linux-pci, Alex Williamson

Hi Bjorn,

> -----Original Message-----
> From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> Sent: Wednesday, March 29, 2017 4:55 PM
> To: Zytaruk, Kelly
> Cc: linux-pci@vger.kernel.org; Alex Williamson
> Subject: Re: Having problems resetting a PCI device
>
> Hi Kelly,
>
> On Wed, Mar 29, 2017 at 08:03:33PM +0000, Zytaruk, Kelly wrote:
> > I have a PCI device that is sitting behind a bridge.
> >
> > Under certain reproducible circumstances the PCI device will become
> > inactive. Reading the PCI config space returns all 0xFFFFFFFF.
> >
> > The bridge appears to still be functional. Reading the status from the
> > bridge I see a Fatal Error due to a Surprise Down event.
>
> Just to be specific, is this the "Surprise Down Error" in the AER
> uncorrectable error status register?  "lspci -vv" probably decodes all that
> for you.

Yes, it shows up in the bridge device PCI config space.

>
> > I am trying to figure out how to bring the device back online.
> >
> > I tried toggling the secondary bus reset bit of the Bridge Control
> > Register but it doesn't appear to make any difference. I still see
> > 0xFFFFFFFF in the device config space.
>
> Are you calling pci_reset_function() or doing this by hand?
> pci_reset_function() tries several different strategies, one of which is
> toggling the secondary bus reset bit.

I am doing it by hand.
I just found the pci_reset_function about 5 minutes ago as I was scanning
through pci.c for any clues.

>
> > I provided a pci_error_handler but the error_detected() function is
> > not getting called.
>
> Do you have CONFIG_PCIEAER turned on?  I would naively expect AER to log
> something and call your error_detected() function if this error occurs (but I

CONFIG_PCIEAER=y but error_detected() is not getting called.

I also noticed that CONFIG_HOTPLUG_PCI_PCIE=y.  How do I trigger a hot unplug,
hot plug?
Maybe that might bring it back?

> haven't looked at the code for a long time).
>
> > Given that these two methods are not helping me out what other choices
> > do I have to either reset the PCI device or hot-plug the device from a
> > kernel driver. Or some other method of bring the device back to life.
> >
> > Note that I am running Linux 4.8 in dom0 on Xen (if that makes a
> > difference).
> >
> > Thanks, Kelly


* RE: Having problems resetting a PCI device
  2017-03-29 20:54 ` Bjorn Helgaas
  2017-03-29 21:23   ` Zytaruk, Kelly
@ 2017-03-29 21:41   ` Zytaruk, Kelly
  2017-03-30 21:42     ` Bjorn Helgaas
  1 sibling, 1 reply; 7+ messages in thread
From: Zytaruk, Kelly @ 2017-03-29 21:41 UTC
  To: Bjorn Helgaas; +Cc: linux-pci, Alex Williamson



> -----Original Message-----
> From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> Sent: Wednesday, March 29, 2017 4:55 PM
> To: Zytaruk, Kelly
> Cc: linux-pci@vger.kernel.org; Alex Williamson
> Subject: Re: Having problems resetting a PCI device
>
> Hi Kelly,
>
> On Wed, Mar 29, 2017 at 08:03:33PM +0000, Zytaruk, Kelly wrote:
> > I have a PCI device that is sitting behind a bridge.
> >
> > Under certain reproducible circumstances the PCI device will become
> > inactive. Reading the PCI config space returns all 0xFFFFFFFF.
> >
> > The bridge appears to still be functional. Reading the status from the
> > bridge I see a Fatal Error due to a Surprise Down event.
>
> Just to be specific, is this the "Surprise Down Error" in the AER
> uncorrectable error status register?  "lspci -vv" probably decodes all that
> for you.
>
> > I am trying to figure out how to bring the device back online.
> >
> > I tried toggling the secondary bus reset bit of the Bridge Control
> > Register but it doesn't appear to make any difference. I still see
> > 0xFFFFFFFF in the device config space.
>
> Are you calling pci_reset_function() or doing this by hand?
> pci_reset_function() tries several different strategies, one of which is
> toggling the secondary bus reset bit.

I just read the documentation for the call and this could be a problem:
"The PCI device must be responsive to PCI config space in order to use this
function."

In my case reading PCI config space returns all 0xFFFFFFFF.

>
> > I provided a pci_error_handler but the error_detected() function is
> > not getting called.
>
> Do you have CONFIG_PCIEAER turned on?  I would naively expect AER to log
> something and call your error_detected() function if this error occurs (but I
> haven't looked at the code for a long time).
>
> > Given that these two methods are not helping me out what other choices
> > do I have to either reset the PCI device or hot-plug the device from a
> > kernel driver. Or some other method of bring the device back to life.
> >
> > Note that I am running Linux 4.8 in dom0 on Xen (if that makes a
> > difference).
> >
> > Thanks, Kelly


* Re: Having problems resetting a PCI device
  2017-03-29 21:41   ` Zytaruk, Kelly
@ 2017-03-30 21:42     ` Bjorn Helgaas
  2017-03-30 21:47       ` Zytaruk, Kelly
  0 siblings, 1 reply; 7+ messages in thread
From: Bjorn Helgaas @ 2017-03-30 21:42 UTC
  To: Zytaruk, Kelly; +Cc: linux-pci, Alex Williamson

On Wed, Mar 29, 2017 at 09:41:48PM +0000, Zytaruk, Kelly wrote:
> 
> 
> > -----Original Message-----
> > From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> > Sent: Wednesday, March 29, 2017 4:55 PM
> > To: Zytaruk, Kelly
> > Cc: linux-pci@vger.kernel.org; Alex Williamson
> > Subject: Re: Having problems resetting a PCI device
> > 
> > Hi Kelly,
> > 
> > On Wed, Mar 29, 2017 at 08:03:33PM +0000, Zytaruk, Kelly wrote:
> > > I have a PCI device that is sitting behind a bridge.
> > >
> > > Under certain reproducible circumstances the PCI device will become
> > > inactive. Reading the PCI config space returns all 0xFFFFFFFF.
> > >
> > > The bridge appears to still be functional. Reading the status from the
> > > bridge I see a Fatal Error due to a Surprise Down event.
> > 
> > Just to be specific, is this the "Surprise Down Error" in the AER uncorrectable
> > error status register?  "lspci -vv" probably decodes all that for you.
> > 
> > > I am trying to figure out how to bring the device back online.
> > >
> > > I tried toggling the secondary bus reset bit of the Bridge Control
> > > Register but it doesn't appear to make any difference. I still see
> > > 0xFFFFFFFF in the device config space.
> > 
> > Are you calling pci_reset_function() or doing this by hand?
> > pci_reset_function() tries several different strategies, one of which is toggling
> > the secondary bus reset bit.
> 
> I just read the documentation for the call and this could be a problem
> "The PCI device must be responsive  to PCI config space in order to use this function."
> 
> In my case reading PCI config space returns all 0xFFFFFFFF

I think Surprise Down means the link is down, so you won't be able to
reach the device at all until it gets reset.
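
As a small illustration of what "can't reach the device" looks like from
software: a config read that nobody answers comes back as all ones, and 0xFFFF
is not a valid Vendor ID. A minimal sketch of that check, assuming you already
have the raw config bytes in hand (the function name is hypothetical):

```python
def config_read_failed(config_bytes):
    """True if the first dword (Vendor ID + Device ID) reads as 0xFFFFFFFF.

    config_bytes is the raw config space image of the device, however it
    was obtained; only the first four bytes are inspected.
    """
    return len(config_bytes) >= 4 and bytes(config_bytes[:4]) == b"\xff" * 4
```

This only says the read was not answered; it does not distinguish a link that
is down from a device that is present but wedged.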

But the secondary bus reset is done by the switch port immediately
upstream from the device, so that should still work.  If the device
still doesn't work after doing a secondary bus reset, maybe there's a
device defect related to reset.

That port (a Root Port or Switch Downstream Port) is probably where
the Surprise Down error was logged.  If you have CONFIG_PCIEAER turned
on, I think the kernel should log some stuff in dmesg, hopefully
including the error type and something that identifies the link.  Do
you see any of that?

If you don't have CONFIG_PCIEAER turned on, you should be able to use
lspci to look at what's logged in the AER capability.  Unfortunately,
lspci doesn't know how to decode everything, but you can use
"lspci -xxxx" to look at it and decode things manually.
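
If it helps with the manual decoding, here is a rough sketch that parses the
hex dump and walks the extended capability list (which starts at offset 0x100)
looking for the AER capability, ID 0x0001, then reads the Uncorrectable Error
Status dword at offset 4 within it. The dump format ("offset: byte byte ...",
16 bytes per line) is what I expect lspci to print, so treat the regular
expression as an assumption; all function names are hypothetical.

```python
import re
import struct

# One hex-dump line: a 2- or 3-digit hex offset, a colon, then byte values.
HEX_LINE = re.compile(r"([0-9a-f]{2,3}):((?: [0-9a-f]{2})+)")

AER_CAP_ID = 0x0001       # Advanced Error Reporting extended capability ID
AER_UNCOR_STATUS = 0x04   # Uncorrectable Error Status, relative to the cap


def parse_hexdump(text):
    """Turn hex-dump lines into a config space image; other lines are skipped."""
    cfg = bytearray()
    for line in text.splitlines():
        m = HEX_LINE.fullmatch(line.strip())
        if not m:
            continue
        offset = int(m.group(1), 16)
        data = bytes.fromhex(m.group(2).replace(" ", ""))
        if len(cfg) < offset + len(data):
            cfg.extend(b"\x00" * (offset + len(data) - len(cfg)))
        cfg[offset:offset + len(data)] = data
    return bytes(cfg)


def find_ext_cap(cfg, cap_id):
    """Return the offset of an extended capability, or None if not found."""
    offset = 0x100
    seen = set()
    while 0x100 <= offset <= len(cfg) - 4 and offset not in seen:
        seen.add(offset)  # guard against a looping capability list
        (header,) = struct.unpack_from("<I", cfg, offset)
        if header in (0, 0xFFFFFFFF):
            return None
        if (header & 0xFFFF) == cap_id:
            return offset
        offset = (header >> 20) & 0xFFC  # bits 31:20 hold the next offset
    return None


def aer_uncor_status(cfg):
    """Return the raw AER Uncorrectable Error Status, or None if no AER cap."""
    base = find_ext_cap(cfg, AER_CAP_ID)
    if base is None or base + AER_UNCOR_STATUS + 4 > len(cfg):
        return None
    return struct.unpack_from("<I", cfg, base + AER_UNCOR_STATUS)[0]
```

Run it against the dump from the port upstream of the device, since that is
where the error should have been logged. Bit 5 (0x20) of the returned value is
Surprise Down.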

> > > I provided a pci_error_handler but the error_detected() function is
> > > not getting called.
> > 
> > Do you have CONFIG_PCIEAER turned on?  I would naively expect AER to log
> > something and call your error_detected() function if this error occurs (but I
> > haven't looked at the code for a long time).


> > > Given that these two methods are not helping me out what other choices
> > > do I have to either reset the PCI device or hot-plug the device from a
> > > kernel driver. Or some other method of bring the device back to life.

You should be able to "echo 1 > /sys/bus/pci/devices/.../remove" to
hot-unplug the device, then "echo 1 > /sys/bus/pci/rescan" to
rediscover it.
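
The same two writes as a small script, in case you want to drive them from a
test harness. This is only a sketch: the bdf argument is a placeholder for
your device's address, sysfs_root exists only so the function can be tried
against a scratch directory, and it needs root when pointed at the real /sys.

```python
import os


def remove_and_rescan(bdf, sysfs_root="/sys"):
    """Hot-unplug one PCI device through sysfs, then rescan all PCI buses.

    bdf is the device address as it appears under bus/pci/devices (a
    placeholder here, supply your own).
    """
    remove_path = os.path.join(sysfs_root, "bus/pci/devices", bdf, "remove")
    rescan_path = os.path.join(sysfs_root, "bus/pci/rescan")

    # Equivalent to: echo 1 > .../remove
    with open(remove_path, "w") as f:
        f.write("1")

    # Equivalent to: echo 1 > /sys/bus/pci/rescan
    with open(rescan_path, "w") as f:
        f.write("1")
```

Note the remove step can itself touch the device, so it is worth trying on a
system you can afford to have hang.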

Bjorn


* RE: Having problems resetting a PCI device
  2017-03-30 21:42     ` Bjorn Helgaas
@ 2017-03-30 21:47       ` Zytaruk, Kelly
  0 siblings, 0 replies; 7+ messages in thread
From: Zytaruk, Kelly @ 2017-03-30 21:47 UTC
  To: Bjorn Helgaas; +Cc: linux-pci, Alex Williamson



> -----Original Message-----
> From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> Sent: Thursday, March 30, 2017 5:42 PM
> To: Zytaruk, Kelly
> Cc: linux-pci@vger.kernel.org; Alex Williamson
> Subject: Re: Having problems resetting a PCI device
>
> On Wed, Mar 29, 2017 at 09:41:48PM +0000, Zytaruk, Kelly wrote:
> >
> >
> > > -----Original Message-----
> > > From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> > > Sent: Wednesday, March 29, 2017 4:55 PM
> > > To: Zytaruk, Kelly
> > > Cc: linux-pci@vger.kernel.org; Alex Williamson
> > > Subject: Re: Having problems resetting a PCI device
> > >
> > > Hi Kelly,
> > >
> > > On Wed, Mar 29, 2017 at 08:03:33PM +0000, Zytaruk, Kelly wrote:
> > > > I have a PCI device that is sitting behind a bridge.
> > > >
> > > > Under certain reproducible circumstances the PCI device will
> > > > become inactive. Reading the PCI config space returns all 0xFFFFFFFF.
> > > >
> > > > The bridge appears to still be functional. Reading the status from
> > > > the bridge I see a Fatal Error due to a Surprise Down event.
> > >
> > > Just to be specific, is this the "Surprise Down Error" in the AER
> > > uncorrectable error status register?  "lspci -vv" probably decodes all
> > > that for you.
> > >
> > > > I am trying to figure out how to bring the device back online.
> > > >
> > > > I tried toggling the secondary bus reset bit of the Bridge Control
> > > > Register but it doesn't appear to make any difference. I still see
> > > > 0xFFFFFFFF in the device config space.
> > >
> > > Are you calling pci_reset_function() or doing this by hand?
> > > pci_reset_function() tries several different strategies, one of
> > > which is toggling the secondary bus reset bit.
> >
> > I just read the documentation for the call and this could be a problem
> > "The PCI device must be responsive  to PCI config space in order to use
> > this function."
> >
> > In my case reading PCI config space returns all 0xFFFFFFFF
>
> I think Surprise Down means the link is down, so you won't be able to reach
> the device at all until it gets reset.
>
> But the secondary bus reset is done by the switch port immediately upstream
> from the device, so that should still work.  If the device still doesn't work
> after doing a secondary bus reset, maybe there's a device defect related to
> reset.
>
> That port (a Root Port or Switch Downstream Port) is probably where the
> Surprise Down error was logged.  If you have CONFIG_PCIEAER turned on, I
> think the kernel should log some stuff in dmesg, hopefully including the
> error type and something that identifies the link.  Do you see any of that?

I am not seeing anything in the dmesg log.

>
> If you don't have CONFIG_PCIEAER turned on, you should be able to use lspci
> to look at what's logged in the AER capability.  Unfortunately, lspci doesn't
> know how to decode everything, but you can use "lspci -xxxx" to look at it
> and decode things manually.
>
> > > > I provided a pci_error_handler but the error_detected() function
> > > > is not getting called.
> > >
> > > Do you have CONFIG_PCIEAER turned on?  I would naively expect AER to
> > > log something and call your error_detected() function if this error
> > > occurs (but I haven't looked at the code for a long time).
>
>
> > > > Given that these two methods are not helping me out what other
> > > > choices do I have to either reset the PCI device or hot-plug the
> > > > device from a kernel driver. Or some other method of bring the device
> > > > back to life.
>
> You should be able to "echo 1 > /sys/bus/pci/devices/.../remove" to
> hot-unplug the device, then "echo 1 > /sys/bus/pci/rescan" to rediscover it.

I tried "echo 1 > remove" after the hang and it hung the hypervisor.  The Xen
log showed a fault followed by a reboot about 5 seconds later.

I don't recall the exact message, but the last entry on the stack had
something to do with restoring MSI interrupts just before the reboot.

>
> Bjorn

