linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Regression caused by "PCI: Probe for device reset support during enumeration"
@ 2018-12-14 13:15 Ross Lagerwall
  2018-12-14 17:18 ` Bjorn Helgaas
  0 siblings, 1 reply; 3+ messages in thread
From: Ross Lagerwall @ 2018-12-14 13:15 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci

Hi Bjorn,

Your commit 5b0764cac9f1 ("PCI: Probe for device reset support during 
enumeration") moved checking whether a device could be reset much 
earlier to when the device is first probed. When the device is first 
probed, the other devices on the bus may not have been discovered yet. 
This means that we will claim to support SBR as a reset mechanism 
because it is the only device behind the bus at that pointer meanwhile 
the others simply haven't been discovered yet. This results in 
dev->reset_fn being incorrectly set to true and a reset file being 
created. When userspace actually tries to use the reset file it fails 
because now there are other sibling devices preventing the use of an SBR.

How can this best be fixed? I was considering moving the check to the 
end of pci_bus_add_devices(), but I'm not at all familiar with the code 
so perhaps there is a better way.

Thanks,
-- 
Ross Lagerwall

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Regression caused by "PCI: Probe for device reset support during enumeration"
  2018-12-14 13:15 Regression caused by "PCI: Probe for device reset support during enumeration" Ross Lagerwall
@ 2018-12-14 17:18 ` Bjorn Helgaas
  2018-12-14 18:40   ` Alex Williamson
  0 siblings, 1 reply; 3+ messages in thread
From: Bjorn Helgaas @ 2018-12-14 17:18 UTC (permalink / raw)
  To: Ross Lagerwall; +Cc: linux-pci, Sinan Kaya, Alex Williamson, Rafael J. Wysocki

[+cc Sinan, Alex, Rafael]

On Fri, Dec 14, 2018 at 01:15:58PM +0000, Ross Lagerwall wrote:
> Hi Bjorn,
> 
> Your commit 5b0764cac9f1 ("PCI: Probe for device reset support during
> enumeration") moved checking whether a device could be reset much earlier to
> when the device is first probed. When the device is first probed, the other
> devices on the bus may not have been discovered yet. This means that we will
> claim to support SBR as a reset mechanism because it is the only device
> behind the bus at that pointer meanwhile the others simply haven't been
> discovered yet. This results in dev->reset_fn being incorrectly set to true
> and a reset file being created. When userspace actually tries to use the
> reset file it fails because now there are other sibling devices preventing
> the use of an SBR.

First of all, I'm very sorry about the regression and thanks very much
for the report!  I know it's a lot of work to track down things like
this.

We run pci_probe_reset_function() during pci_init_capabilities(),
which is before other devices on the bus are discovered, as you say.
That checks not just for SBR support, but for other types of reset
(FLR, PM, etc) as well.

In your case the userspace reset actually fails, so I assume that
means SBR is the *only* supported reset type for that device?

Since you have two devices on the bus, I guess SBR of device A
*should* work during the interval between enumerating A and B (it will
reset both A and B in hardware, but since the OS doesn't know about B
yet, that's probably OK).  After we enumerate B and potentially bind a
driver to it, we can't use SBR any more, of course.

I could imagine removing A's sysfs reset file when B is enumerated,
but the locking might be ugly.  And of course A might support other
types of reset, and then you *want* A's sysfs file.

What happened prior to 5b0764cac9f1?  There was never a sysfs reset
file at all, so userspace didn't even attempt the reset?

We could wait to create the sysfs reset file until after we've
enumerated all the devices on the bus, but then we're opening a race
when a device could be hot-added to the bus.  E.g., boot-time
enumeration finds only A, we create its sysfs reset file, then we
hot-add device B (admittedly mostly a conventional PCI scenario), and
now we're in the same situation you're seeing where A's reset file
exists, but it always fails because SBR is no longer safe.

Bjorn

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: Regression caused by "PCI: Probe for device reset support during enumeration"
  2018-12-14 17:18 ` Bjorn Helgaas
@ 2018-12-14 18:40   ` Alex Williamson
  0 siblings, 0 replies; 3+ messages in thread
From: Alex Williamson @ 2018-12-14 18:40 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Ross Lagerwall, linux-pci, Sinan Kaya, Rafael J. Wysocki

On Fri, 14 Dec 2018 11:18:38 -0600
Bjorn Helgaas <helgaas@kernel.org> wrote:

> [+cc Sinan, Alex, Rafael]
> 
> On Fri, Dec 14, 2018 at 01:15:58PM +0000, Ross Lagerwall wrote:
> > Hi Bjorn,
> > 
> > Your commit 5b0764cac9f1 ("PCI: Probe for device reset support during
> > enumeration") moved checking whether a device could be reset much earlier to
> > when the device is first probed. When the device is first probed, the other
> > devices on the bus may not have been discovered yet. This means that we will
> > claim to support SBR as a reset mechanism because it is the only device
> > behind the bus at that pointer meanwhile the others simply haven't been
> > discovered yet. This results in dev->reset_fn being incorrectly set to true
> > and a reset file being created. When userspace actually tries to use the
> > reset file it fails because now there are other sibling devices preventing
> > the use of an SBR.  
> 
> First of all, I'm very sorry about the regression and thanks very much
> for the report!  I know it's a lot of work to track down things like
> this.
> 
> We run pci_probe_reset_function() during pci_init_capabilities(),
> which is before other devices on the bus are discovered, as you say.
> That checks not just for SBR support, but for other types of reset
> (FLR, PM, etc) as well.
> 
> In your case the userspace reset actually fails, so I assume that
> means SBR is the *only* supported reset type for that device?
> 
> Since you have two devices on the bus, I guess SBR of device A
> *should* work during the interval between enumerating A and B (it will
> reset both A and B in hardware, but since the OS doesn't know about B
> yet, that's probably OK).  After we enumerate B and potentially bind a
> driver to it, we can't use SBR any more, of course.
> 
> I could imagine removing A's sysfs reset file when B is enumerated,
> but the locking might be ugly.  And of course A might support other
> types of reset, and then you *want* A's sysfs file.

I've actually wished for the opposite of this in the past, given a bus
with devices A and B with no sysfs reset file, I wish I could 'echo 1 >
B/remove' to rescan for SBR for device A.

> What happened prior to 5b0764cac9f1?  There was never a sysfs reset
> file at all, so userspace didn't even attempt the reset?

Yep.

> We could wait to create the sysfs reset file until after we've
> enumerated all the devices on the bus, but then we're opening a race
> when a device could be hot-added to the bus.  E.g., boot-time
> enumeration finds only A, we create its sysfs reset file, then we
> hot-add device B (admittedly mostly a conventional PCI scenario), and
> now we're in the same situation you're seeing where A's reset file
> exists, but it always fails because SBR is no longer safe.

I'd welcome dynamically re-evaluating the reset availability on device
add and remove, but it might get a bit complicated.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2018-12-14 18:40 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-14 13:15 Regression caused by "PCI: Probe for device reset support during enumeration" Ross Lagerwall
2018-12-14 17:18 ` Bjorn Helgaas
2018-12-14 18:40   ` Alex Williamson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).