linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* arm64 PCI resource allocation issue
@ 2022-08-02  4:07 Benjamin Herrenschmidt
  2022-08-02  7:46 ` Ard Biesheuvel
  2022-08-04 10:36 ` Lorenzo Pieralisi
  0 siblings, 2 replies; 8+ messages in thread
From: Benjamin Herrenschmidt @ 2022-08-02  4:07 UTC (permalink / raw)
  To: linux-pci
  Cc: bhelgaas, mark.rutland, linux-arm-kernel, Jeremy Linton,
	Ali Saidi, David Woodhouse

Hi Folks !

It's been a while ... (please extend the CC list as needed)

I'd like to re-open an ancient issue because it's still biting us
(AWS).

A few years back, I updated the PCIe resource allocation code to be a
bit more in line with what other architectures do. That said, once
thing we couldn't agree on was to do like x86 and default to preserving
the firmware provided resources by default.

On x86, the kernel "allocates" (claims) the resources (unless it finds
something obviously wrong), then allocates anything left unallocated.

On arm64, we use to just re-allocate everything. I changed this to
first use some more common code for doing all that, but also to have
the option to claim existing resources if _DSM tells us to preserve
them for a given host brigde.

I still think this is the wrong way to go and that we should preserve
the UEFI resources by default unless told not to :-)

The case back then was that there existed some (how many ? there was
one real example if I remember correctly) bogus firwmares that came out
of UEFI with too small windows. We could just quirk those ....

The reason I'm bringing this back is that re-allocating resources for
system devices cause problems.

The most obvious one today that is affecting EC2 instances is that the
UART address specified in SPCR is no longer valid, causing issues
ranging from the console not working to MMIO to what becomes "random
addresses". Typically today this is "worked around" by using
console=ttyS0 to force selection of the first detected PCI UART,
because the match against SPCR is based on address and it won't match,
but there's always the underlying risk that things like earlycon starts
poking at now-incorrect addresses until 8250 takes over.

This is the most obvious problem. Any other "system" device that
happens to be PCIe based (anything detected early, via device-tree,
ACPI or otherwise) is at risk of a similar issue. On x86 that could be
catastrophic because near everything looks like a PCI device, on arm64
we seem to have been getting away with it a bit more easily ... so far.

The alternative here would be to use ad-hock kludges for such system
devices, to "register" the addresses early, and have some kind of hook
in the PCI code that keeps track of them as they get remapped.

If we want this, I would propose (happy to provide the implementation
but let's discuss the design first) something along the line of a
generic mechanism to "register" such a system device, which would add
it to a list. That list would be scanned on PCI device discovery for
BAR address matches, and the pci_dev/BAR# added to the entry (that or
put a pointer to the entry into pci_dev for speed/efficiency).

The difficulty is how is that update propagated:

This is of course fiddly. For example, the serial info is passed via
two different ways, one being earlycon (and probably the easiest to
track), the other one an ASCII string passed to
add_preferred_console(), which would require more significant hackery
(the code dealing with console mathing is a gothic horror).

Also if such a system device is in continuous use during the boot
process (UART ?) it needs to be "updated" as soon as possible after the
BARs (and parent BARs) have been also updated (in fact this is
generally why PCI debug dies horribly when using PCI based UARTs).
Maybe an (optional) callback that earlycon can add ?

Additionally, this would only work if such system devices are
"registered" before they get remapped... 

Another approach would be to have pci_dev keep a copy of the original
resources (at least for the primary BARs) and provide an accessor for
use by things like earlycon or 8250 to compare against these, though
that doesn't solve the problem of promptly "updating" drivers for
system devices.

Opinions ?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: arm64 PCI resource allocation issue
  2022-08-02  4:07 arm64 PCI resource allocation issue Benjamin Herrenschmidt
@ 2022-08-02  7:46 ` Ard Biesheuvel
  2022-08-02  9:18   ` [EXTERNAL]arm64 " David Woodhouse
  2022-08-03  2:31   ` arm64 " Benjamin Herrenschmidt
  2022-08-04 10:36 ` Lorenzo Pieralisi
  1 sibling, 2 replies; 8+ messages in thread
From: Ard Biesheuvel @ 2022-08-02  7:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-pci, bhelgaas, mark.rutland, linux-arm-kernel,
	Jeremy Linton, Ali Saidi, David Woodhouse, Robin Murphy

Hi Ben,

On Tue, 2 Aug 2022 at 06:13, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> Hi Folks !
>
> It's been a while ... (please extend the CC list as needed)
>
> I'd like to re-open an ancient issue because it's still biting us
> (AWS).
>
> A few years back, I updated the PCIe resource allocation code to be a
> bit more in line with what other architectures do. That said, once
> thing we couldn't agree on was to do like x86 and default to preserving
> the firmware provided resources by default.
>
> On x86, the kernel "allocates" (claims) the resources (unless it finds
> something obviously wrong), then allocates anything left unallocated.
>
> On arm64, we use to just re-allocate everything. I changed this to
> first use some more common code for doing all that, but also to have
> the option to claim existing resources if _DSM tells us to preserve
> them for a given host brigde.
>
> I still think this is the wrong way to go and that we should preserve
> the UEFI resources by default unless told not to :-)
>

+1

Using _DSM #5 also prevents, e.g., the AMD GPU driver from resizing
its BARs, which is unfortunate. And the CXL stuff that is layered on
top of PCIe is going to get trickier when the OS is free to reassign
resources, so I expect _DSM #5 to be used more widely in that context.

So are there any other reasons to avoid _DSM #5 in your case?

> The case back then was that there existed some (how many ? there was
> one real example if I remember correctly) bogus firwmares that came out
> of UEFI with too small windows. We could just quirk those ....
>

Agreed. We could add another hint at the firmware level that the PCIe
resources have been assigned, and as far as the firmware is concerned,
no changes are needed. This would be weaker than _DSM #5 (which means
'resource allocations *must* be honored for some unspecified reason,
which is similar to 'probe-only' on DT, i.e., any problems will not be
fixed)

> The reason I'm bringing this back is that re-allocating resources for
> system devices cause problems.
>
> The most obvious one today that is affecting EC2 instances is that the
> UART address specified in SPCR is no longer valid, causing issues
> ranging from the console not working to MMIO to what becomes "random
> addresses". Typically today this is "worked around" by using
> console=ttyS0 to force selection of the first detected PCI UART,
> because the match against SPCR is based on address and it won't match,
> but there's always the underlying risk that things like earlycon starts
> poking at now-incorrect addresses until 8250 takes over.
>
> This is the most obvious problem. Any other "system" device that
> happens to be PCIe based (anything detected early, via device-tree,
> ACPI or otherwise) is at risk of a similar issue. On x86 that could be
> catastrophic because near everything looks like a PCI device, on arm64
> we seem to have been getting away with it a bit more easily ... so far.
>

So what is the reason you are avoiding _DSM #5?

> The alternative here would be to use ad-hock kludges for such system
> devices, to "register" the addresses early, and have some kind of hook
> in the PCI code that keeps track of them as they get remapped.
>

Yeah, we did this for the EFI framebuffer but this doesn't really
scale IMHO so I would prefer to avoid that.

> If we want this, I would propose (happy to provide the implementation
> but let's discuss the design first) something along the line of a
> generic mechanism to "register" such a system device, which would add
> it to a list. That list would be scanned on PCI device discovery for
> BAR address matches, and the pci_dev/BAR# added to the entry (that or
> put a pointer to the entry into pci_dev for speed/efficiency).
>

This means that bus numbers cannot be reassigned, which I don't think
we rely on today. To positively identify a PCI device, you'll need
some kind of path notation to avoid relying on the bus numbers
assigned by the firmware (this could happen for hot-pluggable root
ports where no bus range has been reserved by the firmware)

> The difficulty is how is that update propagated:
>
> This is of course fiddly. For example, the serial info is passed via
> two different ways, one being earlycon (and probably the easiest to
> track), the other one an ASCII string passed to
> add_preferred_console(), which would require more significant hackery
> (the code dealing with console mathing is a gothic horror).
>
> Also if such a system device is in continuous use during the boot
> process (UART ?) it needs to be "updated" as soon as possible after the
> BARs (and parent BARs) have been also updated (in fact this is
> generally why PCI debug dies horribly when using PCI based UARTs).
> Maybe an (optional) callback that earlycon can add ?
>
> Additionally, this would only work if such system devices are
> "registered" before they get remapped...
>
> Another approach would be to have pci_dev keep a copy of the original
> resources (at least for the primary BARs) and provide an accessor for
> use by things like earlycon or 8250 to compare against these, though
> that doesn't solve the problem of promptly "updating" drivers for
> system devices.
>
> Opinions ?
>

I don't think working around it like this is going to be maintainable
in the long term. I would much rather move to a model where the OS
[conditionally] preserves the bus and BAR assignments, perhaps in a
more fine-grained way than _DSM #5? Or at least more permissive.

The last time this came up in the PCI firmware SIG, we did discuss
_DSM #5 at intermediate levels of the resource tree IIRC, which could
be one way around this. But I'd still prefer a model where the
resource assignments are guaranteed to be preserved if they meet some
kind of agreed standard between the OS and the firmware.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [EXTERNAL]arm64 PCI resource allocation issue
  2022-08-02  7:46 ` Ard Biesheuvel
@ 2022-08-02  9:18   ` David Woodhouse
  2022-08-03  2:32     ` Benjamin Herrenschmidt
  2022-08-03  2:31   ` arm64 " Benjamin Herrenschmidt
  1 sibling, 1 reply; 8+ messages in thread
From: David Woodhouse @ 2022-08-02  9:18 UTC (permalink / raw)
  To: Ard Biesheuvel, Benjamin Herrenschmidt
  Cc: linux-pci, bhelgaas, mark.rutland, linux-arm-kernel,
	Jeremy Linton, Ali Saidi, Robin Murphy

[-- Attachment #1: Type: text/plain, Size: 1087 bytes --]

On Tue, 2022-08-02 at 09:46 +0200, Ard Biesheuvel wrote:
> > If we want this, I would propose (happy to provide the implementation
> > but let's discuss the design first) something along the line of a
> > generic mechanism to "register" such a system device, which would add
> > it to a list. That list would be scanned on PCI device discovery for
> > BAR address matches, and the pci_dev/BAR# added to the entry (that or
> > put a pointer to the entry into pci_dev for speed/efficiency).
> 
> This means that bus numbers cannot be reassigned, which I don't think
> we rely on today. To positively identify a PCI device, you'll need
> some kind of path notation to avoid relying on the bus numbers
> assigned by the firmware (this could happen for hot-pluggable root
> ports where no bus range has been reserved by the firmware)

That kind of path notation already exists for the Intel IOMMU, and
probably others. See dmar_match_pci_path(), dmar_pci_bus_add_dev() etc.

It would be good to lift that out and make it generic, rather than
reinventing another version.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: arm64 PCI resource allocation issue
  2022-08-02  7:46 ` Ard Biesheuvel
  2022-08-02  9:18   ` [EXTERNAL]arm64 " David Woodhouse
@ 2022-08-03  2:31   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 8+ messages in thread
From: Benjamin Herrenschmidt @ 2022-08-03  2:31 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: linux-pci, bhelgaas, mark.rutland, linux-arm-kernel,
	Jeremy Linton, Ali Saidi, David Woodhouse, Robin Murphy

On Tue, 2022-08-02 at 09:46 +0200, Ard Biesheuvel wrote:
> 
> > A few years back, I updated the PCIe resource allocation code to be a
> > bit more in line with what other architectures do. That said, once
> > thing we couldn't agree on was to do like x86 and default to preserving
> > the firmware provided resources by default.
> > 
> > On x86, the kernel "allocates" (claims) the resources (unless it finds
> > something obviously wrong), then allocates anything left unallocated.
> > 
> > On arm64, we use to just re-allocate everything. I changed this to
> > first use some more common code for doing all that, but also to have
> > the option to claim existing resources if _DSM tells us to preserve
> > them for a given host brigde.
> > 
> > I still think this is the wrong way to go and that we should preserve
> > the UEFI resources by default unless told not to :-)
> > 
> 
> +1
> 
> Using _DSM #5 also prevents, e.g., the AMD GPU driver from resizing
> its BARs, which is unfortunate.

Why ? It doesn't have to as long as there's space available somewhere
for it. Note that most x86 UEFIs have specific options to enable BAR
resizing, probably for that specific reason but I don't know for sure
what that option really  does.

But re-assigning under driver control is always ok if the space is
available somewhere.

What I want is the behaviour we have on x86 which is to honor what was
assigned by default. It doesnt' prevent explicit re-assignment under
driver control later on. (Well it shouldn't). Same goes with VFIO
etc...

>  And the CXL stuff that is layered on
> top of PCIe is going to get trickier when the OS is free to reassign
> resources, so I expect _DSM #5 to be used more widely in that context.
> 
> So are there any other reasons to avoid _DSM #5 in your case?

We currently use _DSM #5 to force the use of the firmware allocation on
our metal systems, but here too, simply honoring existing resources at
boot would be a viable alternative.

We could fix the serial console problem I mentioned further down by
extending the use of _DSM #5 but I don't like it. First it's an ad-hoc
fix for us and doesn't fix the general problem that SPRC for PCI based
UART is basically broken unless we have _DSM in firwmare. Secondly, in
our specific case, changing this would mean a potentially disruptive
change to the PCI layout of existing instance types, which is the kind
of change we prefer not exposing our customers to.  

> > The case back then was that there existed some (how many ? there was
> > one real example if I remember correctly) bogus firwmares that came out
> > of UEFI with too small windows. We could just quirk those ....
> > 
> 
> Agreed. We could add another hint at the firmware level that the PCIe
> resources have been assigned, and as far as the firmware is concerned,
> no changes are needed.

I don't think we need the firmware to set a hint. Fundamentally a UEFI
firmware is supposed to do it. Additionally, it's perfectly fine for
firmware to only assign *some* resources and Linux already knows how to
pick it up from there and assign what's left. This is how x86 works.

If what we are trying to deal is known broken UEFI implementations on
specific hardware, I'd rather we add a quirk mechanism to force re-
assignment on selected firmwares to deal with them.

For DT platforms, I don't mind having "reassign all" be the default
with potentially a DT property to flip that around. That makes sense.
But for UEFI+ACPI platforms, I think the default really should be to
honor what's been assigned.

>  This would be weaker than _DSM #5 (which means
> 'resource allocations *must* be honored for some unspecified reason,
> which is similar to 'probe-only' on DT, i.e., any problems will not be
> fixed)

One can be a bit pedantic.... Linux will still assign what was left
unassigned and will still force-reassign a range of "really broken"
setups (such as devices that appear to be outside of their bridges
etc...). Just like x86 :-)

It's also capable of reallocating of asked for and there's space
available under the host bridge (or in the top address space).

The general expectation however is that if a UEFI+ACPI platform has
assigned resources at boot, we should probably honor them unless we
have a very good reason not to.

> > The reason I'm bringing this back is that re-allocating resources for
> > system devices cause problems.
> > 
> > The most obvious one today that is affecting EC2 instances is that the
> > UART address specified in SPCR is no longer valid, causing issues
> > ranging from the console not working to MMIO to what becomes "random
> > addresses". Typically today this is "worked around" by using
> > console=ttyS0 to force selection of the first detected PCI UART,
> > because the match against SPCR is based on address and it won't match,
> > but there's always the underlying risk that things like earlycon starts
> > poking at now-incorrect addresses until 8250 takes over.
> > 
> > This is the most obvious problem. Any other "system" device that
> > happens to be PCIe based (anything detected early, via device-tree,
> > ACPI or otherwise) is at risk of a similar issue. On x86 that could be
> > catastrophic because near everything looks like a PCI device, on arm64
> > we seem to have been getting away with it a bit more easily ... so far.
> > 
> 
> So what is the reason you are avoiding _DSM #5?

The reason *we* (EC2) specifically is that historically we didn't have
it on virtual instances and it would be a disruptive change. x86 never
needed it.

But there are other reasons. For example we only know how to honor it
for the entire hierarchy below a root bridge, it's not granular. 

> > The alternative here would be to use ad-hock kludges for such system
> > devices, to "register" the addresses early, and have some kind of hook
> > in the PCI code that keeps track of them as they get remapped.
> > 
> 
> Yeah, we did this for the EFI framebuffer but this doesn't really
> scale IMHO so I would prefer to avoid that.

Ah I missed that bit.

> > If we want this, I would propose (happy to provide the implementation
> > but let's discuss the design first) something along the line of a
> > generic mechanism to "register" such a system device, which would add
> > it to a list. That list would be scanned on PCI device discovery for
> > BAR address matches, and the pci_dev/BAR# added to the entry (that or
> > put a pointer to the entry into pci_dev for speed/efficiency).
> > 
> 
> This means that bus numbers cannot be reassigned, which I don't think
> we rely on today.

Not sure why we would have that limitation. My idea was to match the
address at discovery time and link to the struct pci_dev, which should
be agnostic to bus numbering.

>  To positively identify a PCI device, you'll need
> some kind of path notation to avoid relying on the bus numbers
> assigned by the firmware (this could happen for hot-pluggable root
> ports where no bus range has been reserved by the firmware)

No, we would match the firmware provided address with the resource
location at probe time. But that's just one idea... anotehr one could
be to have a list of such "firmware" devices at boot and at probe time,
when such a device is detected, we "flag" it.

Then we introduce a pass that force the resource allocation for those
flagged devices (and the parents)

> > The difficulty is how is that update propagated:
> > 
> > This is of course fiddly. For example, the serial info is passed via
> > two different ways, one being earlycon (and probably the easiest to
> > track), the other one an ASCII string passed to
> > add_preferred_console(), which would require more significant hackery
> > (the code dealing with console mathing is a gothic horror).
> > 
> > Also if such a system device is in continuous use during the boot
> > process (UART ?) it needs to be "updated" as soon as possible after the
> > BARs (and parent BARs) have been also updated (in fact this is
> > generally why PCI debug dies horribly when using PCI based UARTs).
> > Maybe an (optional) callback that earlycon can add ?
> > 
> > Additionally, this would only work if such system devices are
> > "registered" before they get remapped...
> > 
> > Another approach would be to have pci_dev keep a copy of the original
> > resources (at least for the primary BARs) and provide an accessor for
> > use by things like earlycon or 8250 to compare against these, though
> > that doesn't solve the problem of promptly "updating" drivers for
> > system devices.
> > 
> > Opinions ?
> > 
> 
> I don't think working around it like this is going to be maintainable
> in the long term. 

I don't like it either, just trying to layout all the options :)

> I would much rather move to a model where the OS
> [conditionally] preserves the bus and BAR assignments, perhaps in a
> more fine-grained way than _DSM #5? Or at least more permissive.

Fine grained is always difficult since we always have to escalate up
the parent chain but yes. Maybe the approach I described above, ie,
mark those devices when we "detect" them early (fb, serial spcr, ...),
keep a list of addresses, which we then match at probe time with the
original firmware allocation, and mark those devices for resource
allocation.

Then we do a pass of allocation for those resources and their parents
only.

> The last time this came up in the PCI firmware SIG, we did discuss
> _DSM #5 at intermediate levels of the resource tree IIRC, which could
> be one way around this. But I'd still prefer a model where the
> resource assignments are guaranteed to be preserved if they meet some
> kind of agreed standard between the OS and the firmware.

I don't necessarily want to make it something that require firmware
changes and even less firmware standard changes. We are broken today.
SPCR is broken today. I think any case where we use an address provided
by firmware which might correspond to a PCI device to be later
reallocated is going to break.

x86 lived with honoring UEFI allocations forever, with a few quirks but
overall it has worked well. I really wonder why we can't do the same.

If we really feel bad about it, then I would suggest that approach
described above, where we keep a list of those early obtained addresses
and later at PCI discovery, match them to devices and force those to be
allocated.

Opinions ?

Cheers
Ben.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [EXTERNAL]arm64 PCI resource allocation issue
  2022-08-02  9:18   ` [EXTERNAL]arm64 " David Woodhouse
@ 2022-08-03  2:32     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 8+ messages in thread
From: Benjamin Herrenschmidt @ 2022-08-03  2:32 UTC (permalink / raw)
  To: David Woodhouse, Ard Biesheuvel
  Cc: linux-pci, bhelgaas, mark.rutland, linux-arm-kernel,
	Jeremy Linton, Ali Saidi, Robin Murphy

On Tue, 2022-08-02 at 10:18 +0100, David Woodhouse wrote:
> On Tue, 2022-08-02 at 09:46 +0200, Ard Biesheuvel wrote:
> > > If we want this, I would propose (happy to provide the implementation
> > > but let's discuss the design first) something along the line of a
> > > generic mechanism to "register" such a system device, which would add
> > > it to a list. That list would be scanned on PCI device discovery for
> > > BAR address matches, and the pci_dev/BAR# added to the entry (that or
> > > put a pointer to the entry into pci_dev for speed/efficiency).
> > 
> > This means that bus numbers cannot be reassigned, which I don't think
> > we rely on today. To positively identify a PCI device, you'll need
> > some kind of path notation to avoid relying on the bus numbers
> > assigned by the firmware (this could happen for hot-pluggable root
> > ports where no bus range has been reserved by the firmware)
> 
> That kind of path notation already exists for the Intel IOMMU, and
> probably others. See dmar_match_pci_path(), dmar_pci_bus_add_dev() etc.
> 
> It would be good to lift that out and make it generic, rather than
> reinventing another version.

I think this is a completely orthogonal issue to what I'm trying to
solve. I don't think we actually have a problem with bus numbers
changing (see my other response).

Yes, bus-number-agnostic PCI paths have been a thing for a long time in
device-tree land :)

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: arm64 PCI resource allocation issue
  2022-08-02  4:07 arm64 PCI resource allocation issue Benjamin Herrenschmidt
  2022-08-02  7:46 ` Ard Biesheuvel
@ 2022-08-04 10:36 ` Lorenzo Pieralisi
  2022-08-04 23:51   ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 8+ messages in thread
From: Lorenzo Pieralisi @ 2022-08-04 10:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linux-pci, bhelgaas, mark.rutland, linux-arm-kernel,
	Jeremy Linton, Ali Saidi, David Woodhouse

On Tue, Aug 02, 2022 at 02:07:00PM +1000, Benjamin Herrenschmidt wrote:

[...]

> The case back then was that there existed some (how many ? there was
> one real example if I remember correctly) bogus firwmares that came out
> of UEFI with too small windows. We could just quirk those ....

There is just one way to discover "how many" unfortunately, quirking
those can be more problematic than it seems.

[...]

> The alternative here would be to use ad-hock kludges for such system
> devices, to "register" the addresses early, and have some kind of hook
> in the PCI code that keeps track of them as they get remapped.

That's what x86 does AFAICS (pcibios_save_fw_addr()), even though
it is used in a different scope (ie revert to FW address if the
resource allocation fails).

> If we want this, I would propose (happy to provide the implementation
> but let's discuss the design first) something along the line of a
> generic mechanism to "register" such a system device, which would add
> it to a list. That list would be scanned on PCI device discovery for
> BAR address matches, and the pci_dev/BAR# added to the entry (that or
> put a pointer to the entry into pci_dev for speed/efficiency).
> 
> The difficulty is how is that update propagated:
> 
> This is of course fiddly. For example, the serial info is passed via
> two different ways, one being earlycon (and probably the easiest to
> track), the other one an ASCII string passed to
> add_preferred_console(), which would require more significant hackery
> (the code dealing with console mathing is a gothic horror).
> 
> Also if such a system device is in continuous use during the boot
> process (UART ?) it needs to be "updated" as soon as possible after the
> BARs (and parent BARs) have been also updated (in fact this is
> generally why PCI debug dies horribly when using PCI based UARTs).
> Maybe an (optional) callback that earlycon can add ?
> 
> Additionally, this would only work if such system devices are
> "registered" before they get remapped... 
> 
> Another approach would be to have pci_dev keep a copy of the original
> resources (at least for the primary BARs) and provide an accessor for
> use by things like earlycon or 8250 to compare against these, though
> that doesn't solve the problem of promptly "updating" drivers for
> system devices.
> 
> Opinions ?

You may also want to look into IORESOURCE_PCI_FIXED even though the
last time I looked into I found some broken logic (basically the
immutable/"fixed" BAR resources should obviously take into account the
PCI tree hierarchy - upstream bridges, etc., which I don't think
IORESOURCE_PCI_FIXED does - how it works remains a bit of
a mystery for me).

Lorenzo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: arm64 PCI resource allocation issue
  2022-08-04 10:36 ` Lorenzo Pieralisi
@ 2022-08-04 23:51   ` Benjamin Herrenschmidt
  2022-08-15 23:02     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 8+ messages in thread
From: Benjamin Herrenschmidt @ 2022-08-04 23:51 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: linux-pci, bhelgaas, mark.rutland, linux-arm-kernel,
	Jeremy Linton, Ali Saidi, David Woodhouse

On Thu, 2022-08-04 at 12:36 +0200, Lorenzo Pieralisi wrote:
> On Tue, Aug 02, 2022 at 02:07:00PM +1000, Benjamin Herrenschmidt wrote:
> 
> [...]
> 
> > The case back then was that there existed some (how many ? there was
> > one real example if I remember correctly) bogus firwmares that came out
> > of UEFI with too small windows. We could just quirk those ....
> 
> There is just one way to discover "how many" unfortunately, quirking
> those can be more problematic than it seems.

Yes it can be but I'm still keen to try, if anything to keep all
UEFI+ACPI platforms on the same basic mechanism.

> [...]
> 
> > The alternative here would be to use ad-hock kludges for such system
> > devices, to "register" the addresses early, and have some kind of hook
> > in the PCI code that keeps track of them as they get remapped.
> 
> That's what x86 does AFAICS (pcibios_save_fw_addr()), even though
> it is used in a different scope (ie revert to FW address if the
> resource allocation fails).

Right. Another kludge... It won't work much better than
IORESOURCE_PCI_FIXED the minute the device is below any amount of
bridges that have themselves be re-assigned. Worst, we might have moved
something else over to where the FW left that device.

It's ugly as heck :-) Oh well... I also really don't like how it
maintains that separate list, would be much nicer to have something
hanging off pci_dev.

../...

> > Opinions ?
> 
> You may also want to look into IORESOURCE_PCI_FIXED even though the
> last time I looked into I found some broken logic (basically the
> immutable/"fixed" BAR resources should obviously take into account the
> PCI tree hierarchy - upstream bridges, etc., which I don't think
> IORESOURCE_PCI_FIXED does - how it works remains a bit of
> a mystery for me).

It doesn't really work for the reasons you cited ... or rather it works
in limited cases. I did look into it as well ages ago, and unless
things changed, it was broken and not easily fixable. Our resource
allocation code is.... intricated.

That leaves us with 3 overall routes I can think of (we can figure out
the details next):

 1) We can try to detect early those devices (easy with SPCR, are there
more on aarch64 ? on x86 there is) and hammer them into place, flagging
them somewhat and forcing them (and all their parents) to keep their
resources.

Pros: It's rather easy to implement, we can "register" the addresses
early and have the PCI probe code match detected devices againt that
list & flag them (for example IORESOURCE_PCI_FIXES :-) and their
parents.

Cons: It will force entire bus hierarchies to be fixed, which might not
really help on firmwares that are known to setup sub-optimal apertures
(or even completely b0rked ones). But we don't know who those are
except maybe one or two if we dig down into the previous version of
that discussion from a couple of years ago.

 2) We can try to "keep track" of them as they move. Variant A.

We do it the way efifb does it and wrap that in something a bit nicer
as follow:

 - We add a helper to "record" a pci_dev/BAR#/offset combination and an
other one to do the lookup & fixup of a FW originated address.

 - We make efifb quirk use that instead of its existing global
"bar_resource".

 - We add a similar quirk to the ACPI code that parses SPCR and (maybe)
another one for earlycon (hint they may be the same device, some
deduplication would be useful).
 
 - We update 8250_pci (I assume pl01x are never PCI ?) to call this to
"fixup" addresses obtained from earlycon. That's the easy bit. SPRC is
trickier, we'd need to fixup addressed parsed from
add_preferred_console() .. I'm not 100% sure there's a case where such
an address would be added post-PCI-remap and we might incorrectly fix
it up.  I don't think so but ...

Pros: It should (hopefully) not be overly complicated and reasonably
self contained, low risk.

Cons: 

 - It's a bit more complicated than other solutions, though not
insanely

 - This doesn't solve the problem of a driver such as earlycon being
"live" accross the remapping (and thus means we'll probably still have
verybose PCI probing with earlycon dying horribly). This is already
partially broken since we temporarily disable decoding during probing
but that's a small window ... We can look at solving that separately by
adding on top of this registration mechanism: We *could* optionally
register in our above helper a pair of callbacks that the PCI code
would call for each registered "early device" before and after
remapping to "suspend access" and "fixup address". Those would be
ideally called around the remapping of the entire host bridge the
device is on.

3) Keeping track, Variant B

(note: the more I think about it, the more I prefer variant A but let's
see what others think)

We generalize pcibios_save_fw_addr() and for the sake of it, we move
that into pci_dev which simplifies everything and gets rid of that
separate list.

Then, things like efifb, 8250_pci etc... do a lookup in there for
addresses they obtain from screen_info, earlycon,
add_preferred_console.. and on match, perform the necessary fixup.
Assuming we are confident those addresses originate from before the PCI
remapping that is.

Pros: It *seems* even simpler than the above other options and maybe
even faster.

Cons: It's more resource intensive as we now backup original BARs for
everything under the sun. It also doesn't provide a great path to
address the case I mentioned earlier for dealing with "live" devices.

That's all I came up with ... Any better ideas and any preferences ? At
this point I'm reasonably keen on (2) (tracking variant A).

I'll be travelling this weekend and next week, so probably won't have
time to produce much code but this has been broken forever so I don't
see a huge emergency. So unless somebody beats me to it or strongly
objects, I'll start hacking at it in the next few weeks.

Cheers,
Ben.




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: arm64 PCI resource allocation issue
  2022-08-04 23:51   ` Benjamin Herrenschmidt
@ 2022-08-15 23:02     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 8+ messages in thread
From: Benjamin Herrenschmidt @ 2022-08-15 23:02 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: linux-pci, bhelgaas, mark.rutland, linux-arm-kernel,
	Jeremy Linton, Ali Saidi, David Woodhouse

Still hoping for opinions on this ... I'll probably have time to hack
on it next week...

On Fri, 2022-08-05 at 09:51 +1000, Benjamin Herrenschmidt wrote:

> That leaves us with 3 overall routes I can think of (we can figure out
> the details next):
> 
>  1) We can try to detect early those devices (easy with SPCR, are
> there more on aarch64 ? on x86 there is) and hammer them into place,
> flagging them somewhat and forcing them (and all their parents) to keep their
> resources.
> 
> Pros: It's rather easy to implement, we can "register" the addresses
> early and have the PCI probe code match detected devices againt that
> list & flag them (for example IORESOURCE_PCI_FIXES :-) and their
> parents.
> 
> Cons: It will force entire bus hierarchies to be fixed, which might
> not really help on firmwares that are known to setup sub-optimal
> apertures (or even completely b0rked ones). But we don't know who those are
> except maybe one or two if we dig down into the previous version of
> that discussion from a couple of years ago.
> 
>  2) We can try to "keep track" of them as they move. Variant A.
> 
> We do it the way efifb does it and wrap that in something a bit nicer
> as follow:
> 
>  - We add a helper to "record" a pci_dev/BAR#/offset combination and
> an
> other one to do the lookup & fixup of a FW originated address.
> 
>  - We make efifb quirk use that instead of its existing global
> "bar_resource".
> 
>  - We add a similar quirk to the ACPI code that parses SPCR and
> (maybe) another one for earlycon (hint they may be the same device, some
> deduplication would be useful).
>  
>  - We update 8250_pci (I assume pl01x are never PCI ?) to call this
> to "fixup" addresses obtained from earlycon. That's the easy bit. SPRC
> is trickier, we'd need to fixup addressed parsed from
> add_preferred_console() .. I'm not 100% sure there's a case where
> such an address would be added post-PCI-remap and we might incorrectly fix
> it up.  I don't think so but ...
> 
> Pros: It should (hopefully) not be overly complicated and reasonably
> self contained, low risk.
> 
> Cons: 
> 
>  - It's a bit more complicated than other solutions, though not
> insanely
> 
>  - This doesn't solve the problem of a driver such as earlycon being
> "live" accross the remapping (and thus means we'll probably still
> have verbose PCI probing with earlycon dying horribly). This is already
> partially broken since we temporarily disable decoding during probing
> but that's a small window ... We can look at solving that separately
> by adding on top of this registration mechanism: We *could* optionally
> register in our above helper a pair of callbacks that the PCI code
> would call for each registered "early device" before and after
> remapping to "suspend access" and "fixup address". Those would be
> ideally called around the remapping of the entire host bridge the
> device is on.
> 
> 3) Keeping track, Variant B
> 
> (note: the more I think about it, the more I prefer variant A but
> let's see what others think)
> 
> We generalize pcibios_save_fw_addr() and for the sake of it, we move
> that into pci_dev which simplifies everything and gets rid of that
> separate list.
> 
> Then, things like efifb, 8250_pci etc... do a lookup in there for
> addresses they obtain from screen_info, earlycon,
> add_preferred_console.. and on match, perform the necessary fixup.
> Assuming we are confident those addresses originate from before the
> PCI remapping that is.
> 
> Pros: It *seems* even simpler than the above other options and maybe
> even faster.
> 
> Cons: It's more resource intensive as we now backup original BARs for
> everything under the sun. It also doesn't provide a great path to
> address the case I mentioned earlier for dealing with "live" devices.
> 
> That's all I came up with ... Any better ideas and any preferences ?
> At this point I'm reasonably keen on (2) (tracking variant A).



^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-08-16  2:44 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-02  4:07 arm64 PCI resource allocation issue Benjamin Herrenschmidt
2022-08-02  7:46 ` Ard Biesheuvel
2022-08-02  9:18   ` [EXTERNAL]arm64 " David Woodhouse
2022-08-03  2:32     ` Benjamin Herrenschmidt
2022-08-03  2:31   ` arm64 " Benjamin Herrenschmidt
2022-08-04 10:36 ` Lorenzo Pieralisi
2022-08-04 23:51   ` Benjamin Herrenschmidt
2022-08-15 23:02     ` Benjamin Herrenschmidt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).