linux-pci.vger.kernel.org archive mirror
* Re: [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window]
       [not found] <SL2P216MB01874DFDDBDE49B935A9B1B380E50@SL2P216MB0187.KORP216.PROD.OUTLOOK.COM>
@ 2019-06-19 16:21 ` Logan Gunthorpe
  2019-06-20  0:44   ` Nicholas Johnson
  2019-06-27  7:50   ` Nicholas Johnson
  0 siblings, 2 replies; 21+ messages in thread
From: Logan Gunthorpe @ 2019-06-19 16:21 UTC (permalink / raw)
  To: Nicholas Johnson; +Cc: benh, Bjorn Helgaas, linux-pci

*(cc'd back Bjorn and the list)

On 2019-06-19 8:00 a.m., Nicholas Johnson wrote:
> Hi Ben and Logan,
> 
> It looks like my git send-email has not been working correctly since I
> started trying to get these patches accepted. I may have remedied this
> now, but I have seen that Logan tried to find these patches and failed.
> So as a courtesy until I post PATCH v7 (hopefully correctly, this time),
> I am forwarding you the patches. I hope you like them. I would love to 
> know of any concerns or questions you may have, and / or what happens if 
> you test them. Thanks and all the best!
> 
> ----- Forwarded message from Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au> -----
> 
> Date: Thu, 23 May 2019 06:29:27 +0800
> From: Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
> To: linux-kernel@vger.kernel.org
> Cc: linux-pci@vger.kernel.org, bhelgaas@google.com, mika.westerberg@linux.intel.com, corbet@lwn.net, Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
> Subject: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window
> X-Mailer: git-send-email 2.19.1
> 
> Background
> ==========================================================================
> 
> Solve bug report:
> https://bugzilla.kernel.org/show_bug.cgi?id=203243

This is all kinds of confusing... the bug report just seems to be a copy
of the patch set. The description of the actual symptoms of the problem
appears to be missing from all of it.

> Currently, the kernel can sometimes assign the MMIO_PREF window
> additional size into the MMIO window, resulting in double the MMIO
> additional size, even if the MMIO_PREF window was successful.
> 
> This happens if in the first pass, the MMIO_PREF succeeds but the MMIO
> fails. In the next pass, because MMIO_PREF is already assigned, the
> attempt to assign MMIO_PREF returns an error code instead of success
> (nothing more to do, already allocated).
> 
> Example of problem (more context can be found in the bug report URL):
> 
> Mainline kernel:
> pci 0000:06:01.0: BAR 14: assigned [mem 0x90100000-0xa00fffff] = 256M
> pci 0000:06:04.0: BAR 14: assigned [mem 0xa0200000-0xb01fffff] = 256M
> 
> Patched kernel:
> pci 0000:06:01.0: BAR 14: assigned [mem 0x90100000-0x980fffff] = 128M
> pci 0000:06:04.0: BAR 14: assigned [mem 0x98200000-0xa01fffff] = 128M
> 
> This was using pci=realloc,hpmemsize=128M,nocrs - on the same machine
> with the same configuration, with an Ubuntu mainline kernel and a kernel
> patched with this patch series.
> 
> This patch is vital for the next patch in the series. The next patch
> allows the user to specify MMIO and MMIO_PREF independently. If the
> MMIO_PREF is set to be very large, this bug will end up more than
> doubling the MMIO size. The bug results in the MMIO_PREF being added to
> the MMIO window, which means doubling if MMIO_PREF size == MMIO size.
> With a large MMIO_PREF, without this patch, the MMIO window will likely
> fail to be assigned altogether due to lack of 32-bit address space.
> 
> Patch notes
> ==========================================================================
> 
> Change find_free_bus_resource() to not skip assigned resources with
> non-null parent.
> 
> Add checks in pbus_size_io() and pbus_size_mem() to return success if
> resource returned from find_free_bus_resource() is already allocated.
> 
> This avoids pbus_size_io() and pbus_size_mem() returning error code to
> __pci_bus_size_bridges() when a resource has been successfully assigned
> in a previous pass. This fixes the existing behaviour where space for a
> resource could be reserved multiple times in different parent bridge
> windows. This also greatly reduces the number of failed BAR messages in
> dmesg when Linux assigns resources.

This patch looks like it addresses the same bug that I tracked down
earlier but solved in a slightly different way. See this patch[1] which
is still under review. Can you maybe test it and see if it solves the
same problem?

Thanks,

Logan

[1]
https://lore.kernel.org/lkml/20190531171216.20532-2-logang@deltatee.com/T/#u

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window]
  2019-06-19 16:21 ` [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window] Logan Gunthorpe
@ 2019-06-20  0:44   ` Nicholas Johnson
  2019-06-20  0:49     ` Logan Gunthorpe
  2019-06-20 13:43     ` [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window] Bjorn Helgaas
  2019-06-27  7:50   ` Nicholas Johnson
  1 sibling, 2 replies; 21+ messages in thread
From: Nicholas Johnson @ 2019-06-20  0:44 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: benh, Bjorn Helgaas, linux-pci

On Wed, Jun 19, 2019 at 10:21:21AM -0600, Logan Gunthorpe wrote:
> *(cc'd back Bjorn and the list)
> 
> On 2019-06-19 8:00 a.m., Nicholas Johnson wrote:
> > Hi Ben and Logan,
> > 
> > It looks like my git send-email has been not working correctly since I
> > started trying to get these patches accepted. I may have remedied this
> > now, but I have seen that Logan tried to find these patches and failed.
> > So as a courtesy until I post PATCH v7 (hopefully correctly, this time),
> > I am forwarding you the patches. I hope you like them. I would love to 
> > know of any concerns or questions you may have, and / or what happens if 
> > you test them. Thanks and all the best!
> > 
> > ----- Forwarded message from Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au> -----
> > 
> > Date: Thu, 23 May 2019 06:29:27 +0800
> > From: Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
> > To: linux-kernel@vger.kernel.org
> > Cc: linux-pci@vger.kernel.org, bhelgaas@google.com, mika.westerberg@linux.intel.com, corbet@lwn.net, Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
> > Subject: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window
> > X-Mailer: git-send-email 2.19.1
> > 
> > Background
> > ==========================================================================
> > 
> > Solve bug report:
> > https://bugzilla.kernel.org/show_bug.cgi?id=203243
> 
> This is all kinds of confusing... the bug report just seems to be a copy
> of the patch set. The description of the actual symptoms of the problem
> appears to be missing from all of it.
I believe everything to be there, but I can take another look and add 
more details. It is possible I lost track of what I had written where.

There are common elements which I borrowed from the patchset or 
vice-versa, like the pin diagram for using the Thunderbolt add-in card 
for testing.

> 
> > Currently, the kernel can sometimes assign the MMIO_PREF window
> > additional size into the MMIO window, resulting in double the MMIO
> > additional size, even if the MMIO_PREF window was successful.
> > 
> > This happens if in the first pass, the MMIO_PREF succeeds but the MMIO
> > fails. In the next pass, because MMIO_PREF is already assigned, the
> > attempt to assign MMIO_PREF returns an error code instead of success
> > (nothing more to do, already allocated).
> > 
> > Example of problem (more context can be found in the bug report URL):
> > 
> > Mainline kernel:
> > pci 0000:06:01.0: BAR 14: assigned [mem 0x90100000-0xa00fffff] = 256M
> > pci 0000:06:04.0: BAR 14: assigned [mem 0xa0200000-0xb01fffff] = 256M
> > 
> > Patched kernel:
> > pci 0000:06:01.0: BAR 14: assigned [mem 0x90100000-0x980fffff] = 128M
> > pci 0000:06:04.0: BAR 14: assigned [mem 0x98200000-0xa01fffff] = 128M
> > 
> > This was using pci=realloc,hpmemsize=128M,nocrs - on the same machine
> > with the same configuration, with a Ubuntu mainline kernel and a kernel
> > patched with this patch series.
> > 
> > This patch is vital for the next patch in the series. The next patch
> > allows the user to specify MMIO and MMIO_PREF independently. If the
> > MMIO_PREF is set to be very large, this bug will end up more than
> > doubling the MMIO size. The bug results in the MMIO_PREF being added to
> > the MMIO window, which means doubling if MMIO_PREF size == MMIO size.
> > With a large MMIO_PREF, without this patch, the MMIO window will likely
> > fail to be assigned altogether due to lack of 32-bit address space.
> > 
> > Patch notes
> > ==========================================================================
> > 
> > Change find_free_bus_resource() to not skip assigned resources with
> > non-null parent.
> > 
> > Add checks in pbus_size_io() and pbus_size_mem() to return success if
> > resource returned from find_free_bus_resource() is already allocated.
> > 
> > This avoids pbus_size_io() and pbus_size_mem() returning error code to
> > __pci_bus_size_bridges() when a resource has been successfully assigned
> > in a previous pass. This fixes the existing behaviour where space for a
> > resource could be reserved multiple times in different parent bridge
> > windows. This also greatly reduces the number of failed BAR messages in
> > dmesg when Linux assigns resources.
> 
> This patch looks like the same bug that I tracked down earlier but I
> solved in a slightly different way. See this patch[1] which is still
> under review. Can you maybe test it and see if it solves the same problem?

I read [1] and it is definitely the same bug, without a doubt. This is 
fantastic because it means I have somebody to back me up on this. I will 
test the patch as soon as I can - perhaps after work today.

My initial thoughts on patch [1] are that restricting 64-bit BARs to 
64-bit windows might break assigning 64-bit BARs on bridges without the 
optional prefetchable window. My patch should not have that issue - but 
after I have tested [1], it might turn out to be fine.

Correct me if my assumptions about windows are wrong. My understanding 
cannot be perfect. As far as I know, 64-bit BARs should 
always be prefetchable, but I own the Aquantia AQC-107S NIC and it has 
three 64-bit non-pref BARs. It happens that they are assigned into the 
32-bit window. I will see if [1] patch prevents that from happening or 
not.

Cheers

> 
> Thanks,
> 
> Logan
> 
> [1]
> https://lore.kernel.org/lkml/20190531171216.20532-2-logang@deltatee.com/T/#u


* Re: [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window]
  2019-06-20  0:44   ` Nicholas Johnson
@ 2019-06-20  0:49     ` Logan Gunthorpe
  2019-06-23  5:01       ` Nicholas Johnson
  2019-06-20 13:43     ` [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window] Bjorn Helgaas
  1 sibling, 1 reply; 21+ messages in thread
From: Logan Gunthorpe @ 2019-06-20  0:49 UTC (permalink / raw)
  To: Nicholas Johnson; +Cc: benh, Bjorn Helgaas, linux-pci



On 2019-06-19 6:44 p.m., Nicholas Johnson wrote:
>> This is all kinds of confusing... the bug report just seems to be a copy
>> of the patch set. The description of the actual symptoms of the problem
>> appears to be missing from all of it.
> I believe everything to be there, but I can take another look and add 
> more details. It is possible I lost track of what I had written where.
> 
> There are common elements which I borrowed from the patchset or 
> vice-versa, like the pin diagram for using the Thunderbolt add-in card 
> for testing.

What's missing are symptoms of the bug or what you are actually seeing
with what hardware. The closest thing to that is the bug's title. But
it's not clear what the problem is with having a double size MMIO window.

The pin diagram and stuff is just noise to me because I don't have that
hardware and am not going to buy it just to try to figure out if there
is a bug there or not.

>>
>>> Currently, the kernel can sometimes assign the MMIO_PREF window
>>> additional size into the MMIO window, resulting in double the MMIO
>>> additional size, even if the MMIO_PREF window was successful.
>>>
>>> This happens if in the first pass, the MMIO_PREF succeeds but the MMIO
>>> fails. In the next pass, because MMIO_PREF is already assigned, the
>>> attempt to assign MMIO_PREF returns an error code instead of success
>>> (nothing more to do, already allocated).
>>>
>>> Example of problem (more context can be found in the bug report URL):
>>>
>>> Mainline kernel:
>>> pci 0000:06:01.0: BAR 14: assigned [mem 0x90100000-0xa00fffff] = 256M
>>> pci 0000:06:04.0: BAR 14: assigned [mem 0xa0200000-0xb01fffff] = 256M
>>>
>>> Patched kernel:
>>> pci 0000:06:01.0: BAR 14: assigned [mem 0x90100000-0x980fffff] = 128M
>>> pci 0000:06:04.0: BAR 14: assigned [mem 0x98200000-0xa01fffff] = 128M
>>>
>>> This was using pci=realloc,hpmemsize=128M,nocrs - on the same machine
>>> with the same configuration, with a Ubuntu mainline kernel and a kernel
>>> patched with this patch series.
>>>
>>> This patch is vital for the next patch in the series. The next patch
>>> allows the user to specify MMIO and MMIO_PREF independently. If the
>>> MMIO_PREF is set to be very large, this bug will end up more than
>>> doubling the MMIO size. The bug results in the MMIO_PREF being added to
>>> the MMIO window, which means doubling if MMIO_PREF size == MMIO size.
>>> With a large MMIO_PREF, without this patch, the MMIO window will likely
>>> fail to be assigned altogether due to lack of 32-bit address space.
>>>
>>> Patch notes
>>> ==========================================================================
>>>
>>> Change find_free_bus_resource() to not skip assigned resources with
>>> non-null parent.
>>>
>>> Add checks in pbus_size_io() and pbus_size_mem() to return success if
>>> resource returned from find_free_bus_resource() is already allocated.
>>>
>>> This avoids pbus_size_io() and pbus_size_mem() returning error code to
>>> __pci_bus_size_bridges() when a resource has been successfully assigned
>>> in a previous pass. This fixes the existing behaviour where space for a
>>> resource could be reserved multiple times in different parent bridge
>>> windows. This also greatly reduces the number of failed BAR messages in
>>> dmesg when Linux assigns resources.
>>
>> This patch looks like the same bug that I tracked down earlier but I
>> solved in a slightly different way. See this patch[1] which is still
>> under review. Can you maybe test it and see if it solves the same problem?
> 
> I read [1] and it is definitely the same bug, without a doubt. This is 
> fantastic because it means I have somebody to back me up on this. I will 
> test the patch as soon as I can - perhaps after work today.
> 
> My initial thoughts of [1] patch are that restricting 64-bit BARs to 
> 64-bit windows might break assigning 64-bit BARs on bridges without the 
> optional prefetchable window. My patch should not have that issue - but 
> after I have tested [1], it might turn out to be fine.
> 
> Correct me if I am wrong about assumptions about windows. My 
> understanding cannot be perfect. As far as I know, 64-bit BARs should 
> always be prefetchable, but I own the Aquantia AQC-107S NIC and it has 
> three 64-bit non-pref BARs. It happens that they are assigned into the 
> 32-bit window. I will see if [1] patch prevents that from happening or 
> not.

As best as I can tell, the patches should have identical functionality.
My patch ignores the error returned by pbus_size_mem(); your patch
prevents the function from returning an error in the same case.

Logan


* Re: [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window]
  2019-06-20  0:44   ` Nicholas Johnson
  2019-06-20  0:49     ` Logan Gunthorpe
@ 2019-06-20 13:43     ` Bjorn Helgaas
  2019-06-20 23:24       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 21+ messages in thread
From: Bjorn Helgaas @ 2019-06-20 13:43 UTC (permalink / raw)
  To: Nicholas Johnson; +Cc: Logan Gunthorpe, benh, linux-pci

On Thu, Jun 20, 2019 at 12:44:11AM +0000, Nicholas Johnson wrote:

> Correct me if I am wrong about assumptions about windows. My
> understanding cannot be perfect. As far as I know, 64-bit BARs
> should always be prefetchable, 

There's no requirement that a 64-bit BAR be prefetchable.

  - BARs of PCIe Functions must be prefetchable unless they have read
    side effects or can't tolerate write merging (PCIe r5.0, sec
    7.5.1.2.1).

  - BARs of PCIe Functions other than Legacy Endpoints must be 64-bit
    if they are prefetchable (sec 7.5.1.2.1).

  - Bridge non-prefetchable memory windows are limited to 32-bit
    (7.5.1.3.8).

  - There's some ambiguity in the spec about bridge prefetchable
    memory windows.  Current specs claim 64-bit addresses must be
    supported (sec 7.5.1.3.9), but also say the upper 32 bits are
    optional (sec 7.5.1.3.10).  Both 32- and 64-bit versions
    definitely exist.

> but I own the Aquantia AQC-107S NIC and it has three 64-bit non-pref
> BARs. It happens that they are assigned into the 32-bit window.

This is as it should be.  Non-prefetchable windows are 32 bits, and
in general non-prefetchable BARs must be placed there.

There is some wiggle room in pure PCIe systems because PCIe reads
always contain an explicit length, so in some cases it is safe to
put a non-prefetchable BAR in a prefetchable window (see the
implementation note in sec 7.5.1.2.1).  But I don't think Linux
currently implements this.


* Re: [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window]
  2019-06-20 13:43     ` [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window] Bjorn Helgaas
@ 2019-06-20 23:24       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 21+ messages in thread
From: Benjamin Herrenschmidt @ 2019-06-20 23:24 UTC (permalink / raw)
  To: Bjorn Helgaas, Nicholas Johnson; +Cc: Logan Gunthorpe, linux-pci

On Thu, 2019-06-20 at 08:43 -0500, Bjorn Helgaas wrote:
> This is as it should be.  Non-prefetchable windows are 32 bits, and
> in general non-prefetchable BARs must be placed there.
> 
> There is some wiggle room in pure PCIe systems because PCIe reads
> always contain an explicit length, so in some cases it is safe to
> put a non-prefetchable BAR in a prefetchable window (see the
> implementation note in sec 7.5.1.2.1).  But I don't think Linux
> currently implements this.

We don't, we probably should, but seeing our current allocation code, I
dread the end result ...

We would need a host bridge flag to indicate it's safe (no byte merging
at the PHB). I know most host bridge implementations don't
differentiate prefetchable from non-prefetchable outbound windows, so
we should be fine, and the other side effects are generally attributes
of the mapping done in the MMU and thus depend on the device BAR
attribute, not the bridge windows along the path.

I'm not 100% sure how/if x86 throws a wrench into this with MTRRs
(could a BIOS set up one of these things to cover a bridge/switch
prefetchable window? That would be a bad idea, but bad ideas are what
BIOS vendors often come up with).

Cheers,
Ben.




* Re: [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window]
  2019-06-20  0:49     ` Logan Gunthorpe
@ 2019-06-23  5:01       ` Nicholas Johnson
  2019-06-24  9:13         ` Multitude of resource assignment functions Benjamin Herrenschmidt
  0 siblings, 1 reply; 21+ messages in thread
From: Nicholas Johnson @ 2019-06-23  5:01 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: benh, Bjorn Helgaas, linux-pci

Bjorn, please weigh in on this; see below.

On Wed, Jun 19, 2019 at 06:49:45PM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2019-06-19 6:44 p.m., Nicholas Johnson wrote:
> >> This is all kinds of confusing... the bug report just seems to be a copy
> >> of the patch set. The description of the actual symptoms of the problem
> >> appears to be missing from all of it.
> > I believe everything to be there, but I can take another look and add 
> > more details. It is possible I lost track of what I had written where.
> > 
> > There are common elements which I borrowed from the patchset or 
> > vice-versa, like the pin diagram for using the Thunderbolt add-in card 
> > for testing.
> 
> What's missing are symptoms of the bug or what you are actually seeing
> with what hardware. The closest thing to that is the bug's title. But
> it's not clear what the problem is with having a double size MMIO window.
> 
> The pin diagram and stuff is just noise to me because I don't have that
> hardware and am not going to buy it just to try to figure out if there
> is a bug there or not.
> 
> >>
> >>> Currently, the kernel can sometimes assign the MMIO_PREF window
> >>> additional size into the MMIO window, resulting in double the MMIO
> >>> additional size, even if the MMIO_PREF window was successful.
> >>>
> >>> This happens if in the first pass, the MMIO_PREF succeeds but the MMIO
> >>> fails. In the next pass, because MMIO_PREF is already assigned, the
> >>> attempt to assign MMIO_PREF returns an error code instead of success
> >>> (nothing more to do, already allocated).
> >>>
> >>> Example of problem (more context can be found in the bug report URL):
> >>>
> >>> Mainline kernel:
> >>> pci 0000:06:01.0: BAR 14: assigned [mem 0x90100000-0xa00fffff] = 256M
> >>> pci 0000:06:04.0: BAR 14: assigned [mem 0xa0200000-0xb01fffff] = 256M
> >>>
> >>> Patched kernel:
> >>> pci 0000:06:01.0: BAR 14: assigned [mem 0x90100000-0x980fffff] = 128M
> >>> pci 0000:06:04.0: BAR 14: assigned [mem 0x98200000-0xa01fffff] = 128M
> >>>
> >>> This was using pci=realloc,hpmemsize=128M,nocrs - on the same machine
> >>> with the same configuration, with a Ubuntu mainline kernel and a kernel
> >>> patched with this patch series.
> >>>
> >>> This patch is vital for the next patch in the series. The next patch
> >>> allows the user to specify MMIO and MMIO_PREF independently. If the
> >>> MMIO_PREF is set to be very large, this bug will end up more than
> >>> doubling the MMIO size. The bug results in the MMIO_PREF being added to
> >>> the MMIO window, which means doubling if MMIO_PREF size == MMIO size.
> >>> With a large MMIO_PREF, without this patch, the MMIO window will likely
> >>> fail to be assigned altogether due to lack of 32-bit address space.
> >>>
> >>> Patch notes
> >>> ==========================================================================
> >>>
> >>> Change find_free_bus_resource() to not skip assigned resources with
> >>> non-null parent.
> >>>
> >>> Add checks in pbus_size_io() and pbus_size_mem() to return success if
> >>> resource returned from find_free_bus_resource() is already allocated.
> >>>
> >>> This avoids pbus_size_io() and pbus_size_mem() returning error code to
> >>> __pci_bus_size_bridges() when a resource has been successfully assigned
> >>> in a previous pass. This fixes the existing behaviour where space for a
> >>> resource could be reserved multiple times in different parent bridge
> >>> windows. This also greatly reduces the number of failed BAR messages in
> >>> dmesg when Linux assigns resources.
> >>
> >> This patch looks like the same bug that I tracked down earlier but I
> >> solved in a slightly different way. See this patch[1] which is still
> >> under review. Can you maybe test it and see if it solves the same problem?
> > 
> > I read [1] and it is definitely the same bug, without a doubt. This is 
> > fantastic because it means I have somebody to back me up on this. I will 
> > test the patch as soon as I can - perhaps after work today.
> > 
> > My initial thoughts of [1] patch are that restricting 64-bit BARs to 
> > 64-bit windows might break assigning 64-bit BARs on bridges without the 
> > optional prefetchable window. My patch should not have that issue - but 
> > after I have tested [1], it might turn out to be fine.
> > 
> > Correct me if I am wrong about assumptions about windows. My 
> > understanding cannot be perfect. As far as I know, 64-bit BARs should 
> > always be prefetchable, but I own the Aquantia AQC-107S NIC and it has 
> > three 64-bit non-pref BARs. It happens that they are assigned into the 
> > 32-bit window. I will see if [1] patch prevents that from happening or 
> > not.
> 
> As best as I can tell the patches should have identical functionality.
> My patch ignores the error returned by pbus_size_mem() your patch forces
> the function from returning an error inside it for the same case.
> 
> Logan
I finally tested this (not rigorously) and it appears to work the same 
as my patch and did not do anything strange. It solves the problem that 
it set out to fix. If you want to add my tested-by then that is fine.

I still slightly prefer my patch because it corrects the return codes 
instead of ignoring them. However, both patches have merits.

Bjorn, please advise on what you think is better and / or easier to sign 
off on.

In my PATCH v7, should I consider moving this to the end of the series 
so that any changes do not impact anything else? That way, I can remove 
the patch if we take Logan's version instead.

Also, if you decide to take Logan's patch, which of the following do I 
do?

a) drop the patch from my series and leave it at that (in this case I 
hope I can have a co-reported-by in Logan's patch when it gets accepted)

b) merge Logan's patch into my series, giving credit

c) something else


If you decide to take mine then we will need to discuss Logan's concerns 
about my documentation and I will need to update some more information 
into the bug report (or link both bug reports into the notes).

Cheers


* Multitude of resource assignment functions
  2019-06-23  5:01       ` Nicholas Johnson
@ 2019-06-24  9:13         ` Benjamin Herrenschmidt
  2019-06-24 16:45           ` Logan Gunthorpe
  0 siblings, 1 reply; 21+ messages in thread
From: Benjamin Herrenschmidt @ 2019-06-24  9:13 UTC (permalink / raw)
  To: Nicholas Johnson, Logan Gunthorpe; +Cc: Bjorn Helgaas, linux-pci

So I'm staring at these three mostly at this point:

void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
void pci_assign_unassigned_bridge_resources(struct pci_dev *bridge)
void pci_assign_unassigned_bus_resources(struct pci_bus *bus)

Now we have 3 functions that fundamentally have the same purpose,
assign what was left unassigned down a PCI hierarchy, but are going
about it in quite a different manner.

Now to make things worse, there's little consistency in which one gets
called where. We have PCI controllers calling the first one sometimes,
the last one sometimes, or doing the manual:

	pci_bus_size_bridges(bus);
	pci_bus_assign_resources(bus);

Or variants with pci_bus_size_bridges sometimes missing etc...

Now I've consolidated a lot of that and removed all of those "manual"
cases in my work-in-progress branch, but I'd like to clarify and
possibly remove the 3 ones above.

Let's start with the last one, pci_assign_unassigned_bus_resources, as
it's the easiest to remove from users in drivers/pci/controller/* (and
replace with pci_assign_unassigned_root_bus_resources typically).

This leaves it used in a couple of corner cases, most of them I think
I can kill, and .... sysfs 'rescan'.

The interesting thing about that function is that it tries to avoid
resizing the bridge of the bus passed as an argument; it will only
resize subordinate bridges. From the changelog it was created for
hotplug bridges, but almost nothing uses it (some powerpc stuff I can
probably kill) ... and sysfs rescan.

I wonder what the remaining purpose of it is. sysfs rescan could
probably be cleaned up to use the first two... Also, why avoid resizing
the bridge itself?

That leads to the difference between
pci_assign_unassigned_root_bus_resources()
and pci_assign_unassigned_bridge_resources().

The names are misleading. The former isn't just about the root bus
resources. It's about the entire tree underneath the root bus.

The main differences that I can tell are:

 - pci_assign_unassigned_root_bus_resources() may or may not try to
realloc, depending on a combination of command line args, config
option, presence of IOV devices etc... while
pci_assign_unassigned_bridge_resources() always will

 - pci_assign_unassigned_bridge_resources() will call
pci_bridge_distribute_available_resources() to distribute resource to
child hotplug bridges, while pci_assign_unassigned_root_bus_resources()
won't.

Now, are we 100% confident we want to keep those discrepancies?

It feels like the former function is intended for boot time resource
allocation, and the latter for hotplug, but I can't make sense of why
the resources of a device behind a hotplug bridge should be allocated
differently depending on whether that device was plugged at boot or
plugged later.

Also, why not distribute available resources at boot between top-level
hotplug bridges?

I'm not even going into the question of why the resource
sizing/assignment code is so obscure/cryptic/incomprehensible, that's
another kettle of fish, but I'd like to at least clarify the usage
patterns a bit better.

Cheers,
Ben.





* Re: Multitude of resource assignment functions
  2019-06-24  9:13         ` Multitude of resource assignment functions Benjamin Herrenschmidt
@ 2019-06-24 16:45           ` Logan Gunthorpe
  2019-06-27  7:40             ` Nicholas Johnson
  0 siblings, 1 reply; 21+ messages in thread
From: Logan Gunthorpe @ 2019-06-24 16:45 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Nicholas Johnson; +Cc: Bjorn Helgaas, linux-pci



On 2019-06-24 3:13 a.m., Benjamin Herrenschmidt wrote:
> So I'm staring at these three mostly at this point:
> 
> void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
> void pci_assign_unassigned_bridge_resources(struct pci_dev *bridge)
> void pci_assign_unassigned_bus_resources(struct pci_bus *bus)
> 
> Now we have 3 functions that fundamentally have the same purpose,
> assign what was left unassigned down a PCI hierarchy, but are going
> about it in quite a different manner.
> 
> Now to make things worse, there's little consistency in which one gets
> called where. We have PCI controllers calling the first one sometimes,
> the last one sometimes, or doing the manual:
> 
> 	pci_bus_size_bridges(bus);
> 	pci_bus_assign_resources(bus);
> 
> Or variants with pci_bus_size_bridges sometimes missing etc...

I suspect there isn't much rhyme or reason to it. None of this is well
documented so developers writing the controller drivers probably didn't
have a good idea of what the correct thing to do was, and just stuck
with the first thing that worked.

> Now I've consolidated a lot of that and removed all of those "manual"
> cases in my work-in-progress branch, but I'd like to clarify and
> possibly remove the 3 ones above.
> 
> Let's start with the last one, pci_assign_unassigned_bus_resources, as
> it's the easiest to remove from users in drivers/pci/controller/* (and
> replace with pci_assign_unassigned_root_bus_resources typically).
> 
> This leaves it used in a couple of corner cases, most of them I think
> I can kill, and .... sysfs 'rescan'.
> 
> The interesting thing about that function is that it tries to avoid
> resizing the bridge of the bus passed as an argument, it will only
> resize subordinate bridges. From the changelog it was created for
> hotplug bridges, but almost none uses it (some powerpc stuff I can
> probably kill) ... and sysfs rescan.
> 
> I wonder what's the remaining purpose of it. sysfs rescan could
> probably be cleaned up to use the two first... Also why avoid resizing
> the bridge itself ?
> 
> That leads to the difference between
> pci_assign_unassigned_root_bus_resources()
> and pci_assign_unassigned_bridge_resources().
> 
> The names are misleading. The former isn't just about the root bus
> resources. It's about the entire tree underneath the root bus.
> 
> The main difference that I can tell are:
> 
>  - pci_assign_unassigned_root_bus_resources() may or may not try to
> realloc, depending on a combination of command line args, config
> option, presence of IOV devices etc... while
> pci_assign_unassigned_bridge_resources() always will
> 
>  - pci_assign_unassigned_bridge_resources() will call
> pci_bridge_distribute_available_resources() to distribute resource to
> child hotplug bridges, while pci_assign_unassigned_root_bus_resources()
> won't.
>
> Now, are we 100% confident we want to keep those discrepancies ?
> 
> It feels like the former function is intended for boot time resource
> allocation, and the latter for hotplug, but I can't make sense of why
> the resources of a device behind a hotplug bridge should be allocated
> differently depending on whether that device was plugged at boot or
> plugged later.

I don't really know, but I kind of assumed reallocing any time but early
in boot would be dangerous. It involves un-assigning a bunch of
resources without any real check to see if a driver is using them or
not. If they were being used by a driver (which is typical) and they
were reassigned, everything would break.

I mean, in theory the code could/should be the same for both paths and
it could just make a single, better decision on whether to realloc or
not. But that's going to be challenging to get there.
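If the two paths were ever unified, the realloc choice might funnel through one shared helper. A rough userspace sketch of that idea (all names, flags, and the decision criteria are invented here for illustration, not actual kernel API):

```c
#include <stdbool.h>

enum realloc_policy { REALLOC_AUTO, REALLOC_ON, REALLOC_OFF };

/*
 * Hypothetical single decision point that both the boot-time and
 * hotplug paths could share, instead of hard-coding different
 * behaviour into pci_assign_unassigned_root_bus_resources() and
 * pci_assign_unassigned_bridge_resources().
 */
bool should_realloc(bool early_boot, bool has_sriov,
		    enum realloc_policy cmdline)
{
	if (cmdline == REALLOC_ON)
		return true;
	if (cmdline == REALLOC_OFF)
		return false;
	/*
	 * AUTO: reallocating once drivers are bound is dangerous,
	 * because in-use BARs would move underneath them; only realloc
	 * early in boot, or when SR-IOV resources force the issue.
	 */
	return early_boot || has_sriov;
}
```

The point is only that the policy becomes one auditable function rather than behaviour implied by which entry point a controller driver happened to call.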

> Also why not distribute available resources at boot between top level
> hotplug bridges ?
>
> I'm not even going into the question of why the resource
> sizing/assignment code is so obscure/cryptic/incomprehensible, that's
> another kettle of fish, but I'd like to at least clarify the usage
> patterns a bit better.

I got the impression the code was designed to generally let the firmware
set things up -- it just fixed things up if the firmware messed it up
somehow. My guess would be it evolved out of a bunch of hacks designed
to fix broken BIOSes into something new platforms used to do full
enumeration (because it happened to work).

Logan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Multitude of resource assignment functions
  2019-06-24 16:45           ` Logan Gunthorpe
@ 2019-06-27  7:40             ` Nicholas Johnson
  2019-06-27  8:48               ` Benjamin Herrenschmidt
  2019-06-27 16:35               ` Logan Gunthorpe
  0 siblings, 2 replies; 21+ messages in thread
From: Nicholas Johnson @ 2019-06-27  7:40 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: Benjamin Herrenschmidt, Bjorn Helgaas, linux-pci

On Mon, Jun 24, 2019 at 10:45:17AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2019-06-24 3:13 a.m., Benjamin Herrenschmidt wrote:
> > So I'm staring at these three mostly at this point:
> > 
> > void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
> > void pci_assign_unassigned_bridge_resources(struct pci_dev *bridge)
> > void pci_assign_unassigned_bus_resources(struct pci_bus *bus)
> > 
> > Now we have 3 functions that fundamentally have the same purpose,
> > assign what was left unassigned down a PCI hierarchy, but are going
> > about it in quite a different manner.
> > 
> > Now to make things worse, there's little consistency in which one gets
> > called where. We have PCI controllers calling the first one sometimes,
> > the last one sometimes, or doing the manual:
> > 
> > 	pci_bus_size_bridges(bus);
> > 	pci_bus_assign_resources(bus);
> > 
> > Or variants with pci_bus_size_bridges sometimes missing etc...
> 
> I suspect there isn't much rhyme or reason to it. None of this is well
> documented so developers writing the controller drivers probably didn't
> have a good idea of what the correct thing to do was, and just stuck
> with the first thing that worked.
> 
> > Now I've consolidated a lot of that and removed all of those "manual"
> > cases in my work-in-progress branch, but I'd like to clarify and
> > possibly remove the 3 ones above.
> > 
> > Let's start with the last one, pci_assign_unassigned_bus_resources, as
> > it's the easiest to remove from users in drivers/pci/controller/* (and
> > replace with pci_assign_unassigned_root_bus_resources typically).
> > 
> > This leaves it used in a couple of corner cases, most of them I think
> > I can kill, and .... sysfs 'rescan'.
> > 
> > The interesting thing about that function is that it tries to avoid
> > resizing the bridge of the bus passed as an argument, it will only
> > resize subordinate bridges. From the changelog it was created for
> > hotplug bridges, but almost none uses it (some powerpc stuff I can
> > probably kill) ... and sysfs rescan.
> > 
> > I wonder what's the remaining purpose of it. sysfs rescan could
> > probably be cleaned up to use the two first... Also why avoid resizing
> > the bridge itself ?
> > 
> > That leads to the difference between
> > pci_assign_unassigned_root_bus_resources()
> > and pci_assign_unassigned_bridge_resources().
> > 
> > The names are misleading. The former isn't just about the root bus
> > resources. It's about the entire tree underneath the root bus.
> > 
> > The main difference that I can tell are:
> > 
> >  - pci_assign_unassigned_root_bus_resources() may or may not try to
> > realloc, depending on a combination of command line args, config
> > option, presence of IOV devices etc... while
> > pci_assign_unassigned_bridge_resources() always will
> > 
> >  - pci_assign_unassigned_bridge_resources() will call
> > pci_bridge_distribute_available_resources() to distribute resource to
> > child hotplug bridges, while pci_assign_unassigned_root_bus_resources()
> > won't.
> >
> > Now, are we 100% confident we want to keep those discrepancies ?
> > 
> > It feels like the former function is intended for boot time resource
> > allocation, and the latter for hotplug, but I can't make sense of why
> > the resources of a device behind a hotplug bridge should be allocated
> > differently depending on whether that device was plugged at boot or
> > plugged later.
> 
> I don't really know, but I kind of assumed reallocing any time but early
> in boot would be dangerous. It involves un-assigning a bunch of
> resources without any real check to see if a driver is using them or
> not. If they were being used by a driver (which is typical) and they
> were reassigned, everything would break.
> 
> I mean, in theory the code could/should be the same for both paths and
> it could just make a single, better decision on whether to realloc or
> not. But that's going to be challenging to get there.
> 
> > Also why not distribute available resources at boot between top level
> > hotplug bridges ?
> >
> > I'm not even going into the question of why the resource
> > sizing/assignment code is so obscure/cryptic/incomprehensible, that's
> > another kettle of fish, but I'd like to at least clarify the usage
> > patterns a bit better.
> I got the impression the code was designed to generally let the firmware
> set things up -- it just fixed things up if the firmware messed it up
> somehow. My guess would be it evolved out of a bunch of hacks designed
> to fix broken bioses into something new platforms used to do full
> enumeration (because it happened to work).

Unfortunately, the operating system is designed to let the firmware do 
things. In my mind, ACPI should not need to exist, and the operating 
system should start with a clean state with PCI and re-enumerate 
everything at boot time. The PCI allocation is so broken and 
inconsistent (as you have noted) because it tries to combine the two, 
when firmware enumeration and native enumeration should be mutually 
exclusive. I have attempted to re-write large chunks of probe.c, pci.c 
and setup-bus.c to completely disregard firmware enumeration and clean 
everything up. Unfortunately, I get stuck in probe.c with the double 
recursive loop which assigns bus numbers - I cannot figure out how to 
re-write it successfully. Plus, I feel like nobody will be ready for 
such a drastic change - I am having trouble selling minor changes that 
fix actual use cases, as opposed to code reworking.

My next proposal might be a kernel parameter for PCI to set various 
levels of disregard for firmware, from none to complete, which can be 
added to incrementally to do more and more (rather than all in one patch 
series). This could supersede pci=realloc. The realloc option is so 
broken because once the system has loaded drivers, it becomes next to 
impossible to free and reallocate a resource to fit in another device 
without upsetting existing devices. The realloc option is only useful 
in early boot because nothing is yet assigned, so it works. However, 
the same effect can be achieved by releasing all the resources on the 
root port before anything happens. I think it was 
pci_assign_unassigned_resources(), and I did verify this experimentally. 
This switch could be part of such a new kernel parameter to ignore 
firmware influence on PCI.
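A tiered parameter could map a cmdline string to an escalating level of firmware disregard, for instance (purely illustrative; the parameter name and level names here are made up, not an existing kernel option):

```c
#include <string.h>

/*
 * Illustrative parser for a hypothetical "pci=fwlevel=" parameter:
 *   0 = honour firmware assignments (today's default behaviour),
 *   1 = release and reassign everything below hotplug bridges,
 *   2 = ignore firmware completely and re-enumerate from scratch.
 * Returns -1 for an unrecognised value.
 */
int parse_fwlevel(const char *arg)
{
	if (strcmp(arg, "none") == 0)
		return 0;
	if (strcmp(arg, "bridges") == 0)
		return 1;
	if (strcmp(arg, "all") == 0)
		return 2;
	return -1;
}
```

Each level could then be wired up incrementally in separate series, with level 2 subsuming what pci=realloc tries to do today.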

I hope that somehow we can transition to ignoring the firmware - because 
firmware and native enumeration need to be mutually exclusive, and we 
need native enumeration for PCI hotplug. If anybody has any ideas how, I 
would love to hear them.

Nicholas

> 
> Logan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window]
  2019-06-19 16:21 ` [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window] Logan Gunthorpe
  2019-06-20  0:44   ` Nicholas Johnson
@ 2019-06-27  7:50   ` Nicholas Johnson
  2019-06-27 16:54     ` Logan Gunthorpe
  1 sibling, 1 reply; 21+ messages in thread
From: Nicholas Johnson @ 2019-06-27  7:50 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: benh, Bjorn Helgaas, linux-pci

On Wed, Jun 19, 2019 at 10:21:21AM -0600, Logan Gunthorpe wrote:
> *(cc'd back Bjorn and the list)
> 
> On 2019-06-19 8:00 a.m., Nicholas Johnson wrote:
> > Hi Ben and Logan,
> > 
> > It looks like my git send-email has been not working correctly since I
> > started trying to get these patches accepted. I may have remedied this
> > now, but I have seen that Logan tried to find these patches and failed.
> > So as a courtesy until I post PATCH v7 (hopefully correctly, this time),
> > I am forwarding you the patches. I hope you like them. I would love to 
> > know of any concerns or questions you may have, and / or what happens if 
> > you test them. Thanks and all the best!
> > 
> > ----- Forwarded message from Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au> -----
> > 
> > Date: Thu, 23 May 2019 06:29:27 +0800
> > From: Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
> > To: linux-kernel@vger.kernel.org
> > Cc: linux-pci@vger.kernel.org, bhelgaas@google.com, mika.westerberg@linux.intel.com, corbet@lwn.net, Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
> > Subject: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window
> > X-Mailer: git-send-email 2.19.1
> > 
> > Background
> > ==========================================================================
> > 
> > Solve bug report:
> > https://bugzilla.kernel.org/show_bug.cgi?id=203243
> 
> This is all kinds of confusing... the bug report just seems to be a copy
> of the patch set. The description of the actual symptoms of the problem
> appears to be missing from all of it.
> 
> > Currently, the kernel can sometimes assign the MMIO_PREF window
> > additional size into the MMIO window, resulting in double the MMIO
> > additional size, even if the MMIO_PREF window was successful.
> > 
> > This happens if in the first pass, the MMIO_PREF succeeds but the MMIO
> > fails. In the next pass, because MMIO_PREF is already assigned, the
> > attempt to assign MMIO_PREF returns an error code instead of success
> > (nothing more to do, already allocated).
> > 
> > Example of problem (more context can be found in the bug report URL):
> > 
> > Mainline kernel:
> > pci 0000:06:01.0: BAR 14: assigned [mem 0x90100000-0xa00fffff] = 256M
> > pci 0000:06:04.0: BAR 14: assigned [mem 0xa0200000-0xb01fffff] = 256M
> > 
> > Patched kernel:
> > pci 0000:06:01.0: BAR 14: assigned [mem 0x90100000-0x980fffff] = 128M
> > pci 0000:06:04.0: BAR 14: assigned [mem 0x98200000-0xa01fffff] = 128M
> > 
> > This was using pci=realloc,hpmemsize=128M,nocrs - on the same machine
> > with the same configuration, with a Ubuntu mainline kernel and a kernel
> > patched with this patch series.
> > 
> > This patch is vital for the next patch in the series. The next patch
> > allows the user to specify MMIO and MMIO_PREF independently. If the
> > MMIO_PREF is set to be very large, this bug will end up more than
> > doubling the MMIO size. The bug results in the MMIO_PREF being added to
> > the MMIO window, which means doubling if MMIO_PREF size == MMIO size.
> > With a large MMIO_PREF, without this patch, the MMIO window will likely
> > fail to be assigned altogether due to lack of 32-bit address space.
> > 
> > Patch notes
> > ==========================================================================
> > 
> > Change find_free_bus_resource() to not skip assigned resources with
> > non-null parent.
> > 
> > Add checks in pbus_size_io() and pbus_size_mem() to return success if
> > resource returned from find_free_bus_resource() is already allocated.
> > 
> > This avoids pbus_size_io() and pbus_size_mem() returning error code to
> > __pci_bus_size_bridges() when a resource has been successfully assigned
> > in a previous pass. This fixes the existing behaviour where space for a
> > resource could be reserved multiple times in different parent bridge
> > windows. This also greatly reduces the number of failed BAR messages in
> > dmesg when Linux assigns resources.
> 
> This patch looks like the same bug that I tracked down earlier but I
> solved in a slightly different way. See this patch[1] which is still
> under review. Can you maybe test it and see if it solves the same problem?
> 
> Thanks,
> 
> Logan
> 
> [1]
> https://lore.kernel.org/lkml/20190531171216.20532-2-logang@deltatee.com/T/#u
[1] says Reported-by: Kit Chow, but I cannot find the bug report on 
bugzilla.kernel.org - should I be linking the bug reports into my 
version of this patch in case it is accepted?

Bjorn never replied to my queries about which should be accepted and 
what I should do either way. For now I am moving my version of this 
patch to the end of my series so that it can easily be knocked off if 
Bjorn prefers your patch.
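For what it is worth, the arithmetic of the bug in the quoted patch description can be modelled in a few lines (a userspace toy model, not the actual setup-bus.c code; 128 stands for hpmemsize=128M):

```c
#include <stdbool.h>

/*
 * Toy model of the two-pass bridge sizing described above. In pass 1
 * the MMIO_PREF window is assigned but the MMIO window fails. In pass
 * 2 the already-assigned MMIO_PREF window is sized again: if that
 * sizing call reports an error ("nothing to do, already allocated"),
 * the caller folds the prefetchable additional size into the MMIO
 * window as well, doubling it.
 */
int final_mmio_size(int hpmemsize, bool pref_sizing_reports_success)
{
	int mmio = hpmemsize;		/* MMIO additional size */

	if (!pref_sizing_reports_success)
		mmio += hpmemsize;	/* MMIO_PREF counted a second time */

	return mmio;
}
```

With hpmemsize=128M this reproduces the 256M versus 128M BAR 14 windows shown in the quoted dmesg excerpts.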

Cheers

Nicholas

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Multitude of resource assignment functions
  2019-06-27  7:40             ` Nicholas Johnson
@ 2019-06-27  8:48               ` Benjamin Herrenschmidt
  2019-06-30  2:40                 ` Nicholas Johnson
  2019-06-27 16:35               ` Logan Gunthorpe
  1 sibling, 1 reply; 21+ messages in thread
From: Benjamin Herrenschmidt @ 2019-06-27  8:48 UTC (permalink / raw)
  To: Nicholas Johnson, Logan Gunthorpe; +Cc: Bjorn Helgaas, linux-pci

On Thu, 2019-06-27 at 07:40 +0000, Nicholas Johnson wrote:
> Unfortunately, the operating system is designed to let the firmware do 
> things. In my mind, ACPI should not need to exist, and the operating 
> system should start with a clean state with PCI and re-enumerate 
> everything at boot time. The PCI allocation is so broken and 
> inconsistent (as you have noted) because it tries to combine the two, 
> when firmware enumeration and native enumeration should be mutually 
> exclusive. I have attempted to re-write large chunks of probe.c, pci.c 
> and setup-bus.c to completely disregard firmware enumeration and clean 
> everything up. Unfortunately, I get stuck in probe.c with the double 
> recursive loop which assigns bus numbers - I cannot figure out how to 
> re-write it successfully. Plus, I feel like nobody will be ready for 
> such a drastic change - I am having trouble selling minor changes that 
> fix actual use cases, as opposed to code reworking.

Well... so a lot of platforms are happy to do a full re-assignment,
though they use the current code today, which leads to rather
substandard results when it comes to hotplug bridges.

All the embedded platforms today are like that, and so is all of ARM64,
though the latter will somewhat change; all DT-based ARM64 will probably
remain that way.

> My next proposal might be a kernel parameter for PCI to set various 
> levels of disregard for firmware

Well, at least ACPI has this _DSM #5 thingy that can tell us that we
are allowed to disregard firmware for selected bits and pieces
(hopefully that tends to be whole hierarchies but I don't know how well
it's used in practice).

> , from none to complete, which can be 
> added to incrementally to do more and more (rather than all in one patch 
> series).

So there are a number of reasons to honor what the firmware did.

First, today (but that's fixable), we suck at setting up reasonable
space for hotplug by default.

But there are more insidious ones. There are platforms where you can't
move things (typically virtualized platforms with specific hypervisors,
such as IBM pseries).

There are platforms where the *runtime* firmware (SMM or equivalent or
even ACPI AML bits) will be poking at some system devices and those
really must not be moved. (In fact there's a theoretical problem with
such devices becoming temporarily inaccessible during BAR sizing today
but we mostly get lucky).

There are other "interesting" cases, like EFI giving us the framebuffer
address to use if we don't have a native driver... which happens to be
off a PCI BAR somewhere. Now we *could* probably try to special case
that and detect when we move that BAR but today we'll probably break if
we move it.

x86 historically has other nasty "hidden" devices. There are historical
cases of devices that break if they move after initial setup, etc...
Most of these things are ancient but we have to ensure we keep today's
policy for old platforms at least.

>  This can supercede pci=realloc. The realloc command is so 
> broken because once the system has loaded drivers, it becomes next to 
> impossible to free and reallocate a resource to fit another device in - 
> because it will upset existing devices. The realloc command is only 
> useful in early boot because nothing is yet assigned, so it works. 
> However, the same effect can be achieved by releasing all the resources 
> on the root port before anything happens. I think it was 
> pci_assign_unassigned_resources(), and I did verify this experimentally. 
> This switch could be part of such a new kernel parameter to ignore 
> firmware influence on PCI.

We should see what ACPI gives us in _DSM #5 on x86 these days.. if it's
meaningful on enough machines we could use that as an indication that a
given tree can be reallocated.

> I hope that somehow we can transition to ignoring the firmware - because 
> firmware and native enumeration need to be mutually exclusive, and we 
> need native enumeration for PCI hotplug. If anybody has any ideas how, I 
> would love to hear.

We'll probably have to live with an "in-between" forever on x86 and
maybe arm64, but with some luck, the static devices will only be the
on-board stuff, and we can go wild below bridges...

BTW: I'd like us to discuss that f2f at Plumbers in a miniconf if
enough of us can go.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Multitude of resource assignment functions
  2019-06-27  7:40             ` Nicholas Johnson
  2019-06-27  8:48               ` Benjamin Herrenschmidt
@ 2019-06-27 16:35               ` Logan Gunthorpe
  2019-06-27 20:26                 ` Benjamin Herrenschmidt
  2019-06-30  2:57                 ` Nicholas Johnson
  1 sibling, 2 replies; 21+ messages in thread
From: Logan Gunthorpe @ 2019-06-27 16:35 UTC (permalink / raw)
  To: Nicholas Johnson; +Cc: Benjamin Herrenschmidt, Bjorn Helgaas, linux-pci



On 2019-06-27 1:40 a.m., Nicholas Johnson wrote:
> On Mon, Jun 24, 2019 at 10:45:17AM -0600, Logan Gunthorpe wrote:
>>
>>
>> On 2019-06-24 3:13 a.m., Benjamin Herrenschmidt wrote:
>>> So I'm staring at these three mostly at this point:
>>>
>>> void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
>>> void pci_assign_unassigned_bridge_resources(struct pci_dev *bridge)
>>> void pci_assign_unassigned_bus_resources(struct pci_bus *bus)
>>>
>>> Now we have 3 functions that fundamentally have the same purpose,
>>> assign what was left unassigned down a PCI hierarchy, but are going
>>> about it in quite a different manner.
>>>
>>> Now to make things worse, there's little consistency in which one gets
>>> called where. We have PCI controllers calling the first one sometimes,
>>> the last one sometimes, or doing the manual:
>>>
>>> 	pci_bus_size_bridges(bus);
>>> 	pci_bus_assign_resources(bus);
>>>
>>> Or variants with pci_bus_size_bridges sometimes missing etc...
>>
>> I suspect there isn't much rhyme or reason to it. None of this is well
>> documented so developers writing the controller drivers probably didn't
>> have a good idea of what the correct thing to do was, and just stuck
>> with the first thing that worked.
>>
>>> Now I've consolidated a lot of that and removed all of those "manual"
>>> cases in my work-in-progress branch, but I'd like to clarify and
>>> possibly remove the 3 ones above.
>>>
>>> Let's start with the last one, pci_assign_unassigned_bus_resources, as
>>> it's the easiest to remove from users in drivers/pci/controller/* (and
>>> replace with pci_assign_unassigned_root_bus_resources typically).
>>>
>>> This leaves it used in a couple of corner cases, most of them I think
>>> I can kill, and .... sysfs 'rescan'.
>>>
>>> The interesting thing about that function is that it tries to avoid
>>> resizing the bridge of the bus passed as an argument, it will only
>>> resize subordinate bridges. From the changelog it was created for
>>> hotplug bridges, but almost none uses it (some powerpc stuff I can
>>> probably kill) ... and sysfs rescan.
>>>
>>> I wonder what's the remaining purpose of it. sysfs rescan could
>>> probably be cleaned up to use the two first... Also why avoid resizing
>>> the bridge itself ?
>>>
>>> That leads to the difference between
>>> pci_assign_unassigned_root_bus_resources()
>>> and pci_assign_unassigned_bridge_resources().
>>>
>>> The names are misleading. The former isn't just about the root bus
>>> resources. It's about the entire tree underneath the root bus.
>>>
>>> The main difference that I can tell are:
>>>
>>>  - pci_assign_unassigned_root_bus_resources() may or may not try to
>>> realloc, depending on a combination of command line args, config
>>> option, presence of IOV devices etc... while
>>> pci_assign_unassigned_bridge_resources() always will
>>>
>>>  - pci_assign_unassigned_bridge_resources() will call
>>> pci_bridge_distribute_available_resources() to distribute resource to
>>> child hotplug bridges, while pci_assign_unassigned_root_bus_resources()
>>> won't.
>>>
>>> Now, are we 100% confident we want to keep those discrepancies ?
>>>
>>> It feels like the former function is intended for boot time resource
>>> allocation, and the latter for hotplug, but I can't make sense of why
>>> the resources of a device behind a hotplug bridge should be allocated
>>> differently depending on whether that device was plugged at boot or
>>> plugged later.
>>
>> I don't really know, but I kind of assumed reallocing any time but early
>> in boot would be dangerous. It involves un-assigning a bunch of
>> resources without any real check to see if a driver is using them or
>> not. If they were being used by a driver (which is typical) and they
>> were reassigned, everything would break.
>>
>> I mean, in theory the code could/should be the same for both paths and
>> it could just make a single, better decision on whether to realloc or
>> not. But that's going to be challenging to get there.
>>
>>> Also why not distribute available resources at boot between top level
>>> hotplug bridges ?
>>>
>>> I'm not even going into the question of why the resource
>>> sizing/assignment code is so obscure/cryptic/incomprehensible, that's
>>> another kettle of fish, but I'd like to at least clarify the usage
>>> patterns a bit better.
>> I got the impression the code was designed to generally let the firmware
>> set things up -- it just fixed things up if the firmware messed it up
>> somehow. My guess would be it evolved out of a bunch of hacks designed
>> to fix broken bioses into something new platforms used to do full
>> enumeration (because it happened to work).
> Unfortunately, the operating system is designed to let the firmware do 
> things. In my mind, ACPI should not need to exist, and the operating 
> system should start with a clean state with PCI and re-enumerate 
> everything at boot time. The PCI allocation is so broken and 
> inconsistent (as you have noted) because it tries to combine the two, 
> when firmware enumeration and native enumeration should be mutually 
> exclusive. I have attempted to re-write large chunks of probe.c, pci.c 
> and setup-bus.c to completely disregard firmware enumeration and clean 
> everything up. Unfortunately, I get stuck in probe.c with the double 
> recursive loop which assigns bus numbers - I cannot figure out how to 
> re-write it successfully. Plus, I feel like nobody will be ready for 
> such a drastic change - I am having trouble selling minor changes that 
> fix actual use cases, as opposed to code reworking.

My worry would be if the firmware depends on any of those PCI resources
for any of its calls. For example, laptop firmware often has specific
code for screen blanking/dimming when the special buttons are pressed.
If it implements this by communicating with a PCI device then the kernel
will break things by reassigning all the addresses.

However, having a kernel parameter to ignore the firmware choices might
be a good way for us to start testing whether this is a problem or not
on some systems.

Logan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window]
  2019-06-27  7:50   ` Nicholas Johnson
@ 2019-06-27 16:54     ` Logan Gunthorpe
  0 siblings, 0 replies; 21+ messages in thread
From: Logan Gunthorpe @ 2019-06-27 16:54 UTC (permalink / raw)
  To: Nicholas Johnson; +Cc: benh, Bjorn Helgaas, linux-pci



On 2019-06-27 1:50 a.m., Nicholas Johnson wrote:
> On Wed, Jun 19, 2019 at 10:21:21AM -0600, Logan Gunthorpe wrote:
>> *(cc'd back Bjorn and the list)
>>
>> On 2019-06-19 8:00 a.m., Nicholas Johnson wrote:
>>> Hi Ben and Logan,
>>>
>>> It looks like my git send-email has been not working correctly since I
>>> started trying to get these patches accepted. I may have remedied this
>>> now, but I have seen that Logan tried to find these patches and failed.
>>> So as a courtesy until I post PATCH v7 (hopefully correctly, this time),
>>> I am forwarding you the patches. I hope you like them. I would love to 
>>> know of any concerns or questions you may have, and / or what happens if 
>>> you test them. Thanks and all the best!
>>>
>>> ----- Forwarded message from Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au> -----
>>>
>>> Date: Thu, 23 May 2019 06:29:27 +0800
>>> From: Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
>>> To: linux-kernel@vger.kernel.org
>>> Cc: linux-pci@vger.kernel.org, bhelgaas@google.com, mika.westerberg@linux.intel.com, corbet@lwn.net, Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
>>> Subject: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window
>>> X-Mailer: git-send-email 2.19.1
>>>
>>> Background
>>> ==========================================================================
>>>
>>> Solve bug report:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=203243
>>
>> This is all kinds of confusing... the bug report just seems to be a copy
>> of the patch set. The description of the actual symptoms of the problem
>> appears to be missing from all of it.
>>
>>> Currently, the kernel can sometimes assign the MMIO_PREF window
>>> additional size into the MMIO window, resulting in double the MMIO
>>> additional size, even if the MMIO_PREF window was successful.
>>>
>>> This happens if in the first pass, the MMIO_PREF succeeds but the MMIO
>>> fails. In the next pass, because MMIO_PREF is already assigned, the
>>> attempt to assign MMIO_PREF returns an error code instead of success
>>> (nothing more to do, already allocated).
>>>
>>> Example of problem (more context can be found in the bug report URL):
>>>
>>> Mainline kernel:
>>> pci 0000:06:01.0: BAR 14: assigned [mem 0x90100000-0xa00fffff] = 256M
>>> pci 0000:06:04.0: BAR 14: assigned [mem 0xa0200000-0xb01fffff] = 256M
>>>
>>> Patched kernel:
>>> pci 0000:06:01.0: BAR 14: assigned [mem 0x90100000-0x980fffff] = 128M
>>> pci 0000:06:04.0: BAR 14: assigned [mem 0x98200000-0xa01fffff] = 128M
>>>
>>> This was using pci=realloc,hpmemsize=128M,nocrs - on the same machine
>>> with the same configuration, with a Ubuntu mainline kernel and a kernel
>>> patched with this patch series.
>>>
>>> This patch is vital for the next patch in the series. The next patch
>>> allows the user to specify MMIO and MMIO_PREF independently. If the
>>> MMIO_PREF is set to be very large, this bug will end up more than
>>> doubling the MMIO size. The bug results in the MMIO_PREF being added to
>>> the MMIO window, which means doubling if MMIO_PREF size == MMIO size.
>>> With a large MMIO_PREF, without this patch, the MMIO window will likely
>>> fail to be assigned altogether due to lack of 32-bit address space.
>>>
>>> Patch notes
>>> ==========================================================================
>>>
>>> Change find_free_bus_resource() to not skip assigned resources with
>>> non-null parent.
>>>
>>> Add checks in pbus_size_io() and pbus_size_mem() to return success if
>>> resource returned from find_free_bus_resource() is already allocated.
>>>
>>> This avoids pbus_size_io() and pbus_size_mem() returning error code to
>>> __pci_bus_size_bridges() when a resource has been successfully assigned
>>> in a previous pass. This fixes the existing behaviour where space for a
>>> resource could be reserved multiple times in different parent bridge
>>> windows. This also greatly reduces the number of failed BAR messages in
>>> dmesg when Linux assigns resources.
>>
>> This patch looks like the same bug that I tracked down earlier but I
>> solved in a slightly different way. See this patch[1] which is still
>> under review. Can you maybe test it and see if it solves the same problem?
>>
>> Thanks,
>>
>> Logan
>>
>> [1]
>> https://lore.kernel.org/lkml/20190531171216.20532-2-logang@deltatee.com/T/#u
> [1] says Reported-by: Kit Chow, but I cannot find the bug report on 
> bugzilla.kernel.org - should I be linking the bug reports into my 
> version of this patch in case it is accepted?

Reported-by doesn't indicate that a bug report is on bugzilla. In fact I
don't think too many people rely on bugzilla. It's more of a place where
triage is done to send reporters to the appropriate mailing lists.

> Bjorn never replied to my queries about which should be accepted and 
> what I should do either way. For now I am moving my version of this 
> patch to the end of my series so that it can easily be knocked off if 
> Bjorn prefers your patch.

I'm sure he'll get to it eventually. Probably just seeing where some of
the discussion is leading.

Logan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Multitude of resource assignment functions
  2019-06-27 16:35               ` Logan Gunthorpe
@ 2019-06-27 20:26                 ` Benjamin Herrenschmidt
  2019-06-30  2:57                 ` Nicholas Johnson
  1 sibling, 0 replies; 21+ messages in thread
From: Benjamin Herrenschmidt @ 2019-06-27 20:26 UTC (permalink / raw)
  To: Logan Gunthorpe, Nicholas Johnson; +Cc: Bjorn Helgaas, linux-pci

On Thu, 2019-06-27 at 10:35 -0600, Logan Gunthorpe wrote:
> My worry would be if the firmware depends on any of those PCI resources
> for any of its calls. For example, laptop firmware often has specific
> code for screen blanking/dimming when the special buttons are pressed.
> If it implements this by communicating with a PCI device then the kernel
> will break things by reassigning all the addresses.
> 
> However, having a kernel parameter to ignore the firmware choices might
> be a good way for us to start testing whether this is a problem or not
> on some systems

As I consolidate that across archs I can add such a parameter... I
haven't quite folded x86 in yet, but I'm hoping I'll be able to do so
soon. I plan to move some of those x86-specific kernel parameters into
generic code while doing so. I can add this one.

Cheers,
Ben.

* Re: Multitude of resource assignment functions
  2019-06-27  8:48               ` Benjamin Herrenschmidt
@ 2019-06-30  2:40                 ` Nicholas Johnson
  0 siblings, 0 replies; 21+ messages in thread
From: Nicholas Johnson @ 2019-06-30  2:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Logan Gunthorpe, Bjorn Helgaas, linux-pci

Thank you for the reply. I have been mulling it over for a while.

On Thu, Jun 27, 2019 at 06:48:35PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2019-06-27 at 07:40 +0000, Nicholas Johnson wrote:
> > Unfortunately, the operating system is designed to let the firmware do 
> > things. In my mind, ACPI should not need to exist, and the operating 
> > system should start with a clean state with PCI and re-enumerate 
> > everything at boot time. The PCI allocation is so broken and 
> > inconsistent (as you have noted) because it tries to combine the two, 
> > when firmware enumeration and native enumeration should be mutually 
> > exclusive. I have attempted to re-write large chunks of probe.c, pci.c 
> > and setup-bus.c to completely disregard firmware enumeration and clean 
> > everything up. Unfortunately, I get stuck in probe.c with the double 
> > recursive loop which assigns bus numbers - I cannot figure out how to 
> > re-write it successfully. Plus, I feel like nobody will be ready for 
> > such a drastic change - I am having trouble selling minor changes that 
> > fix actual use cases, as opposed to code reworking.
> 
> Well... so a lot of platforms are happy to do a full re-assignment,
> though they use the current code today, which leads to rather
> substandard results when it comes to hotplug bridges.
> 
> All the embedded platforms today are like that, and all of ARM64, though
> the latter will somewhat change; all DT-based ARM64 will probably
> remain that way.
> 
> > My next proposal might be a kernel parameter for PCI to set various 
> > levels of disregard for firmware
> 
> Well, at least ACPI has this _DSM #5 thingy that can tell us that we
> are allowed to disregard firmware for selected bits and pieces
> (hopefully that tends to be whole hierarchies but I don't know how well
> it's used in practice).
I will need to find out more about this - can you suggest any 
particularly good resources on learning about ACPI?

> 
> > , from none to complete, which can be 
> > added to incrementally to do more and more (rather than all in one patch 
> > series).
> 
> So there are a number of reasons to honor what the firmware did.
> 
> First, today (but that's fixable), we suck at setting up reasonable
> space for hotplug by default.
What annoys me more is that the BIOS vendors:

a) Don't provide a means to configure this in the BIOS, and if they do, 
it is hidden options which require you to re-flash the BIOS or use the 
dumped IFRs and an EFI shell to modify the variables.

b) Even the few motherboards with the Thunderbolt options available 
without resorting to (a) have them limited to 4096M.

c) Motherboards are still cramming us into the 32-bit address space in 
case somebody is still using a 32-bit OS. There is the "above 4G 
decoding" option available on most motherboards, but I am not sure 
whether it completely fixes the issue. Given that Microsoft said you 
need Windows 10 to run on the latest hardware, I do not see many people 
using a 32-bit OS on the latest hardware.

d) These options are especially needed because Windows cannot override 
anything whatsoever - not even _OSC, the way pcie_ports=native can on 
Linux.

> 
> But there are more insidious ones. There are platforms where you can't
> move things (typically virtualized platforms with specific hypervisors,
> such as IBM pseries).
I cannot argue with this.

> 
> There are platforms where the *runtime* firmware (SMM or equivalent or
> even ACPI AML bits) will be poking at some system devices and those
> really must not be moved. (In fact there's a theoretical problem with
> such devices becoming temporarily inaccessible during BAR sizing today
> but we mostly get lucky).
I think SMM is a nasty back door. Unfortunately the precedent set is 
that the firmware makers can do what they want and we are expected to 
honour that in the kernel. In an ideal world, it would default to the OS 
assigning things, and the firmware vendors would get the blame when 
things break if they insist on using runtime firmware.

In my ideal world, motherboards would have the absolute bare minimum in 
BIOS to initialise DRAM and the tricky stuff, and then boot a CoreBoot 
Linux kernel off a MicroSD slot on the board. This could easily be 
updated constantly (for example, to add NVMe support to old boards) and 
it would be impossible to brick the motherboard by changing this, as the 
SD card could be removed and restored.

This would fix the following:
- No further need for PCI option ROMs and their security issues
- Open-source / free firmware
- No firmware updates needed to add NVMe boot support
- Allow the target OS booted with kexec to assign resources as required
- Set up the IOMMU for Thunderbolt (and all DMA ports) at boot time 
without special BIOS updates required
- Etc.

I am sure there are problems with what I am saying, but I do find it 
frustrating how unable the industry is to move on from legacy.

When you have an arch, you expect that the same bytecode will run on the 
next system with that same arch. I don't understand why it stops there - 
I believe two systems of the same arch should be indistinguishable, 
without all of the firmware differences, and I hope to influence this 
during my career.

> 
> There are other "interesting" cases, like EFI giving us the framebuffer
> address to use if we don't have a native driver... which happens to be
> off a PCI BAR somewhere. Now we *could* probably try to special case
> that and detect when we move that BAR but today we'll probably break if
> we move it.
Also fixed by CoreBoot, which would carry the Linux kernel and all the 
drivers - no need for legacy services like this.

> 
> x86 historically has other nasty "hidden" devices. There are historical
> cases of devices that break if they move after initial setup, etc...
> Most of these things are ancient but we have to ensure we keep today's
> policy for old platforms at least.
Sometimes I think that we need a fork of Linux. Although that would be 
the same as saying "for old systems, support ends on this kernel version 
and you are unlikely to need the new features of the latest kernels on 
the oldest hardware". They did drop support for some older x86 CPUs 
recently, I believe.

> 
> >  This can supercede pci=realloc. The realloc command is so 
> > broken because once the system has loaded drivers, it becomes next to 
> > impossible to free and reallocate a resource to fit another device in - 
> > because it will upset existing devices. The realloc command is only 
> > useful in early boot because nothing is yet assigned, so it works. 
> > However, the same effect can be achieved by releasing all the resources 
> > on the root port before anything happens. I think it was 
> > pci_assign_unassigned_resources(), and I did verify this experimentally. 
> > This switch could be part of such a new kernel parameter to ignore 
> > firmware influence on PCI.
> 
> We should see what ACPI gives us in _DSM #5 on x86 these days.. if it's
> meaningful on enough machines we could use that as an indication that a
> given tree can be reallocated.
> 
> > I hope that somehow we can transition to ignoring the firmware - because 
> > firmware and native enumeration need to be mutually exclusive, and we 
> > need native enumeration for PCI hotplug. If anybody has any ideas how, I 
> > would love to hear.
> 
> We'll probably have to live with an "in-between" forever on x86 and
> maybe arm64, but with some luck, the static devices will only be the
> on-board stuff, and we can go wild below bridges...
The rest was just speculation and thoughts. My real question here is: 
what path do we have towards modernisation? We cannot replace the PCI 
code to handle everything natively and disregard the firmware for modern 
architectures like the emerging RISC-V, because that code will screw up 
x86. So do we have to have pci-old and pci-new subsystems which can be 
selected by each arch?

> 
> BTW: I'd like us to discuss that f2f at Plumbers in a miniconf if
> enough of us can go.
Please explain this as I have no idea what f2f, Plumbers and miniconf 
are.

Cheers,
Nicholas

> 
> Cheers,
> Ben.
> 
> 


* Re: Multitude of resource assignment functions
  2019-06-27 16:35               ` Logan Gunthorpe
  2019-06-27 20:26                 ` Benjamin Herrenschmidt
@ 2019-06-30  2:57                 ` Nicholas Johnson
  2019-07-01  4:33                   ` Oliver O'Halloran
  2019-07-02 21:39                   ` Bjorn Helgaas
  1 sibling, 2 replies; 21+ messages in thread
From: Nicholas Johnson @ 2019-06-30  2:57 UTC (permalink / raw)
  To: Logan Gunthorpe; +Cc: Benjamin Herrenschmidt, Bjorn Helgaas, linux-pci

On Thu, Jun 27, 2019 at 10:35:12AM -0600, Logan Gunthorpe wrote:
> 
> 
> On 2019-06-27 1:40 a.m., Nicholas Johnson wrote:
> > On Mon, Jun 24, 2019 at 10:45:17AM -0600, Logan Gunthorpe wrote:
> >>
> >>
> >> On 2019-06-24 3:13 a.m., Benjamin Herrenschmidt wrote:
> >>> So I'm staring at these three mostly at this point:
> >>>
> >>> void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
> >>> void pci_assign_unassigned_bridge_resources(struct pci_dev *bridge)
> >>> void pci_assign_unassigned_bus_resources(struct pci_bus *bus)
> >>>
> >>> Now we have 3 functions that fundamentally have the same purpose,
> >>> assign what was left unassigned down a PCI hierarchy, but are going
> >>> about it in quite a different manner.
> >>>
> >>> Now to make things worse, there's little consistency in which one gets
> >>> called where. We have PCI controllers calling the first one sometimes,
> >>> the last one sometimes, or doing the manual:
> >>>
> >>> 	pci_bus_size_bridges(bus);
> >>> 	pci_bus_assign_resources(bus);
> >>>
> >>> Or variants with pci_bus_size_bridges sometimes missing etc...
> >>
> >> I suspect there isn't much rhyme or reason to it. None of this is well
> >> documented so developers writing the controller drivers probably didn't
> >> have a good idea of what the correct thing to do was, and just stuck
> >> with the first thing that worked.
> >>
> >>> Now I've consolidated a lot of that and removed all of those "manual"
> >>> cases in my work-in-progress branch, but I'd like to clarify and
> >>> possibly remove the 3 ones above.
> >>>
> >>> Let's start with the last one, pci_assign_unassigned_bus_resources, as
> >>> it's the easiest to remove from users in drivers/pci/controller/* (and
> >>> replace with pci_assign_unassigned_root_bus_resources typically).
> >>>
> >>> This leaves it used in a couple of corner cases, most of them I think
> >>> I can kill, and .... sysfs 'rescan'.
> >>>
> >>> The interesting thing about that function is that it tries to avoid
> >>> resizing the bridge of the bus passed as an argument, it will only
> >>> resize subordinate bridges. From the changelog it was created for
> >>> hotplug bridges, but almost none uses it (some powerpc stuff I can
> >>> probably kill) ... and sysfs rescan.
> >>>
> >>> I wonder what's the remaining purpose of it. sysfs rescan could
> >>> probably be cleaned up to use the two first... Also why avoid resizing
> >>> the bridge itself ?
> >>>
> >>> That leads to the difference between
> >>> pci_assign_unassigned_root_bus_resources()
> >>> and pci_assign_unassigned_bridge_resources().
> >>>
> >>> The names are misleading. The former isn't just about the root bus
> >>> resources. It's about the entire tree underneath the root bus.
> >>>
> >>> The main difference that I can tell are:
> >>>
> >>>  - pci_assign_unassigned_root_bus_resources() may or may not try to
> >>> realloc, depending on a combination of command line args, config
> >>> option, presence of IOV devices etc... while
> >>> pci_assign_unassigned_bridge_resources() always will
> >>>
> >>>  - pci_assign_unassigned_bridge_resources() will call
> >>> pci_bridge_distribute_available_resources() to distribute resource to
> >>> child hotplug bridges, while pci_assign_unassigned_root_bus_resources()
> >>> won't.
> >>>
> >>> Now, are we 100% confident we want to keep those discrepancies ?
> >>>
> >>> It feels like the former function is intended for boot time resource
> >>> allocation, and the latter for hotplug, but I can't make sense of why
> >>> the resources of a device behind a hotplug bridge should be allocated
> >>> differently depending on whether that device was plugged at boot or
> >>> plugged later.
> >>
> >> I don't really know, but I kind of assumed reallocing any time but early
> >> in boot would be dangerous. It involves un-assigning a bunch of
> >> resources without any real check to see if a driver is using them or
> >> not. If they were being used by a driver (which is typical) and they
> >> were reassigned, everything would break.
> >>
> >> I mean, in theory the code could/should be the same for both paths and
> >> it could just make a single, better decision on whether to realloc or
> >> not. But that's going to be challenging to get there.
> >>
> >>> Also why not distribute available resources at boot between top level
> >>> hotplug bridges ?
> >>>
> >>> I'm not even going into the question of why the resource
> >>> sizing/assignment code is so obscure/cryptic/incomprehensible, that's
> >>> another kettle of fish, but I'd like to at least clarify the usage
> >>> patterns a bit better.
> >> I got the impression the code was designed to generally let the firmware
> >> set things up -- it just fixed things up if the firmware messed it up
> >> somehow. My guess would be it evolved out of a bunch of hacks designed
> >> to fix broken bioses into something new platforms used to do full
> >> enumeration (because it happened to work).
> > Unfortunately, the operating system is designed to let the firmware do 
> > things. In my mind, ACPI should not need to exist, and the operating 
> > system should start with a clean state with PCI and re-enumerate 
> > everything at boot time. The PCI allocation is so broken and 
> > inconsistent (as you have noted) because it tries to combine the two, 
> > when firmware enumeration and native enumeration should be mutually 
> > exclusive. I have attempted to re-write large chunks of probe.c, pci.c 
> > and setup-bus.c to completely disregard firmware enumeration and clean 
> > everything up. Unfortunately, I get stuck in probe.c with the double 
> > recursive loop which assigns bus numbers - I cannot figure out how to 
> > re-write it successfully. Plus, I feel like nobody will be ready for 
> > such a drastic change - I am having trouble selling minor changes that 
> > fix actual use cases, as opposed to code reworking.
> 
> My worry would be if the firmware depends on any of those PCI resources
> for any of its calls. For example, laptop firmware often has specific
> code for screen blanking/dimming when the special buttons are pressed.
> If it implements this by communicating with a PCI device then the kernel
> will break things by reassigning all the addresses.
> 
> However, having a kernel parameter to ignore the firmware choices might
> be a good way for us to start testing whether this is a problem or not
> on some systems
> 
> Logan
If Bjorn also agrees then I will give it a shot when I have finished 
with Thunderbolt.

Some other related thoughts:

- Should pci=noacpi imply pci=nocrs? It does not appear to, and I feel 
like it should, as CRS is part of ACPI and relates to PCI.

- Does anybody know why with pci=noacpi, you get dmesg warnings about 
cannot find PCI int A mapping - but they do not seem to cause the 
devices any issues in functioning? Is it because they are using MSI?

- Does pci=ignorefw sound good for a future proposal?

- Modern arches could give this option by default if they want 
everything done by the OS. Although this would not be nearly as nice as 
a code overhaul or branching out pci into pci-old and pci-new.

- Thunderbolt has given me a glimmer of hope. It used to be so tightly 
integrated into the system firmware and add-in cards were not even 
detectable without it (you need to hit up the pcie2tbt mailbox in the 
BIOS to wake the controller up, for a start). It would not even show 
without ACPI running. Now I can use pci=noacpi with this patch series 
and work happily with Thunderbolt.

- I have not given the iommu or intel_iommu parameters, but I am getting 
DMAR faults (probably because I am using pci=noacpi), even though 
normally the DMAR does not come on if you do not ask it to. Is there 
perhaps something recently added to do with Thunderbolt that is 
activating it? I understand that, regardless, DMAR does not work well 
without ACPI. The main two tables I care about are DMAR and MADT 
(multiple processors); otherwise, I would disable ACPI altogether.

Cheers,
Nicholas


* Re: Multitude of resource assignment functions
  2019-06-30  2:57                 ` Nicholas Johnson
@ 2019-07-01  4:33                   ` Oliver O'Halloran
  2019-07-02 21:39                   ` Bjorn Helgaas
  1 sibling, 0 replies; 21+ messages in thread
From: Oliver O'Halloran @ 2019-07-01  4:33 UTC (permalink / raw)
  To: Nicholas Johnson
  Cc: Logan Gunthorpe, Benjamin Herrenschmidt, Bjorn Helgaas, linux-pci

On Sun, Jun 30, 2019 at 12:58 PM Nicholas Johnson
<nicholas.johnson-opensource@outlook.com.au> wrote:
>
> *snip*
>
> Some other related thoughts:

> - Should pci=noacpi imply pci=nocrs? It does not appear to, and I feel
> like it should, as CRS is part of ACPI and relates to PCI.

Are you sure about that? There has been explicit support for CRS in
the PCIe Base spec since gen1.

> - Modern arches could give this option by default if they want
> everything done by the OS. Although this would not be nearly as nice as
> a code overhaul or branching out pci into pci-old and pci-new.

This is only really nice if you can use the new code 100% of the time.
If we're keeping around the old code then it's still going to be a
maintenance burden since there's no shortage of platforms with odd
firmware requirements.

> - I have not given iommu or intel_iommu parameters but I am getting DMAR
> faults (probably because I am using pci=noacpi) but normally the DMAR
> does not come on if you do not ask it to. Is there perhaps something
> recently added to do with Thunderbolt that is activating it? I
> understand that regardless, DMAR does not work well without ACPI. The
> main two I care about are DMAR and MADT (multiple processors) tables and
> otherwise, I would disable ACPI altogether.

IIRC the IOMMU can be forced on for anything behind a Thunderbolt
controller since external devices aren't necessarily trustworthy. You
might be hitting that.

>
> Cheers,
> Nicholas


* Re: Multitude of resource assignment functions
  2019-06-30  2:57                 ` Nicholas Johnson
  2019-07-01  4:33                   ` Oliver O'Halloran
@ 2019-07-02 21:39                   ` Bjorn Helgaas
  2019-07-03 13:43                     ` Nicholas Johnson
  1 sibling, 1 reply; 21+ messages in thread
From: Bjorn Helgaas @ 2019-07-02 21:39 UTC (permalink / raw)
  To: Nicholas Johnson; +Cc: Logan Gunthorpe, Benjamin Herrenschmidt, linux-pci

On Sun, Jun 30, 2019 at 02:57:37AM +0000, Nicholas Johnson wrote:

> - Should pci=noacpi imply pci=nocrs? It does not appear to, and I feel 
> like it should, as CRS is part of ACPI and relates to PCI.

"pci=noacpi" means "Do not use ACPI for IRQ routing or for PCI
scanning."

"pci=nocrs" means "Ignore PCI host bridge windows from ACPI."  If we
ignore _CRS, we have no idea what the PCI host bridge apertures are,
so we cannot allocate resources for devices on the root bus.

The "Do not use ACPI for ... PCI scanning" part indeed does suggest
that "pci=noacpi" could imply "pci=nocrs", but I don't think there's
anything to be gained by changing that now.

We probably *should* remove "or for PCI scanning" from the
documentation, because "pci=noacpi" only affects IRQs.

The only reason these exist at all is as a debugging aid to
temporarily work around issues in firmware or Linux until we can
develop a real fix or quirk that works without the user specifying a
kernel parameter.

> - Does anybody know why with pci=noacpi, you get dmesg warnings about 
> cannot find PCI int A mapping - but they do not seem to cause the 
> devices any issues in functioning? Is it because they are using MSI?

I doubt it.  I think you're just lucky.  In general the information
from _PRT and _CRS is essential for correct operation.

> - Does pci=ignorefw sound good for a future proposal?

No, at least not without more description of what this would
accomplish.

It sounds like you would want this to turn off _PRT, _CRS, and other
information from ACPI.  You may not like ACPI, but that information is
there for good reason, and if we didn't get it from ACPI we would have
to get it from somewhere else.

There is always "acpi=off" if you just don't want ACPI at all.

Bjorn


* Re: Multitude of resource assignment functions
  2019-07-02 21:39                   ` Bjorn Helgaas
@ 2019-07-03 13:43                     ` Nicholas Johnson
  2019-07-03 14:19                       ` Bjorn Helgaas
  2019-07-03 22:54                       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 21+ messages in thread
From: Nicholas Johnson @ 2019-07-03 13:43 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Logan Gunthorpe, Benjamin Herrenschmidt, linux-pci

On Tue, Jul 02, 2019 at 04:39:51PM -0500, Bjorn Helgaas wrote:
> On Sun, Jun 30, 2019 at 02:57:37AM +0000, Nicholas Johnson wrote:
> 
> > - Should pci=noacpi imply pci=nocrs? It does not appear to, and I feel 
> > like it should, as CRS is part of ACPI and relates to PCI.
> 
> "pci=noacpi" means "Do not use ACPI for IRQ routing or for PCI
> scanning."
> 
> "pci=nocrs" means "Ignore PCI host bridge windows from ACPI."  If we
> ignore _CRS, we have no idea what the PCI host bridge apertures are,
> so we cannot allocate resources for devices on the root bus.
But I use pci=nocrs (it is non-negotiable for assigning massive 
MMIO_PREF with kernel parameters) and it does work. If I use pci=nocrs, 
then the whole physical address range of the CPU goes to the root 
complex (for example, 39 physical address bits on quad-core Intel 
gives 512G). I am guessing that the OS makes sure that when assigning root 
port windows, we do not clobber the physical RAM so that any RAM 
addresses pass straight through the root complex. I have never had funny 
crashes that would make me think I have clobbered the RAM with nocrs. If 
I push the limits then it fails to assign root port resources as 
expected. Usually I assign 64G size to each Thunderbolt port for total 
of 256G over four ports. It is total overkill but it gives me 
satisfaction to know that the firmware is definitely not in control and 
that if it is needed, it can be requested. For a production system, I 
would likely tone it down a little.
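The 512G figure is just address-width arithmetic; a quick check (the 39-bit width and the per-port sizes are taken from the paragraph above):

```python
GiB = 1 << 30

phys_space = 1 << 39          # 39 physical address bits
assert phys_space == 512 * GiB

per_port = 64 * GiB           # MMIO_PREF assigned per Thunderbolt port
total = 4 * per_port          # four ports
assert total == 256 * GiB
assert total < phys_space     # fits, with room left for RAM and other windows
print(total // GiB)           # 256
```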

> 
> The "Do not use ACPI for ... PCI scanning" part indeed does suggest
> that "pci=noacpi" could imply "pci=nocrs", but I don't think there's
> anything to be gained by changing that now.
> 
> We probably *should* remove "or for PCI scanning" from the
> documentation, because "pci=noacpi" only affects IRQs.
> 
> The only reason these exist at all is as a debugging aid to
> temporarily work around issues in firmware or Linux until we can
> develop a real fix or quirk that works without the user specifying a
> kernel parameter.
> 
> > - Does anybody know why with pci=noacpi, you get dmesg warnings about 
> > cannot find PCI int A mapping - but they do not seem to cause the 
> > devices any issues in functioning? Is it because they are using MSI?
> 
> I doubt it.  I think you're just lucky.  In general the information
> from _PRT and _CRS is essential for correct operation.
Strange, because there are dozens of these warnings on multiple 
computers and heaps of devices on Thunderbolt. If the BARs are assigned 
then they work, every time, no questions asked. Maybe this suggests that 
Thunderbolt is somehow exempt. Perhaps the controller has kept 
configuration from the firmware setup and everything behind it does not 
care.

> 
> > - Does pci=ignorefw sound good for a future proposal?
> 
> No, at least not without more description of what this would
> accomplish.
I have not given it much time and thought but basically it will be 
something that can be added to incrementally. I would start with it 
implying nocrs and releasing all root complex resources at boot before 
the initial scan. That way we can see if the particular platform cares 
if we do everything in the kernel.

> 
> It sounds like you would want this to turn off _PRT, _CRS, and other
> information from ACPI.  You may not like ACPI, but that information is
> there for good reason, and if we didn't get it from ACPI we would have
> to get it from somewhere else.
The nocrs is vital because the BIOS places pitiful space behind the root 
complex and assignment of large BARs will fail - hence why Xeon Phi 
coprocessors, with 8G or 16G BARs to map their whole RAM, are only 
supported on certain systems. I consider all BIOS / firmware to be 
broken at this time, especially with most still catering for a 32-bit OS 
that almost nobody uses. I know not everybody feels that way, but I am 
an idealist and aim to move things in the right direction.
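The arithmetic behind the Xeon Phi example (a sketch; only the 8G/16G BAR sizes come from the text above): a BAR at least as large as the entire 32-bit address space can never be placed below 4G, so firmware that only opens 32-bit windows cannot support such a device.

```python
GiB = 1 << 30

space_32bit = 4 * GiB             # everything addressable below 4G
bar_sizes = [8 * GiB, 16 * GiB]   # Xeon Phi aperture BARs mapping card RAM

for bar in bar_sizes:
    # BARs are naturally aligned and must fit entirely inside the bridge
    # window, so a BAR >= the whole 32-bit space cannot possibly fit
    assert bar >= space_32bit
print("neither BAR fits below 4G")
```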

I would accept ACPI if it were just a collection of tables, memory 
mapped like MMCONFIG. I know there are more complicated things that 
require bytecode to run (although I do assert my belief that it should 
be avoided if possible) but if the static tables were moved out of ACPI 
then in my mind, it would be progress.

Is there a reason why PCI SIG could not add a future extension where all 
of this information can be accessed with an extended MMCONFIG address 
range?

> 
> There is always "acpi=off" if you just don't want ACPI at all.
> 
> Bjorn
I am aware, and I will happily use that when there is a way to manually 
specify DMAR and MADT information. If you use acpi=off presently, you 
lose all but one CPU core and the use of the IOMMU. There used to be 
acpi=ht to disable ACPI for everything except for HyperThreading, but 
that was removed a long time ago - I do not know why.

The reason I often test like this is because it gives me reassurance 
that my code is not working by fluke on the particular system because of 
a firmware quirk. Also, Thunderbolt was deeply entrenched in ACPI 
before, so I am kind of over-compensating to make sure that there is no 
longer any unconditional dependency.

Nicholas


* Re: Multitude of resource assignment functions
  2019-07-03 13:43                     ` Nicholas Johnson
@ 2019-07-03 14:19                       ` Bjorn Helgaas
  2019-07-03 22:54                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 21+ messages in thread
From: Bjorn Helgaas @ 2019-07-03 14:19 UTC (permalink / raw)
  To: Nicholas Johnson; +Cc: Logan Gunthorpe, Benjamin Herrenschmidt, linux-pci

On Wed, Jul 03, 2019 at 01:43:52PM +0000, Nicholas Johnson wrote:
> On Tue, Jul 02, 2019 at 04:39:51PM -0500, Bjorn Helgaas wrote:
> > On Sun, Jun 30, 2019 at 02:57:37AM +0000, Nicholas Johnson wrote:
> > 
> > > - Should pci=noacpi imply pci=nocrs? It does not appear to, and I feel 
> > > like it should, as CRS is part of ACPI and relates to PCI.
> > 
> > "pci=noacpi" means "Do not use ACPI for IRQ routing or for PCI
> > scanning."
> > 
> > "pci=nocrs" means "Ignore PCI host bridge windows from ACPI."  If we
> > ignore _CRS, we have no idea what the PCI host bridge apertures are,
> > so we cannot allocate resources for devices on the root bus.
>
> But I use pci=nocrs (it is non-negotiable for assigning massive 
> MMIO_PREF with kernel parameters) and it does work. If I use pci=nocrs, 
> then the whole physical address range of the CPU goes to the root 
> complex (for example, 39-bit physical address lines on quad-core Intel 
> is 512G). I am guessing that the OS makes sure that when assigning root 
> port windows, we do not clobber the physical RAM so that any RAM 
> addresses pass straight through the root complex. I have never had funny 
> crashes that would make me think I have clobbered the RAM with nocrs. If 
> I push the limits then it fails to assign root port resources as 
> expected. Usually I assign 64G size to each Thunderbolt port for total 
> of 256G over four ports. It is total overkill but it gives me 
> satisfaction to know that the firmware is definitely not in control and 
> that if it is needed, it can be requested. For a production system, I 
> would likely tone it down a little.

"pci=nocrs" happens to work on many machines, but the _CRS information
is definitely required on many others.  For example, on any machine
with multiple host bridges, we need to know the actual host bridge
apertures to correctly assign resources to hot-added devices.

> > The "Do not use ACPI for ... PCI scanning" part indeed does suggest
> > that "pci=noacpi" could imply "pci=nocrs", but I don't think there's
> > anything to be gained by changing that now.
> > 
> > We probably *should* remove "or for PCI scanning" from the
> > documentation, because "pci=noacpi" only affects IRQs.
> > 
> > The only reason these exist at all is as a debugging aid to
> > temporarily work around issues in firmware or Linux until we can
> > develop a real fix or quirk that works without the user specifying a
> > kernel parameter.
> > 
> > > - Does anybody know why with pci=noacpi, you get dmesg warnings about 
> > > cannot find PCI int A mapping - but they do not seem to cause the 
> > > devices any issues in functioning? Is it because they are using MSI?
> > 
> > I doubt it.  I think you're just lucky.  In general the information
> > from _PRT and _CRS is essential for correct operation.
>
> Strange, because there are dozens of these warnings on multiple 
> computers and heaps of devices on Thunderbolt. If the BARs are assigned 
> then they work, every time, no questions asked. Maybe this suggests that 
> Thunderbolt is somehow exempt. Perhaps the controller has kept 
> configuration from the firmware setup and everything behind it does not 
> care.

Thunderbolt is not exempt.  _PRT tells us where INTx wires from PCI
are connected.  On systems with multiple host bridges, there are
multiple sets of those wires.  Your many examples of systems where
things seem to work are not arguments for it being safe to ignore _PRT
and _CRS in general.

> > > - Does pci=ignorefw sound good for a future proposal?
> > 
> > No, at least not without more description of what this would
> > accomplish.
> I have not given it much time and thought, but basically it would be 
> something that can be extended incrementally. I would start with it 
> implying nocrs and releasing all root complex resources at boot before 
> the initial scan. That way we can see if the particular platform cares 
> if we do everything in the kernel.
> 
> > It sounds like you would want this to turn off _PRT, _CRS, and other
> > information from ACPI.  You may not like ACPI, but that information is
> > there for good reason, and if we didn't get it from ACPI we would have
> > to get it from somewhere else.
>
> The nocrs is vital because the BIOS places pitiful space behind the root 
> complex and will fail to assign large BARs - which is why Xeon Phi 
> coprocessors with 8G or 16G BARs that map their whole RAM are only 
> supported on certain systems. I consider all BIOS / firmware to be 
> broken at this time, especially with most still catering for 32-bit OS 
> that almost nobody uses. I know not everybody feels that way, but I am 
> an idealist and aim to move things in the right direction.

Fine.  You can boot with "pci=nocrs" all you want, but it's not safe
in general.

The problem of the BIOS not reporting enough space for the root
complex is a firmware defect: the host bridge _CRS should report all
the space routed to the bridge, and if it doesn't, that's a BIOS
bug.  In principle
Linux could work around that by reading the hardware registers that
control the host bridge apertures, but that would require Linux to
know how to program every host bridge of interest.  We don't have or
want that sort of code in Linux because it would be a huge maintenance
burden.

> I would accept ACPI if it were just a collection of tables, memory 
> mapped like MMCONFIG. I know there are more complicated things that 
> require bytecode to run (although I do assert my belief that it should 
> be avoided if possible) but if the static tables were moved out of ACPI 
> then in my mind, it would be progress.
> 
> Is there a reason why PCI SIG could not add a future extension where all 
> of this information can be accessed with an extended MMCONFIG address 
> range?

For one thing, we don't know where MMCONFIG space lives.  We learn
that from the static MCFG table.

Bjorn


* Re: Multitude of resource assignment functions
  2019-07-03 13:43                     ` Nicholas Johnson
  2019-07-03 14:19                       ` Bjorn Helgaas
@ 2019-07-03 22:54                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 21+ messages in thread
From: Benjamin Herrenschmidt @ 2019-07-03 22:54 UTC (permalink / raw)
  To: Nicholas Johnson, Bjorn Helgaas; +Cc: Logan Gunthorpe, linux-pci

On Wed, 2019-07-03 at 13:43 +0000, Nicholas Johnson wrote:
> The nocrs is vital because the BIOS places pitiful space behind the root 
> complex and will fail for assigning large BARs - hence why Xeon Phi 

Can you check what you get out of _DSM #5 behind these if it exists?

Cheers
Ben.



end of thread, other threads:[~2019-07-03 22:54 UTC | newest]

Thread overview: 21+ messages
     [not found] <SL2P216MB01874DFDDBDE49B935A9B1B380E50@SL2P216MB0187.KORP216.PROD.OUTLOOK.COM>
2019-06-19 16:21 ` [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window] Logan Gunthorpe
2019-06-20  0:44   ` Nicholas Johnson
2019-06-20  0:49     ` Logan Gunthorpe
2019-06-23  5:01       ` Nicholas Johnson
2019-06-24  9:13         ` Multitude of resource assignment functions Benjamin Herrenschmidt
2019-06-24 16:45           ` Logan Gunthorpe
2019-06-27  7:40             ` Nicholas Johnson
2019-06-27  8:48               ` Benjamin Herrenschmidt
2019-06-30  2:40                 ` Nicholas Johnson
2019-06-27 16:35               ` Logan Gunthorpe
2019-06-27 20:26                 ` Benjamin Herrenschmidt
2019-06-30  2:57                 ` Nicholas Johnson
2019-07-01  4:33                   ` Oliver O'Halloran
2019-07-02 21:39                   ` Bjorn Helgaas
2019-07-03 13:43                     ` Nicholas Johnson
2019-07-03 14:19                       ` Bjorn Helgaas
2019-07-03 22:54                       ` Benjamin Herrenschmidt
2019-06-20 13:43     ` [nicholas.johnson-opensource@outlook.com.au: [PATCH v6 3/4] PCI: Fix bug resulting in double hpmemsize being assigned to MMIO window] Bjorn Helgaas
2019-06-20 23:24       ` Benjamin Herrenschmidt
2019-06-27  7:50   ` Nicholas Johnson
2019-06-27 16:54     ` Logan Gunthorpe
