linux-pci.vger.kernel.org archive mirror
From: Bjorn Helgaas <helgaas@kernel.org>
To: Nicholas Johnson <nicholas.johnson-opensource@outlook.com.au>
Cc: Logan Gunthorpe <logang@deltatee.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>
Subject: Re: Multitude of resource assignment functions
Date: Wed, 3 Jul 2019 09:19:09 -0500
Message-ID: <20190703141909.GM128603@google.com>
In-Reply-To: <SL2P216MB01878623FC34BC4894EB495280FB0@SL2P216MB0187.KORP216.PROD.OUTLOOK.COM>

On Wed, Jul 03, 2019 at 01:43:52PM +0000, Nicholas Johnson wrote:
> On Tue, Jul 02, 2019 at 04:39:51PM -0500, Bjorn Helgaas wrote:
> > On Sun, Jun 30, 2019 at 02:57:37AM +0000, Nicholas Johnson wrote:
> > 
> > > - Should pci=noacpi imply pci=nocrs? It does not appear to, and I feel 
> > > like it should, as CRS is part of ACPI and relates to PCI.
> > 
> > "pci=noacpi" means "Do not use ACPI for IRQ routing or for PCI
> > scanning."
> > 
> > "pci=nocrs" means "Ignore PCI host bridge windows from ACPI."  If we
> > ignore _CRS, we have no idea what the PCI host bridge apertures are,
> > so we cannot allocate resources for devices on the root bus.
>
> But I use pci=nocrs (it is non-negotiable for assigning massive 
> MMIO_PREF with kernel parameters) and it does work. If I use pci=nocrs, 
> then the whole physical address range of the CPU goes to the root 
> complex (for example, 39-bit physical address lines on quad-core Intel 
> is 512G). I am guessing that the OS makes sure that when assigning root 
> port windows, we do not clobber the physical RAM so that any RAM 
> addresses pass straight through the root complex. I have never had funny 
> crashes that would make me think I have clobbered the RAM with nocrs. If 
> I push the limits then it fails to assign root port resources as 
> expected. Usually I assign 64G size to each Thunderbolt port for total 
> of 256G over four ports. It is total overkill but it gives me 
> satisfaction to know that the firmware is definitely not in control and 
> that if it is needed, it can be requested. For a production system, I 
> would likely tone it down a little.

"pci=nocrs" happens to work on many machines, but the _CRS information
is definitely required on many others.  For example, on any machine
with multiple host bridges, we need to know the actual host bridge
apertures to correctly assign resources to hot-added devices.
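
As a rough illustration of the multiple-host-bridge case (a sketch
only, not kernel code; the bridge names and window addresses are
invented, and on a real system they would come from each host
bridge's _CRS), a hot-added device's BAR has to land inside a window
that its own host bridge actually decodes:

```python
# Hypothetical sketch: why host bridge apertures matter when there are
# multiple host bridges. All names and addresses here are made up.

from typing import NamedTuple


class Window(NamedTuple):
    start: int  # first address routed to the bridge (inclusive)
    end: int    # last address routed to the bridge (inclusive)


# Two host bridges, each decoding a different MMIO range.
APERTURES = {
    "PCI0": [Window(0x8000_0000, 0xBFFF_FFFF)],
    "PCI1": [Window(0xC000_0000, 0xDFFF_FFFF)],
}


def bar_is_reachable(bridge: str, addr: int, size: int) -> bool:
    """Return True if [addr, addr + size - 1] fits in one window of bridge."""
    last = addr + size - 1
    return any(w.start <= addr and last <= w.end for w in APERTURES[bridge])


if __name__ == "__main__":
    # An address that PCI0 decodes is not routed to a device below PCI1.
    print(bar_is_reachable("PCI0", 0x9000_0000, 0x0100_0000))
    print(bar_is_reachable("PCI1", 0x9000_0000, 0x0100_0000))
```

Without the per-bridge windows there is nothing to tell the two cases
apart, which is the information "pci=nocrs" throws away.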

> > The "Do not use ACPI for ... PCI scanning" part indeed does suggest
> > that "pci=noacpi" could imply "pci=nocrs", but I don't think there's
> > anything to be gained by changing that now.
> > 
> > We probably *should* remove "or for PCI scanning" from the
> > documentation, because "pci=noacpi" only affects IRQs.
> > 
> > The only reason these exist at all is as a debugging aid to
> > temporarily work around issues in firmware or Linux until we can
> > develop a real fix or quirk that works without the user specifying a
> > kernel parameter.
> > 
> > > - Does anybody know why with pci=noacpi, you get dmesg warnings about 
> > > cannot find PCI int A mapping - but they do not seem to cause the 
> > > devices any issues in functioning? Is it because they are using MSI?
> > 
> > I doubt it.  I think you're just lucky.  In general the information
> > from _PRT and _CRS is essential for correct operation.
>
> Strange, because there are dozens of these warnings on multiple 
> computers and heaps of devices on Thunderbolt. If the BARs are assigned 
> then they work, every time, no questions asked. Maybe this suggests that 
> Thunderbolt is somehow exempt. Perhaps the controller has kept 
> configuration from the firmware setup and everything behind it does not 
> care.

Thunderbolt is not exempt.  _PRT tells us where INTx wires from PCI
are connected.  On systems with multiple host bridges, there are
multiple sets of those wires.  Your many examples of systems where
things seem to work are not arguments for it being safe to ignore _PRT
and _CRS in general.
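
To make the "multiple sets of wires" point concrete, here is a toy
lookup in the spirit of _PRT (a sketch under stated assumptions: the
bridge names, device numbers, and interrupt numbers are all invented,
and the real table format is defined by ACPI, not by this code):

```python
# Hypothetical sketch of a _PRT-style lookup. Each host bridge has its
# own table mapping (device number, INTx pin) to an interrupt number.
# Every value below is made up for illustration.

PRT = {
    "PCI0": {(0x1C, "INTA"): 16, (0x1C, "INTB"): 17},
    "PCI1": {(0x1C, "INTA"): 40, (0x1C, "INTB"): 41},
}


def intx_to_irq(bridge, device, pin):
    """Return the interrupt for an INTx pin, or None if no mapping exists."""
    return PRT.get(bridge, {}).get((device, pin))


if __name__ == "__main__":
    # Same device number and pin, different host bridge, different wire.
    print(intx_to_irq("PCI0", 0x1C, "INTA"))
    print(intx_to_irq("PCI1", 0x1C, "INTA"))
    # No entry at all: the OS cannot tell where the wire goes.
    print(intx_to_irq("PCI1", 0x1D, "INTA"))
```

The same (device, pin) pair resolves differently under each bridge, so
a guess that happens to be right on one machine says nothing about the
next one.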

> > > - Does pci=ignorefw sound good for a future proposal?
> > 
> > No, at least not without more description of what this would
> > accomplish.
>
> I have not given it much time and thought but basically it will be 
> something that can be added to incrementally. I would start with it 
> implying nocrs and releasing all root complex resources at boot before 
> the initial scan. That way we can see if the particular platform cares 
> if we do everything in the kernel.
> 
> > It sounds like you would want this to turn off _PRT, _CRS, and other
> > information from ACPI.  You may not like ACPI, but that information is
> > there for good reason, and if we didn't get it from ACPI we would have
> > to get it from somewhere else.
>
> The nocrs is vital because the BIOS places pitiful space behind the root 
> complex and will fail for assigning large BARs - hence why Xeon Phi 
> coprocessors with 8G or 16G BARs to map their whole RAM are only 
> supported on certain systems. I consider all BIOS / firmware to be 
> broken at this time, especially with most still catering for 32-bit OS 
> that almost nobody uses. I know not everybody feels that way, but I am 
> an idealist and aim to move things in the right direction.

Fine.  You can boot with "pci=nocrs" all you want, but it's not safe
in general.

If the BIOS does not report enough space for the root complex, that is
a BIOS defect: the host bridge _CRS should report all the space routed
to the host bridge.  In principle Linux could work around that by
reading the hardware registers that control the host bridge apertures,
but that would require Linux to know how to program every host bridge
of interest.  We don't have or want that sort of code in Linux because
it would be a huge maintenance burden.

> I would accept ACPI if it were just a collection of tables, memory 
> mapped like MMCONFIG. I know there are more complicated things that 
> require bytecode to run (although I do assert my belief that it should 
> be avoided if possible) but if the static tables were moved out of ACPI 
> then in my mind, it would be progress.
> 
> Is there a reason why PCI SIG could not add a future extension where all 
> of this information can be accessed with an extended MMCONFIG address 
> range?

For one thing, we don't know where MMCONFIG space lives.  We learn
that from the static MCFG table.
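
A small sketch of the dependency (illustrative only; the base address
used here is invented, and ecam_address() is a hypothetical helper,
not a kernel function): every MMCONFIG/ECAM access is computed from a
base address, and that base is exactly what MCFG supplies.

```python
# Hypothetical sketch: computing the address of a config register in
# MMCONFIG (ECAM) space. The layout (bus << 20 | device << 15 |
# function << 12 | offset) is the standard ECAM one; the base address
# must come from somewhere, and on ACPI systems that is the MCFG table.

def ecam_address(mcfg_base, bus, device, function, offset=0):
    """Return the physical address of a config-space register."""
    if not (0 <= bus <= 255 and 0 <= device <= 31
            and 0 <= function <= 7 and 0 <= offset <= 0xFFF):
        raise ValueError("bus/device/function/offset out of range")
    return mcfg_base + ((bus << 20) | (device << 15) | (function << 12) | offset)


if __name__ == "__main__":
    base = 0xE000_0000  # made-up example; the real value is read from MCFG
    print(hex(ecam_address(base, 1, 2, 3, 0x10)))
```

Without the base there is nothing to add the offset to, so MMCONFIG
cannot be the place where the OS first learns about itself.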

Bjorn

