LKML Archive on lore.kernel.org
 help / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: joeyli <jlee@suse.com>
Cc: "Lee, Chun-Yi" <joeyli.kernel@gmail.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H . Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] x86/PCI: Claim the resources of firmware enabled IOAPIC before children bus
Date: Wed, 8 Aug 2018 16:23:22 -0500
Message-ID: <20180808212322.GF49411@bhelgaas-glaptop.roam.corp.google.com> (raw)
In-Reply-To: <20180808155318.GE13767@linux-l9pv.suse>

On Wed, Aug 08, 2018 at 11:53:18PM +0800, joeyli wrote:
> Hi Bjorn,
> 
> First, thanks for your review!
> 
> On Mon, Aug 06, 2018 at 04:48:07PM -0500, Bjorn Helgaas wrote:
> > On Tue, Jul 24, 2018 at 07:01:44PM +0800, Lee, Chun-Yi wrote:
> > > I got a machine that the resource of firmware enabled IOAPIC conflicts
> > > with the resource of a children bus when the PCI host bus be hotplug.
> > > 
> > > [ 3182.243325] PCI host bridge to bus 0001:40
> > > [ 3182.243328] pci_bus 0001:40: root bus resource [io  0xc000-0xdfff window]
> > > [ 3182.243330] pci_bus 0001:40: root bus resource [mem 0xdc000000-0xebffffff window]
> > > [ 3182.243331] pci_bus 0001:40: root bus resource [mem 0x212400000000-0x2125ffffffff window]
> > > [ 3182.243334] pci_bus 0001:40: root bus resource [bus 40-7e]
> > > ...
> > > [ 3182.244737] pci 0001:40:05.4: [8086:6f2c] type 00 class 0x080020
> > > [ 3182.244746] pci 0001:40:05.4: reg 0x10: [mem 0xdc000000-0xdc000fff]
> > > ...
> > > [ 3182.246697] pci 0001:40:02.0: PCI bridge to [bus 41]
> > > [ 3182.246702] pci 0001:40:02.0:   bridge window [mem 0xdc000000-0xdc7fffff]
> > > ...
> > > pci 0001:40:05.4: can't claim BAR 0 [mem 0xdc000000-0xdc000fff]: address conflict with PCI Bus 0001:41 [mem 0xdc000000-0xdc7fffff]
> > > 
> > > The bus topology:
> > > 
> > >  +-[0001:40]-+-02.0-[41]--
> > >  |           +-03.0-[41]--
> > >  |           +-03.2-[41]--+-00.0  Intel Corporation I350 Gigabit Network Connection
> > >  |           |            +-00.1  Intel Corporation I350 Gigabit Network Connection
> > >  |           |            +-00.2  Intel Corporation I350 Gigabit Network Connection
> > >  |           |            \-00.3  Intel Corporation I350 Gigabit Network Connection
> > >  |           +-05.0  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D Map/VTd_Misc/System Management
> > >  |           +-05.1  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO Hot Plug
> > >  |           +-05.2  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D IIO RAS/Control Status/Global Errors
> > >  |           \-05.4  Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D I/O APIC
> > > 
> > > This problem causes that the NIC behine the child bus was not available
> > > after PCI host bridge hotpluged.
> > > 
> > > Kernel does not want to change resource of firmware enabled IOAPIC, but
> > > the priority of children bus's resources are higher than any other devices.
> > > So this conflict can not be handled by the reassigning logic of kernel.
> 
> Sorry for my description is not clear. The "priority" is for resources
> clamining, not for the address decoding.
> 
> > I don't understand this paragraph.  AFAIK, if two devices on a bus
> > both decode the same address, as the IOAPIC and the bridge do here,
> > the only real "priority" in PCI would be subtractive decode.  But I
> > don't think that applies here since the bridge is using a positive
> > decode window.
> 
> Sorry for I didn't understand... How could you know the bridge doesn't
> apply subtractive decode?

A subtractive decode bridge forwards anything appearing on its primary
bus to its secondary bus.  In conventional PCI, it only does this if
no other agent on the primary bus asserts DEVSEL# (PCI r3.0, sec
3.6.1).  In PCIe, the primary "bus" is a link and there's only one
device on it, so a subtractive decode bridge could forward anything it
sees.  If the subtractive decode bridge is part of a multi-function
device, I assume that multi-function device would have to determine
internally whether the subtractive decode bridge or another function
should claim the transaction.

Either way, a subtractive decode bridge can forward anything that
appears on its primary bus, so a subtractive decode window effectively
contains everything the upstream bridge passes down.  In this case you
have:

  pci_bus 0001:40: root bus resource [mem 0xdc000000-0xebffffff window]
  pci_bus 0001:40: root bus resource [mem 0x212400000000-0x2125ffffffff window]
  pci 0001:40:02.0: PCI bridge to [bus 41]
  pci 0001:40:02.0:   bridge window [mem 0xdc000000-0xdc7fffff]

The 40:02.0 bridge could see [mem 0xdc000000-0xebffffff] or [mem
0x212400000000-0x2125ffffffff] on its primary bus, so if it were a
subtractive decode bridge, its "window" would contain both of those
regions, and we should label them as "(subtractive decode)" in the
dmesg log.

Since [mem 0xdc000000-0xdc7fffff] is only part of the first root bus
resource, I infer that it must be a positively decoded window.
"lspci -vv" would tell you for sure.

> > I would expect that a read to the [mem 0xdc000000-0xdc000fff] range
> > would potentially receive two read completions, which would cause an
> > Unexpected Completion error.
> 
> Thanks for your information. The I350 NIC doesn't work after hotplug.
> So it may causes by Unexpected Completion error as you mentioned.
>  
> > Maybe you mean that we don't want to change the IOAPIC resources, but
> > it's OK if we move the bridge window, so in that sense, the IOAPIC is
> > "higher priority" than the bridge window?  This is the reverse of what
> > your paragraph seems to say.
> 
> Yes, this is what I mean. Sorry for my paragraph causes confusing. 
> 
> The memory decoder bit in the command register of 0001:40:05.4 IOAPIC
> is raised by firmware after hotplug. So kernel treats it as a
> _firmware enabled_ IOAPIC. Because kernel don't want to change the IOAPIC
> resources, so my patch try to claim the IOAPIC resources before all
> bridges. It can workaround the hotplug problem on issue machine.
> 
> But, actually I am not sure that this hacking makes sense. And, I also
> don't know why the memory decoder bit is raised by firmware when hotplug.
> Maybe this is a firmware problem.

That's a good point.  It's fine for firmware to enable the IOAPIC
memory decode.  But the address conflict:

  [ 3906.092119] pci 0001:40:05.4: reg 0x10: [mem 0xdc000000-0xdc000fff]
  [ 3906.093898] pci 0001:40:02.0: PCI bridge to [bus 41]
  [ 3906.093902] pci 0001:40:02.0:   bridge window [mem 0xdc000000-0xdc7fffff]

is not so fine.  That looks like a firmware bug to me.  These devices
should power up with zeroes in their BARs, so the addresses must have
been assigned by firmware before it sent the ACPI hotplug notification
to Linux.

In principle, Linux *should* be able to recover from that by moving
the IOAPIC or the bridge window, but obviously we don't.

What are the chances of getting a firmware fix?  Has this firmware
already shipped to customers?

> > > This patch claims the resources of firmware enabled IOAPIC before
> > > children bus. Then kernel gets a chance to reassign the resources of
> > > children bus to avoid the conflict.
> > 
> > Can you open a report at https://bugzilla.kernel.org, attach the
> > before and after dmesg logs, and include the URL in the commit log?
> 
> OK, I have filed a bug on kernel bugzilla and also attached dmesg
> log:
> 	https://bugzilla.kernel.org/show_bug.cgi?id=200765

Thanks.  Please include this URL in the changelog if you post an
updated patch.

Bjorn

  reply index

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-24 11:01 Lee, Chun-Yi
2018-08-06 21:48 ` Bjorn Helgaas
2018-08-08 15:53   ` joeyli
2018-08-08 21:23     ` Bjorn Helgaas [this message]
2018-08-10  9:25       ` joeyli
2018-08-10 13:58         ` Bjorn Helgaas
2018-08-12  0:15           ` joeyli
2018-08-13 18:45             ` Bjorn Helgaas

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180808212322.GF49411@bhelgaas-glaptop.roam.corp.google.com \
    --to=helgaas@kernel.org \
    --cc=bhelgaas@google.com \
    --cc=hpa@zytor.com \
    --cc=jlee@suse.com \
    --cc=joeyli.kernel@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org linux-kernel@archiver.kernel.org
	public-inbox-index lkml


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox