linux-kernel.vger.kernel.org archive mirror
* [RFC] PCI: Fix kernel panic of root-port-less PCIe enum due to ASPM
@ 2016-10-06  9:34 Serge Semin
  2016-10-06 13:13 ` Bjorn Helgaas
  2016-11-08 23:29 ` Bjorn Helgaas
  0 siblings, 2 replies; 5+ messages in thread
From: Serge Semin @ 2016-10-06  9:34 UTC (permalink / raw)
  To: bhelgaas
  Cc: shawn.lin, luto, Sergey.Semin, linux-pci, linux-kernel, Serge Semin

Hello linux folks,

    Some time ago I discovered a kernel panic that occurs when the PCI subsystem
tries to enumerate a PCI Express bus with the ASPM service enabled. Here it is:

[    5.089667] CPU 0 Unable to handle kernel paging request at virtual
address 00000060, epc == 80317004, ra == 80316ac8
[    5.120952] Oops[#1]:
          ...
[    5.528438] Call Trace:
[    5.535640] [<80317004>] pcie_aspm_init_link_state+0x6c0/0x814
[    5.552843] [<80300c44>] pci_scan_slot+0x140/0x148
[    5.566957] [<80301dcc>] pci_scan_child_bus+0x50/0x1b0
[    5.582096] [<80301944>] pci_scan_bridge+0x25c/0x694
[    5.596724] [<80301e78>] pci_scan_child_bus+0xfc/0x1b0
[    5.611862] [<80301944>] pci_scan_bridge+0x25c/0x694
[    5.626488] [<80301e78>] pci_scan_child_bus+0xfc/0x1b0
[    5.641628] [<8030215c>] pci_scan_root_bus+0x64/0x124
[    5.656528] [<804ca298>] pcibios_scanbus+0xa8/0x188

    I am more than sure you are familiar with the issue, since I've found the
mailing list discussion: "PCI: avoid NULL deref in alloc_pcie_link_state"
https://patchwork.kernel.org/patch/2751651/
https://bugzilla.kernel.org/show_bug.cgi?id=60111

    You closed the bugzilla ticket with the following statement:
"I'm closing this as invalid because the simulated machine where the problem
occurs has an invalid PCIe topology (an Upstream Port with no Downstream Port
or Root Port above it).  As far as I know, there is no valid topology, e.g.,
a real hardware machine in the field, that would cause this failure."

    I strongly disagree with it, since I've got at least two pieces of hardware
with the PCIe-bus hierarchy described in that mailing list thread. One of them is
based on the Cavium Octeon III CN7020. Here is an ASCII diagram of the PCIe bus:

-+-[0000:01]---00.0-[02-06]--+-02.0-[03-05]--+-00.0-[04-05]----00.0-[05]--
 |                           |               \-00.1  Device [111d:808f]
 |                           \-04.0-[06]----00.0  Device [126f:0750]
 \-[0000:00]-

where 01:00.0 is an Upstream Port of an IDT PCIe switch.
/ # /usr/local/sbin/lspci -v -s 01:00.0
01:00.0 Class 0604: Device 111d:8061
        Flags: bus master, fast devsel, latency 0
        Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Memory at <unassigned> (32-bit, non-prefetchable) [size=2]
        Bus: primary=01, secondary=02, subordinate=06, sec-latency=0
        Memory behind bridge: 08000000-0dffffff
        Expansion ROM at <unassigned> [disabled] [size=2]
        Capabilities: [40] Express Upstream Port, MSI 00
        Capabilities: [c0] Power Management version 3
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [200] Virtual Channel
        Kernel driver in use: pcieport

As you can see, the PCI-bus hierarchy doesn't have a Root Port, and the very
first Upstream Port is directly connected to the Host-PCIe bridge of the MCU,
which of course is not listed by the lspci utility.
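
For reference, the port type that lspci prints as "Express Upstream Port" comes
from the Device/Port Type field of the PCI Express Capabilities register. A
minimal sketch of the decoding, assuming the field layout given in
include/uapi/linux/pci_regs.h; port_type() and the EXP_* names are hypothetical
user-space stand-ins for the kernel's pci_pcie_type() and PCI_EXP_* definitions:

```c
#include <stdint.h>

/* Assumed to match PCI_EXP_FLAGS_TYPE and PCI_EXP_TYPE_* in pci_regs.h. */
#define EXP_FLAGS_TYPE      0x00f0  /* Device/Port Type field, bits 7:4 */
#define EXP_TYPE_ROOT_PORT  0x4
#define EXP_TYPE_UPSTREAM   0x5
#define EXP_TYPE_DOWNSTREAM 0x6

/* Hypothetical helper: extract the port type from the 16-bit
 * PCI Express Capabilities register value. */
static inline unsigned int port_type(uint16_t exp_flags)
{
	return (exp_flags & EXP_FLAGS_TYPE) >> 4;
}
```

With this layout, a capabilities value of 0x0052 (capability version 2, type 5)
decodes to an Upstream Port, which is the type reported for 01:00.0 above.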

Unlike the fix suggested by Radim Krčmář, which would de facto just turn ASPM
off, I found a quick solution that disables ASPM only on the first link
(Host-PCIe => Upstream Port) of the PCIe bus for such a hierarchy.
ASPM for other PCIe-bus topologies should work the way it did before.

I hope the fix will be helpful.
Thanks,

=============================
Serge V. Semin
Leading Programmer
Embedded SW development group
T-platforms
=============================

Signed-off-by: Serge Semin <fancer.lancer@gmail.com>

---
 drivers/pci/pcie/aspm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aspm.c b/drivers/pci/pcie/aspm.c
index 0ec649d..a9295f29 100644
--- a/drivers/pci/pcie/aspm.c
+++ b/drivers/pci/pcie/aspm.c
@@ -522,7 +522,8 @@ static struct pcie_link_state *alloc_pcie_link_state(struct pci_dev *pdev)
 	INIT_LIST_HEAD(&link->children);
 	INIT_LIST_HEAD(&link->link);
 	link->pdev = pdev;
-	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) {
+	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
+	    (!pci_is_root_bus(pdev->bus->parent))) {
 		struct pcie_link_state *parent;
 		parent = pdev->bus->parent->self->link_state;
 		if (!parent) {
-- 
2.6.6


* Re: [RFC] PCI: Fix kernel panic of root-port-less PCIe enum due to ASPM
  2016-10-06  9:34 [RFC] PCI: Fix kernel panic of root-port-less PCIe enum due to ASPM Serge Semin
@ 2016-10-06 13:13 ` Bjorn Helgaas
  2016-10-06 14:27   ` Serge Semin
  2016-11-08 23:29 ` Bjorn Helgaas
  1 sibling, 1 reply; 5+ messages in thread
From: Bjorn Helgaas @ 2016-10-06 13:13 UTC (permalink / raw)
  To: Serge Semin
  Cc: bhelgaas, shawn.lin, luto, Sergey.Semin, linux-pci, linux-kernel

Hi Serge,

On Thu, Oct 06, 2016 at 12:34:15PM +0300, Serge Semin wrote:
>     You closed the bugzilla ticket with the following statement:
> "I'm closing this as invalid because the simulated machine where the problem
> occurs has an invalid PCIe topology (an Upstream Port with no Downstream Port
> or Root Port above it).  As far as I know, there is no valid topology, e.g.,
> a real hardware machine in the field, that would cause this failure."
> 
>     I strongly disagree with it, since I've got at least two pieces of hardware
> with the PCIe-bus hierarchy described in that mailing list thread. One of them is
> based on the Cavium Octeon III CN7020. Here is an ASCII diagram of the PCIe bus:

Thanks for this information.  I reopened that bugzilla; can you attach
complete dmesg logs and "lspci -vv" output for your systems?  As I
mentioned in comment #4, I'm completely open to fixing this.  My
objections at the time were (1) there was no known hardware that could
trigger the problem, and (2) the proposed fix was ugly and prone to
future breakage.  Since we now have real systems that trip over this,
we need to revisit it.

Bjorn


* Re: [RFC] PCI: Fix kernel panic of root-port-less PCIe enum due to ASPM
  2016-10-06 13:13 ` Bjorn Helgaas
@ 2016-10-06 14:27   ` Serge Semin
  0 siblings, 0 replies; 5+ messages in thread
From: Serge Semin @ 2016-10-06 14:27 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: bhelgaas, shawn.lin, luto, Sergey.Semin, linux-pci, linux-kernel

On Thu, Oct 06, 2016 at 08:13:58AM -0500, Bjorn Helgaas <helgaas@kernel.org> wrote:
> Hi Serge,
> 
> Thanks for this information.  I reopened that bugzilla; can you attach
> complete dmesg logs and "lspci -vv" output for your systems?  As I
> mentioned in comment #4, I'm completely open to fixing this.  My
> objections at the time were (1) there was no known hardware that could
> trigger the problem, and (2) the proposed fix was ugly and prone to
> future breakage.  Since we now have real systems that trip over this,
> we need to revisit it.
> 
> Bjorn
> 

Done. Welcome back to the bugzilla thread.

-Serge


* Re: [RFC] PCI: Fix kernel panic of root-port-less PCIe enum due to ASPM
  2016-10-06  9:34 [RFC] PCI: Fix kernel panic of root-port-less PCIe enum due to ASPM Serge Semin
  2016-10-06 13:13 ` Bjorn Helgaas
@ 2016-11-08 23:29 ` Bjorn Helgaas
  2016-11-25 14:03   ` Serge Semin
  1 sibling, 1 reply; 5+ messages in thread
From: Bjorn Helgaas @ 2016-11-08 23:29 UTC (permalink / raw)
  To: Serge Semin
  Cc: bhelgaas, shawn.lin, luto, Sergey.Semin, linux-pci, linux-kernel

Hi Serge,

On Thu, Oct 06, 2016 at 12:34:15PM +0300, Serge Semin wrote:
>     I am more than sure you are familiar with the issue, since I've found the
> mailing list discussion: "PCI: avoid NULL deref in alloc_pcie_link_state"
> https://patchwork.kernel.org/patch/2751651/
> https://bugzilla.kernel.org/show_bug.cgi?id=60111

I'm trying to puzzle out a few things here.  Maybe you can help me out?

- Does this issue exist in current upstream kernels?  Your dmesg shows a
  v3.19-based kernel.  c8fc9339409d ("PCI/ASPM: Use dev->has_secondary_link
  to find downstream links"), which appeared in v4.2, fixes a problem very
  similar to what you're reporting.

- When we dereference the NULL pointer, which device did we call
  pcie_aspm_init_link_state() for?

- https://bugzilla.kernel.org/attachment.cgi?id=240981 is the failing dmesg
  log, and it shows "vgaarb: device added: PCI:0000:04:00.0".
  
  Your lspci output (https://bugzilla.kernel.org/attachment.cgi?id=241001)
  shows 04:00.0 is a downstream port, but vga_arbiter_add_pci_device() only
  prints that message for VGA class devices.

  https://bugzilla.kernel.org/attachment.cgi?id=240991, the successful
  dmesg log, shows "vgaarb: device added: PCI:0000:06:00.0".  That makes
  more sense because 06:00.0 is class 0300, which is a VGA device.
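
The class check can be sketched as follows. This is a simplified user-space
illustration, not the vgaarb code itself: is_vga_class() and the CLASS_* names
are hypothetical, and the constants are assumed to match PCI_CLASS_DISPLAY_VGA
(0x0300) and PCI_CLASS_BRIDGE_PCI (0x0604) from the kernel's pci_ids.h. The
24-bit class code holds base class, subclass and programming interface, so
shifting right by 8 leaves the 16-bit class that lspci prints.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match PCI_CLASS_DISPLAY_VGA / PCI_CLASS_BRIDGE_PCI. */
#define CLASS_DISPLAY_VGA 0x0300
#define CLASS_BRIDGE_PCI  0x0604

/* Hypothetical helper: true if the 24-bit class code describes a VGA
 * controller, i.e. the only kind of device for which the arbiter would
 * print "device added". */
static bool is_vga_class(uint32_t class_code)
{
	return (class_code >> 8) == CLASS_DISPLAY_VGA;
}
```

Under these assumptions a class-0604 bridge such as a downstream port is
rejected, while a class-0300 device such as 06:00.0 is accepted.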

Bjorn


* Re: [RFC] PCI: Fix kernel panic of root-port-less PCIe enum due to ASPM
  2016-11-08 23:29 ` Bjorn Helgaas
@ 2016-11-25 14:03   ` Serge Semin
  0 siblings, 0 replies; 5+ messages in thread
From: Serge Semin @ 2016-11-25 14:03 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: bhelgaas, shawn.lin, luto, Sergey.Semin, linux-pci, linux-kernel

On Tue, Nov 08, 2016 at 05:29:57PM -0600, Bjorn Helgaas <helgaas@kernel.org> wrote:

Hello Bjorn,
My answers to your questions are inline below.

> Hi Serge,
> 
> I'm trying to puzzle out a few things here.  Maybe you can help me out?
> 
> - Does this issue exist in current upstream kernels?  Your dmesg shows a
>   v3.19-based kernel.  c8fc9339409d ("PCI/ASPM: Use dev->has_secondary_link
>   to find downstream links"), which appeared in v4.2, fixes a problem very
>   similar to what you're reporting.
> 

I saw that fix, but alas it doesn't fix the issue. I've tested kernel 4.4.24
without my patch applied, and the ASPM-related kernel panic still occurs
(see the stack trace at the start of this thread).

> - When we dereference the NULL pointer, which device did we call
>   pcie_aspm_init_link_state() for?
> 

My guess is that the problem arises while bus 2 is being enumerated. Since
there is no Root Port in my hierarchy, no pcie_link_state structure was created
above the first Upstream Port. So when the algorithm handles the second bus, it
needs the pcie_link_state structure of the parent bus, which was never created.
That's how the NULL dereference happens.
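
The pointer chain can be modelled in user space as follows. This is only a
sketch under stated assumptions: the structs are hypothetical, stripped-down
stand-ins for struct pci_bus, struct pci_dev and struct pcie_link_state, and
parent_link() mirrors the condition from the proposed patch rather than the
real alloc_pcie_link_state(). It assumes pdev sits on a root bus only if it is
a Root Port.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct link_state { int dummy; };
struct dev;

struct bus {
	struct bus *parent;  /* NULL for a root bus */
	struct dev *self;    /* bridge leading to this bus; NULL for a root bus */
};

struct dev {
	struct bus *bus;               /* bus the device sits on */
	struct link_state *link_state; /* ASPM state of the link below the port */
	bool is_root_port;
};

/* Same test as the kernel's pci_is_root_bus(): a root bus has no parent. */
static bool is_root_bus(const struct bus *b)
{
	return b->parent == NULL;
}

/* Returns the parent link state, or NULL when the link has to be treated
 * as the root of the link hierarchy. Without the is_root_bus() check,
 * pdev->bus->parent->self is NULL when the parent bus is a root bus, and
 * reading ->link_state through it is the NULL dereference. */
static struct link_state *parent_link(const struct dev *pdev)
{
	if (pdev->is_root_port || is_root_bus(pdev->bus->parent))
		return NULL;
	return pdev->bus->parent->self->link_state;
}
```

For a port on bus 2 whose parent bus is the root bus 1, parent_link() returns
NULL instead of following the missing bridge pointer; for ports deeper in the
hierarchy it returns the link state of the port above.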

> - https://bugzilla.kernel.org/attachment.cgi?id=240981 is the failing dmesg
>   log, and it shows "vgaarb: device added: PCI:0000:04:00.0".
>   
>   Your lspci output (https://bugzilla.kernel.org/attachment.cgi?id=241001)
>   shows 04:00.0 is a downstream port, but vga_arbiter_add_pci_device() only
>   prints that message for VGA class devices.
> 
>   https://bugzilla.kernel.org/attachment.cgi?id=240991, the successful
>   dmesg log, shows "vgaarb: device added: PCI:0000:06:00.0".  That makes
>   more sense because 06:00.0 is class 0300, which is a VGA device.
> 
> Bjorn

I can't be sure about the reason for that strange enumeration, but I can assure
you that the bus confusion isn't the cause of the ASPM panic. I can only guess
that the misleading BDF is caused by SMP (I've got a processor with two cores)
combined with the ASPM panic: VGA driver initialization may happen
concurrently with PCI bus enumeration.

Regards,
-Sergey
