* X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
@ 2015-07-24 22:42 ` Bjorn Helgaas
  0 siblings, 0 replies; 49+ messages in thread
From: Bjorn Helgaas @ 2015-07-24 22:42 UTC (permalink / raw)
  To: Tanmay Inamdar; +Cc: Duc Dang, linux-pci, linux-arm-kernel, linux-kernel

I regularly see faults like this on an APM X-Gene:

  U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
  CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
       32 KB ICACHE, 32 KB DCACHE
       SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
  ...
  Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
  Internal error: : 96000010 [#1] SMP
  Modules linked in:
  CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
  Hardware name: APM X-Gene Mustang board (DT)
  task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
  PC is at pci_generic_config_read32+0x4c/0xb8
  LR is at pci_generic_config_read32+0x40/0xb8
  pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
  ...
  Call trace:
  [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
  [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
  [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
  [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
  [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
  [<ffffffc0001c361c>] __vfs_read+0x44/0x128
  [<ffffffc0001c3e28>] vfs_read+0x84/0x144
  [<ffffffc0001c4764>] SyS_read+0x50/0xb0

  # lspci
  00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04)
  01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family

I first saw this on an ancient kernel and thought it was likely specific to
my environment, but I'm now using an almost unmodified v4.1 kernel and
still seeing it.  Does anybody else see this?  The box does have a PCI card
installed, but I haven't yet worked out what device's config space we're
trying to read.

Is there anything I can do to debug this?  I'm not an arm64 guy, but my
impression is that this is a page fault, and the address seems to be in the
"cfg" area ioremapped by xgene_pcie_map_reg(), so I'm not sure this is
really a PCI issue -- maybe that page mapping got trashed by somebody else?

Bjorn
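
For anyone following the trace: pci_user_read_config_byte() ends up in
pci_generic_config_read32(), which, as far as I understand the v4.1 helper,
issues one aligned 32-bit read and then shifts and masks out the requested
bytes. Below is a minimal Python sketch of that arithmetic only. The function
names are made up for illustration and the dword value is arbitrary; it is not
data read from this board.

```python
def split_config_access(where, size):
    """Return (aligned_offset, shift, mask) for a 1-, 2- or 4-byte config read.

    Hypothetical helper: models the offset arithmetic only, no hardware access.
    """
    if size not in (1, 2, 4):
        raise ValueError("config accesses are 1, 2 or 4 bytes wide")
    aligned = where & ~0x3                        # the 32-bit read is always dword-aligned
    shift = 8 * (where & 3) if size <= 2 else 0   # byte lane inside that dword
    mask = (1 << (size * 8)) - 1
    return aligned, shift, mask


def extract_config_value(dword, where, size):
    """Pick the requested bytes out of an already-read 32-bit value."""
    _, shift, mask = split_config_access(where, size)
    return (dword >> shift) & mask


# A byte read at offset 0x35 becomes a dword read at 0x34, lane 1.
aligned, shift, mask = split_config_access(0x35, 1)
value = extract_config_value(0x11223344, 0x35, 1)   # arbitrary example dword
```

The low 12 bits of the faulting address in the report are 0x034, a
4-byte-aligned offset. That is at least consistent with an aligned dword read,
assuming the ioremapped window is page-aligned and the driver's map_bus hook
returns base plus register offset.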

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-24 22:42 ` Bjorn Helgaas
@ 2015-07-25  0:05   ` Duc Dang
  -1 siblings, 0 replies; 49+ messages in thread
From: Duc Dang @ 2015-07-25  0:05 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

Hi Bjorn,

On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>
> I regularly see faults like this on an APM X-Gene:
>
>   U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>   CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>        32 KB ICACHE, 32 KB DCACHE
>        SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>   ...
>   Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>   Internal error: : 96000010 [#1] SMP
>   Modules linked in:
>   CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>   Hardware name: APM X-Gene Mustang board (DT)
>   task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>   PC is at pci_generic_config_read32+0x4c/0xb8
>   LR is at pci_generic_config_read32+0x40/0xb8
>   pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>   ...
>   Call trace:
>   [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>   [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>   [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>   [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>   [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>   [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>   [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>   [<ffffffc0001c4764>] SyS_read+0x50/0xb0

The log shows the kernel getting an exception when trying to access the
Mellanox card's configuration space. This is usually caused by suboptimal
PCIe SerDes parameters being used on your board, which result in bad link
quality.
The PCIe SerDes programming is done in U-Boot, so I suggest you upgrade
to our latest X-Gene U-Boot release.

To get the latest X-Gene U-Boot release, please use the official APM
support channel:
https://myapm.apm.com

If you don't have an account at myapm.apm.com, please register one
using the following link:
https://myapm.apm.com/user/register

>
>   # lspci
>   00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04)
>   01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
>
> I first saw this on an ancient kernel and thought it was likely specific to
> my environment, but I'm now using an almost unmodified v4.1 kernel and
> still seeing it.  Does anybody else see this?  The box does have a PCI card
> installed, but I haven't yet worked out what device's config space we're
> trying to read.
>
> Is there anything I can do to debug this?  I'm not an arm64 guy, but my
> impression is that this is a page fault, and the address seems to be in the
> "cfg" area ioremapped by xgene_pcie_map_reg(), so I'm not sure this is
> really a PCI issue -- maybe that page mapping got trashed by somebody else?
>
> Bjorn


-- 
Regards,
Duc Dang.

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-25  0:05   ` Duc Dang
@ 2015-07-27 11:36     ` Catalin Marinas
  -1 siblings, 0 replies; 49+ messages in thread
From: Catalin Marinas @ 2015-07-27 11:36 UTC (permalink / raw)
  To: Duc Dang
  Cc: Bjorn Helgaas, linux-pci, Tanmay Inamdar, linux-arm, linux-kernel

On Fri, Jul 24, 2015 at 05:05:19PM -0700, Duc Dang wrote:
> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> > I regularly see faults like this on an APM X-Gene:
> >
> >   U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> >   CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> >        32 KB ICACHE, 32 KB DCACHE
> >        SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> >   ...
> >   Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034

That's generated by an external device (PCIe root complex, card etc.),
not by some mis-configured CPU setting.

> >   Internal error: : 96000010 [#1] SMP
> >   Modules linked in:
> >   CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
> >   Hardware name: APM X-Gene Mustang board (DT)
> >   task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
> >   PC is at pci_generic_config_read32+0x4c/0xb8
> >   LR is at pci_generic_config_read32+0x40/0xb8
> >   pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
> >   ...
> >   Call trace:
> >   [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
> >   [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
> >   [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
> >   [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
> >   [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
> >   [<ffffffc0001c361c>] __vfs_read+0x44/0x128
> >   [<ffffffc0001c3e28>] vfs_read+0x84/0x144
> >   [<ffffffc0001c4764>] SyS_read+0x50/0xb0
> 
> The log shows the kernel getting an exception when trying to access the
> Mellanox card's configuration space. This is usually caused by suboptimal
> PCIe SerDes parameters being used on your board, which result in bad link
> quality.

I would have hoped that "suboptimal" means that it still works, albeit
not fully optimal ;).

> The PCIe SerDes programming is done in U-Boot, so I suggest you upgrade
> to our latest X-Gene U-Boot release.
> 
> To get the latest X-Gene U-Boot release, please use the official APM
> support channel:
> https://myapm.apm.com
> 
> If you don't have an account at myapm.apm.com, please register one
> using the following link:
> https://myapm.apm.com/user/register

Isn't the latest U-Boot source for X-Gene publicly available anywhere?
It's GPL code anyway, so it shouldn't contain proprietary code that
requires registration or click-through agreements.

-- 
Catalin

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-24 22:42 ` Bjorn Helgaas
  (?)
@ 2015-07-28 14:37   ` Dall, Elizabeth J
  -1 siblings, 0 replies; 49+ messages in thread
From: Dall, Elizabeth J @ 2015-07-28 14:37 UTC (permalink / raw)
  To: Bjorn Helgaas, Tanmay Inamdar
  Cc: Duc Dang, linux-pci, linux-arm-kernel, linux-kernel

On 07/24/2015 04:43 PM, Bjorn Helgaas wrote:
> I regularly see faults like this on an APM X-Gene:
> 
>   U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>   CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>        32 KB ICACHE, 32 KB DCACHE
>        SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>   ...
>   Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034

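The 0x96000010 is the value of the ESR register. The EC field (bits
[31:26]) is 0x25, which decodes to "Data Abort taken without a change in
Exception level". The ISS field is 0x10: its DFSC (bits [5:0]) reads
"Synchronous External abort, not on translation table walk", which
matches the message the kernel printed.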

-Betty Dall
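
For reference, the field split of the ESR value quoted above can be checked
with a few lines of Python. This is only a sketch: decode_esr() is a made-up
name, and the lookup tables list just the handful of ARMv8 ESR_ELx encodings
relevant to this thread, not the full architecture tables.

```python
# Subset of ESR_ELx encodings, limited to what this thread needs.
EC_NAMES = {
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
    0x26: "SP alignment fault",
}
DFSC_NAMES = {
    0x10: "Synchronous External abort, not on translation table walk",
    0x21: "Alignment fault",
}


def decode_esr(esr):
    """Split a 32-bit ESR_ELx value into EC, IL and ISS (hypothetical helper)."""
    ec = (esr >> 26) & 0x3F      # exception class, bits [31:26]
    il = (esr >> 25) & 0x1       # instruction length bit, bit [25]
    iss = esr & 0x1FFFFFF        # instruction specific syndrome, bits [24:0]
    result = {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "not in this table"),
        "il": il,
        "iss": iss,
    }
    if ec in (0x24, 0x25):       # data aborts carry a fault status code in the ISS
        dfsc = iss & 0x3F
        result["dfsc"] = dfsc
        result["dfsc_name"] = DFSC_NAMES.get(dfsc, "not in this table")
        result["wnr"] = (iss >> 6) & 0x1   # 0 = read, 1 = write
    return result


fields = decode_esr(0x96000010)
for key in ("ec", "ec_name", "dfsc", "dfsc_name", "wnr"):
    print(key, "=", fields[key])
```

With the value from the oops this gives EC 0x25 and DFSC 0x10 with WnR clear,
i.e. a synchronous external abort on a read, which fits a config space load
that the interconnect or root complex refused to complete.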


* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-25  0:05   ` Duc Dang
@ 2015-07-28 16:43     ` Bjorn Helgaas
  -1 siblings, 0 replies; 49+ messages in thread
From: Bjorn Helgaas @ 2015-07-28 16:43 UTC (permalink / raw)
  To: Duc Dang; +Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@apm.com> wrote:
> Hi Bjorn,
>
> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>
>> I regularly see faults like this on an APM X-Gene:
>>
>>   U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>>   CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>>        32 KB ICACHE, 32 KB DCACHE
>>        SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>>   ...
>>   Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>>   Internal error: : 96000010 [#1] SMP
>>   Modules linked in:
>>   CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>>   Hardware name: APM X-Gene Mustang board (DT)
>>   task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>>   PC is at pci_generic_config_read32+0x4c/0xb8
>>   LR is at pci_generic_config_read32+0x40/0xb8
>>   pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>>   ...
>>   Call trace:
>>   [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>>   [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>>   [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>>   [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>>   [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>>   [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>>   [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>>   [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>
> The log shows the kernel getting an exception when trying to access the
> Mellanox card's configuration space. This is usually caused by suboptimal
> PCIe SerDes parameters being used on your board, which result in bad link
> quality.
> The PCIe SerDes programming is done in U-Boot, so I suggest you upgrade
> to our latest X-Gene U-Boot release.

I installed U-Boot 1.15.12, which I thought was the latest.  I'm still
seeing this issue regularly, approx once/hour.

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-27 11:36     ` Catalin Marinas
@ 2015-07-28 17:39       ` Duc Dang
  -1 siblings, 0 replies; 49+ messages in thread
From: Duc Dang @ 2015-07-28 17:39 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Bjorn Helgaas, linux-pci, Tanmay Inamdar, linux-arm,
	Linux Kernel Mailing List

On Mon, Jul 27, 2015 at 4:36 AM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Fri, Jul 24, 2015 at 05:05:19PM -0700, Duc Dang wrote:
>> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> > I regularly see faults like this on an APM X-Gene:
>> >
>> >   U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>> >   CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>> >        32 KB ICACHE, 32 KB DCACHE
>> >        SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>> >   ...
>> >   Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>
> That's generated by an external device (PCIe root complex, card etc.),
> not by some mis-configured CPU setting.
>
>> >   Internal error: : 96000010 [#1] SMP
>> >   Modules linked in:
>> >   CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>> >   Hardware name: APM X-Gene Mustang board (DT)
>> >   task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>> >   PC is at pci_generic_config_read32+0x4c/0xb8
>> >   LR is at pci_generic_config_read32+0x40/0xb8
>> >   pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>> >   ...
>> >   Call trace:
>> >   [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>> >   [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>> >   [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>> >   [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>> >   [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>> >   [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>> >   [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>> >   [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>>
>> The log shows the kernel getting an exception when trying to access the
>> Mellanox card's configuration space. This is usually caused by suboptimal
>> PCIe SerDes parameters being used on your board, which result in bad link
>> quality.
>
> I would have hoped that "suboptimal" means that it still works, albeit
> not fully optimal ;).

Yes, it should still work, but you may see crashes occasionally due to
link quality.

>
>> The PCIe SerDes programming is done in U-Boot, so I suggest you upgrade
>> to our latest X-Gene U-Boot release.
>>
>> To get the latest X-Gene U-Boot release, please use the official APM
>> support channel:
>> https://myapm.apm.com
>>
>> If you don't have an account at myapm.apm.com, please register one
>> using the following link:
>> https://myapm.apm.com/user/register
>
> Isn't the latest U-Boot source for X-Gene publicly available anywhere?
> It's GPL code anyway, so it shouldn't contain proprietary code that
> requires registration or click-through agreements.

The APM X-Gene U-Boot isn't publicly available yet. If this is required,
though, we can set up a public Git repository hosted on an APM server.

For now, customers who have a board from APM will have to use MyAPM to
get the U-Boot source and binary.
>
> --
> Catalin



-- 
Regards,
Duc Dang.

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-28 16:43     ` Bjorn Helgaas
@ 2015-07-28 17:45       ` Duc Dang
  -1 siblings, 0 replies; 49+ messages in thread
From: Duc Dang @ 2015-07-28 17:45 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@apm.com> wrote:
>> Hi Bjorn,
>>
>> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>>
>>> I regularly see faults like this on an APM X-Gene:
>>>
>>>   U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>>>   CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>>>        32 KB ICACHE, 32 KB DCACHE
>>>        SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>>>   ...
>>>   Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>>>   Internal error: : 96000010 [#1] SMP
>>>   Modules linked in:
>>>   CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>>>   Hardware name: APM X-Gene Mustang board (DT)
>>>   task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>>>   PC is at pci_generic_config_read32+0x4c/0xb8
>>>   LR is at pci_generic_config_read32+0x40/0xb8
>>>   pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>>>   ...
>>>   Call trace:
>>>   [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>>>   [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>>>   [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>>>   [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>>>   [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>>>   [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>>>   [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>>>   [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>>
>> The log shows kernel gets an exception when trying to access Mellanox
>> card configuration space. This is usually due to suboptimal PCIe
>> SerDes parameters are using in your board, which will cause bad link
>> quality.
>> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
>> U-Boot upgrade to our latest X-Gene U-Boot release.
>
> I installed U-Boot 1.15.12, which I thought was the latest.  I'm still
> seeing this issue regularly, approx once/hour.

Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
version to use. Are you running any PCIe traffic test when the error
happens? I will also try to reproduce the issue on my Mustang board.

It would also be useful if you could share your "lspci -vvv" output
from when the board is running, so we can check whether any error
status is reported.

-- 
Regards,
Duc Dang.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-28 17:39       ` Duc Dang
@ 2015-07-28 18:36         ` Bjorn Helgaas
  -1 siblings, 0 replies; 49+ messages in thread
From: Bjorn Helgaas @ 2015-07-28 18:36 UTC (permalink / raw)
  To: Duc Dang
  Cc: Catalin Marinas, linux-pci, Tanmay Inamdar, linux-arm,
	Linux Kernel Mailing List

On Tue, Jul 28, 2015 at 12:39 PM, Duc Dang <dhdang@apm.com> wrote:
> On Mon, Jul 27, 2015 at 4:36 AM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
>> On Fri, Jul 24, 2015 at 05:05:19PM -0700, Duc Dang wrote:
>>> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>> > I regularly see faults like this on an APM X-Gene:
>>> >
>>> >   U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>>> >   CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>>> >        32 KB ICACHE, 32 KB DCACHE
>>> >        SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>>> >   ...
>>> >   Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>>
>> That's generated by an external device (PCIe root complex, card etc.)
>> and some mis-configured CPU setting.
>>
>>> >   Internal error: : 96000010 [#1] SMP
>>> >   Modules linked in:
>>> >   CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>>> >   Hardware name: APM X-Gene Mustang board (DT)
>>> >   task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>>> >   PC is at pci_generic_config_read32+0x4c/0xb8
>>> >   LR is at pci_generic_config_read32+0x40/0xb8
>>> >   pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>>> >   ...
>>> >   Call trace:
>>> >   [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>>> >   [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>>> >   [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>>> >   [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>>> >   [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>>> >   [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>>> >   [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>>> >   [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>>>
>>> The log shows kernel gets an exception when trying to access Mellanox
>>> card configuration space. This is usually due to suboptimal PCIe
>>> SerDes parameters are using in your board, which will cause bad link
>>> quality.
>>
>> I would have hoped that "suboptimal" means that it still works, albeit
>> not fully optimal ;).
>
> Yes, it should still work, but you may see crashes occasionally due to
> link quality.

A crash seems like a too-severe response to a link quality issue.
Isn't there some way to retry the access or return an error, so we
don't have to crash the whole system?
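
For reference, the fault code 0x96000010 in the quoted log can be split
into its ESR_ELx fields. This is a minimal sketch in Python; the function
name and the two lookup tables are illustrative and cover only the values
that appear in this report.

```python
def decode_esr(esr: int) -> dict:
    """Split an ARMv8 ESR_ELx value into its top-level fields."""
    ec = (esr >> 26) & 0x3F   # Exception Class, bits [31:26]
    il = (esr >> 25) & 0x1    # Instruction Length, bit [25]
    iss = esr & 0x1FFFFFF     # Instruction Specific Syndrome, bits [24:0]
    dfsc = iss & 0x3F         # Data Fault Status Code, bits [5:0] (data aborts)
    return {"ec": ec, "il": il, "iss": iss, "dfsc": dfsc}


# Only the encodings relevant to this report; not a complete table.
EC_NAMES = {0x25: "data abort taken without a change in exception level"}
DFSC_NAMES = {0x10: "synchronous external abort, not on a translation table walk"}

if __name__ == "__main__":
    fields = decode_esr(0x96000010)
    print("EC   = 0x%02x: %s" % (fields["ec"], EC_NAMES.get(fields["ec"], "?")))
    print("IL   = %d" % fields["il"])
    print("DFSC = 0x%02x: %s" % (fields["dfsc"], DFSC_NAMES.get(fields["dfsc"], "?")))
```

The decode says only that the load itself was aborted by something outside
the CPU while the kernel was running. It does not say which device or link
caused it.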

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-28 17:45       ` Duc Dang
@ 2015-07-28 21:29         ` Bjorn Helgaas
  -1 siblings, 0 replies; 49+ messages in thread
From: Bjorn Helgaas @ 2015-07-28 21:29 UTC (permalink / raw)
  To: Duc Dang; +Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote:
> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@apm.com> wrote:
> >> Hi Bjorn,
> >>
> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> >>>
> >>> I regularly see faults like this on an APM X-Gene:
> >>>
> >>>   U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> >>>   CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> >>>        32 KB ICACHE, 32 KB DCACHE
> >>>        SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> >>>   ...
> >>>   Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
> >>>   Internal error: : 96000010 [#1] SMP
> >>>   Modules linked in:
> >>>   CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
> >>>   Hardware name: APM X-Gene Mustang board (DT)
> >>>   task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
> >>>   PC is at pci_generic_config_read32+0x4c/0xb8
> >>>   LR is at pci_generic_config_read32+0x40/0xb8
> >>>   pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
> >>>   ...
> >>>   Call trace:
> >>>   [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
> >>>   [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
> >>>   [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
> >>>   [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
> >>>   [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
> >>>   [<ffffffc0001c361c>] __vfs_read+0x44/0x128
> >>>   [<ffffffc0001c3e28>] vfs_read+0x84/0x144
> >>>   [<ffffffc0001c4764>] SyS_read+0x50/0xb0
> >>
> >> The log shows kernel gets an exception when trying to access Mellanox
> >> card configuration space. This is usually due to suboptimal PCIe
> >> SerDes parameters are using in your board, which will cause bad link
> >> quality.
> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
> >> U-Boot upgrade to our latest X-Gene U-Boot release.
> >
> > I installed U-Boot 1.15.12, which I thought was the latest.  I'm still
> > seeing this issue regularly, approx once/hour.
> 
> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
> version to use. Are you running any PCIe traffic test when the error
> happens? 

Nope, the machine was either idle or running a reboot test; no PCIe stress
test or anything.

> And it will be useful if you can share your "lspci -vvv" output when
> the board is running, we can check to see if there is any error status
> reported.

Here's some lspci output and info about the firmware I'm running.
Obviously this lspci output was collected before a crash.  I have also
seen lspci output where "CESta: RxErr+" was set on the 00:00.0 Root Port.

U-Boot 2013.04-mustang_sw_1.15.12 (May 20 2015 - 10:03:33)

CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
     32 KB ICACHE, 32 KB DCACHE
     SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
Boot from SPI-NOR
Slimpro FW:
        Ver: 2.4 (build 01.15.12.00 2015/05/20)
        PMD: 970 mV
        SOC: 950 mV
Board: Mustang - AppliedMicro APM883208-xNA24SPT Reference Board
I2C:   ready
DRAM:  ECC 32 GiB @ 1600MHz
SF: Detected N25Q256 with page size 256 Bytes, total 32 MiB
MMC:   X-Gene SD/SDIO/eMMC: 0
PCIE0: (RC) X8 GEN-3 link up
  00:00.0     - 10e8:e004 - Bridge device
   01:00.0    - 15b3:1007 - Network controller

# lspci -vvv
00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04) (prog-if 00 [Normal decode])
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
	I/O behind bridge: 0000f000-00000fff
	Memory behind bridge: 80000000-82ffffff
	Prefetchable memory behind bridge: 0000000083000000-00000000830fffff
	Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
	BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
		PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
	Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
			ExtTag- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr+ UnsuppReq- AuxPwr- TransPend+
		LnkCap:	Port #0, Speed unknown, Width x8, ASPM L0s L1, Latency L0 unlimited, L1 unlimited
			ClockPM- Surprise+ LLActRep+ BwNot+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
		SltCap:	AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
			Slot #1, PowerLimit 10.000W; Interlock- NoCompl-
		SltCtl:	Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
			Control: AttnInd Off, PwrInd Off, Power- Interlock-
		SltSta:	Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
			Changed: MRL- PresDet- LinkState+
		RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
		RootCap: CRSVisible-
		RootSta: PME ReqID 0000, PMEStatus- PMEPending-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+ ARIFwd-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
		LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [180 v1] #19
	Capabilities: [150 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Kernel driver in use: pcieport

01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 226
	Region 0: [virtual] Memory at e182000000 (32-bit, non-prefetchable) [size=1M]
	Region 2: [virtual] Memory at e180000000 (32-bit, non-prefetchable) [size=32M]
	[virtual] Expansion ROM at e183000000 [disabled] [size=1M]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [9c] MSI-X: Enable- Count=64 Masked-
		Vector table: BAR=0 offset=0007c000
		PBA: BAR=0 offset=0007d000
	Capabilities: [60] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #8, Speed unknown, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot-
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
		LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB
	Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [148 v1] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx
	Capabilities: [154 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [18c v1] #19
	Kernel modules: mlx4_core
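
A minimal sketch of how the status fields in "lspci -vvv" output like the
above could be scanned for set flags. The function name and the choice of
fields (DevSta, UESta, CESta) are assumptions made for illustration; this
is not an existing tool.

```python
import re

# Status registers to inspect; this selection is an illustrative assumption.
STATUS_FIELDS = ("DevSta", "UESta", "CESta")


def set_status_flags(lspci_text: str) -> dict:
    """Return {"<device> <field>": [flags printed with a trailing '+']}."""
    found = {}
    device = None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new device, e.g. "00:00.0 PCI bridge: ..."
            device = line.split()[0]
            continue
        m = re.match(r"\s+(\w+):\s+(.*)", line)
        if m and m.group(1) in STATUS_FIELDS:
            flags = [tok[:-1] for tok in m.group(2).split() if tok.endswith("+")]
            if flags:
                found[f"{device} {m.group(1)}"] = flags
    return found


if __name__ == "__main__":
    sample = (
        "00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04)\n"
        "\t\tDevSta:\tCorrErr+ UncorrErr- FatalErr+ UnsuppReq- AuxPwr- TransPend+\n"
        "\t\tCESta:\tRxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-\n"
    )
    for key, flags in set_status_flags(sample).items():
        print(key, flags)
```

Not every "+" is an error: TransPend in DevSta, for example, only means a
transaction is pending, so the result still needs to be read by a person.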

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-28 21:29         ` Bjorn Helgaas
@ 2015-07-28 21:50           ` Duc Dang
  -1 siblings, 0 replies; 49+ messages in thread
From: Duc Dang @ 2015-07-28 21:50 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

On Tue, Jul 28, 2015 at 2:29 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote:
>> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@apm.com> wrote:
>> >> Hi Bjorn,
>> >>
>> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> >>>
>> >>> I regularly see faults like this on an APM X-Gene:
>> >>>
>> >>>   U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>> >>>   CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>> >>>        32 KB ICACHE, 32 KB DCACHE
>> >>>        SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>> >>>   ...
>> >>>   Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>> >>>   Internal error: : 96000010 [#1] SMP
>> >>>   Modules linked in:
>> >>>   CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>> >>>   Hardware name: APM X-Gene Mustang board (DT)
>> >>>   task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>> >>>   PC is at pci_generic_config_read32+0x4c/0xb8
>> >>>   LR is at pci_generic_config_read32+0x40/0xb8
>> >>>   pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>> >>>   ...
>> >>>   Call trace:
>> >>>   [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>> >>>   [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>> >>>   [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>> >>>   [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>> >>>   [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>> >>>   [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>> >>>   [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>> >>>   [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>> >>
>> >> The log shows kernel gets an exception when trying to access Mellanox
>> >> card configuration space. This is usually due to suboptimal PCIe
>> >> SerDes parameters are using in your board, which will cause bad link
>> >> quality.
>> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
>> >> U-Boot upgrade to our latest X-Gene U-Boot release.
>> >
>> > I installed U-Boot 1.15.12, which I thought was the latest.  I'm still
>> > seeing this issue regularly, approx once/hour.
>>
>> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
>> version to use. Are you running any PCIe traffic test when the error
>> happens?
>
> Nope, the machine was either idle or running a reboot test; no PCIe stress
> test or anything.
>
>> And it will be useful if you can share your "lspci -vvv" output when
>> the board is running, we can check to see if there is any error status
>> reported.
>
> Here's some lspci output and info about the firmware I'm running.
> Obviously this lspci output was collected before a crash.  I have also
> seen lspci output where "CESta: RxErr+" was set on the 00:00.0 Root Port.
>
> U-Boot 2013.04-mustang_sw_1.15.12 (May 20 2015 - 10:03:33)
>
> CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>      32 KB ICACHE, 32 KB DCACHE
>      SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> Boot from SPI-NOR
> Slimpro FW:
>         Ver: 2.4 (build 01.15.12.00 2015/05/20)
>         PMD: 970 mV
>         SOC: 950 mV
> Board: Mustang - AppliedMicro APM883208-xNA24SPT Reference Board
> I2C:   ready
> DRAM:  ECC 32 GiB @ 1600MHz
> SF: Detected N25Q256 with page size 256 Bytes, total 32 MiB
> MMC:   X-Gene SD/SDIO/eMMC: 0
> PCIE0: (RC) X8 GEN-3 link up
>   00:00.0     - 10e8:e004 - Bridge device
>    01:00.0    - 15b3:1007 - Network controller
>
> # lspci -vvv
> 00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04) (prog-if 00 [Normal decode])
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0
>         Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
>         I/O behind bridge: 0000f000-00000fff
>         Memory behind bridge: 80000000-82ffffff
>         Prefetchable memory behind bridge: 0000000083000000-00000000830fffff
>         Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
>         BridgeCtl: Parity- SERR- NoISA- VGA- MAbort- >Reset- FastB2B-
>                 PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
>         Capabilities: [40] Express (v2) Root Port (Slot+), MSI 00
>                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
>                         ExtTag- RBE+ FLReset-
>                 DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
>                         RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
>                         MaxPayload 256 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr+ UncorrErr- FatalErr+ UnsuppReq- AuxPwr- TransPend+
>                 LnkCap: Port #0, Speed unknown, Width x8, ASPM L0s L1, Latency L0 unlimited, L1 unlimited
>                         ClockPM- Surprise+ LLActRep+ BwNot+
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive+ BWMgmt+ ABWMgmt-
>                 SltCap: AttnBtn- PwrCtrl- MRL- AttnInd- PwrInd- HotPlug- Surprise-
>                         Slot #1, PowerLimit 10.000W; Interlock- NoCompl-
>                 SltCtl: Enable: AttnBtn- PwrFlt- MRL- PresDet- CmdCplt- HPIrq- LinkChg-
>                         Control: AttnInd Off, PwrInd Off, Power- Interlock-
>                 SltSta: Status: AttnBtn- PowerFlt- MRL- CmdCplt- PresDet- Interlock-
>                         Changed: MRL- PresDet- LinkState+
>                 RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna- CRSVisible-
>                 RootCap: CRSVisible-
>                 RootSta: PME ReqID 0000, PMEStatus- PMEPending-
>                 DevCap2: Completion Timeout: Not Supported, TimeoutDis+ ARIFwd-
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- ARIFwd-
>                 LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB

A Target Link Speed of "Unknown" is really strange. I also see the same
"Speed unknown" for the Mellanox card below.

>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>                          Compliance De-emphasis: -6dB
>                 LnkSta2: Current De-emphasis Level: -6dB
>         Capabilities: [80] Power Management version 3
>                 Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [100 v1] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>                 CEMsk:  RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>                 AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
>         Capabilities: [180 v1] #19
>         Capabilities: [150 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
>         Kernel driver in use: pcieport
>
> 01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
>         Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-

Mem and BusMaster are disabled. So this card is not functional?

>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Interrupt: pin A routed to IRQ 226
>         Region 0: [virtual] Memory at e182000000 (32-bit, non-prefetchable) [size=1M]
>         Region 2: [virtual] Memory at e180000000 (32-bit, non-prefetchable) [size=32M]
>         [virtual] Expansion ROM at e183000000 [disabled] [size=1M]
>         Capabilities: [40] Power Management version 3
>                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>         Capabilities: [9c] MSI-X: Enable- Count=64 Masked-

This may be unrelated, but MSI allocation fails for this card somehow.

>                 Vector table: BAR=0 offset=0007c000
>                 PBA: BAR=0 offset=0007d000
>         Capabilities: [60] Express (v2) Endpoint, MSI 00
>                 DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
>                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
>                         RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
>                 LnkCap: Port #8, Speed unknown, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
>                         ClockPM- Surprise- LLActRep- BwNot-
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk-
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed unknown, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
>                 LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>                          Compliance De-emphasis: -6dB
>                 LnkSta2: Current De-emphasis Level: -6dB
>         Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
>                 ARICap: MFVC- ACS-, Next Function: 0
>                 ARICtl: MFVC- ACS-, Function Group: 0
>         Capabilities: [148 v1] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx

The serial number here seems invalid. I have a Mellanox card of a
different model (ConnectX-3 15b3:1003) that shows a meaningful serial
number:
Capabilities: [148 v1] Device Serial Number f4-52-14-03-00-0b-c2-30.

Do you have another PCIe card to try on the same reboot test on this board?

>         Capabilities: [154 v2] Advanced Error Reporting
>                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
>                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>                 AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
>         Capabilities: [18c v1] #19
>         Kernel modules: mlx4_core

-- 
Regards,
Duc Dang.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-28 21:50           ` Duc Dang
@ 2015-07-29  1:22             ` Bjorn Helgaas
  -1 siblings, 0 replies; 49+ messages in thread
From: Bjorn Helgaas @ 2015-07-29  1:22 UTC (permalink / raw)
  To: Duc Dang; +Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
> On Tue, Jul 28, 2015 at 2:29 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> > On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote:
> >> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> >> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@apm.com> wrote:
> >> >> Hi Bjorn,
> >> >>
> >> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> >> >>>
> >> >>> I regularly see faults like this on an APM X-Gene:
> >> >>>
> >> >>>   U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> >> >>>   CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> >> >>>        32 KB ICACHE, 32 KB DCACHE
> >> >>>        SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> >> >>>   ...
> >> >>>   Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
> >> >>>   Internal error: : 96000010 [#1] SMP
> >> >>>   Modules linked in:
> >> >>>   CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
> >> >>>   Hardware name: APM X-Gene Mustang board (DT)
> >> >>>   task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
> >> >>>   PC is at pci_generic_config_read32+0x4c/0xb8
> >> >>>   LR is at pci_generic_config_read32+0x40/0xb8
> >> >>>   pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
> >> >>>   ...
> >> >>>   Call trace:
> >> >>>   [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
> >> >>>   [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
> >> >>>   [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
> >> >>>   [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
> >> >>>   [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
> >> >>>   [<ffffffc0001c361c>] __vfs_read+0x44/0x128
> >> >>>   [<ffffffc0001c3e28>] vfs_read+0x84/0x144
> >> >>>   [<ffffffc0001c4764>] SyS_read+0x50/0xb0
> >> >>
> >> >> The log shows kernel gets an exception when trying to access Mellanox
> >> >> card configuration space. This is usually due to suboptimal PCIe
> >> >> SerDes parameters are using in your board, which will cause bad link
> >> >> quality.
> >> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
> >> >> U-Boot upgrade to our latest X-Gene U-Boot release.
> >> >
> >> > I installed U-Boot 1.15.12, which I thought was the latest.  I'm still
> >> > seeing this issue regularly, approx once/hour.
> >>
> >> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
> >> version to use. Are you running any PCIe traffic test when the error
> >> happens?
> >
> > Nope, the machine was either idle or running a reboot test; no PCIe stress
> > test or anything.
> >
> >> And it will be useful if you can share your "lspci -vvv" output when
> >> the board is running, we can check to see if there is any error status
> >> reported.
> >
> > Here's some lspci output and info about the firmware I'm running.
> > Obviously this lspci output was collected before a crash.  I have also
> > seen lspci output where "CESta: RxErr+" was set on the 00:00.0 Root Port.
> >
> > U-Boot 2013.04-mustang_sw_1.15.12 (May 20 2015 - 10:03:33)
> >
> > CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> >      32 KB ICACHE, 32 KB DCACHE
> >      SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> > Boot from SPI-NOR
> > Slimpro FW:
> >         Ver: 2.4 (build 01.15.12.00 2015/05/20)
> >         PMD: 970 mV
> >         SOC: 950 mV
> > Board: Mustang - AppliedMicro APM883208-xNA24SPT Reference Board
> > I2C:   ready
> > DRAM:  ECC 32 GiB @ 1600MHz
> > SF: Detected N25Q256 with page size 256 Bytes, total 32 MiB
> > MMC:   X-Gene SD/SDIO/eMMC: 0
> > PCIE0: (RC) X8 GEN-3 link up
> >   00:00.0     - 10e8:e004 - Bridge device
> >    01:00.0    - 15b3:1007 - Network controller
> >
> > # lspci -vvv
> > 00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04) (prog-if 00 [Normal decode])

> >                 LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB
> 
> Target Link Speed unknown is really strange. I also saw the same "Link
> speed unknown" for Mellanox card below.

I think this is because I have a really old lspci.  Here's the -xxx output:

    00: e8 10 04 e0 07 00 10 00 04 00 04 06 00 00 01 00
    10: 00 00 00 00 00 00 00 00 00 01 01 00 f1 01 00 00
    20: 00 80 f0 82 01 83 01 83 00 00 00 00 00 00 00 00
    30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 00 00
    40: 10 80 42 01 02 8f 00 00 36 28 21 00 83 fc 7b 00
    50: 40 00 83 70 00 05 08 00 c0 03 00 01 00 00 01 00
    60: 00 00 00 00 10 00 00 00 00 00 00 00 0e 01 00 00
    70: 43 00 1e 00 00 00 00 00 00 00 00 00 00 00 00 00
    80: 01 00 03 06 08 00 00 00 00 00 00 00 00 00 00 00
    90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

LnkCtl2 is at offset 0x30 in the PCIe capability, which starts at 0x40,
so LnkCtl2 = 0x0043.  I think that means Target Link Speed is 0x3, or
"Supported Link Speeds Vector field bit 2".  The Supported Link Speeds
Vector in LnkCap2 (which isn't decoded even by current upstream lspci)
is 0x7, so 2.5GT/s, 5.0GT/s, and 8.0GT/s are all supported, with bit 2
being 8.0GT/s.  So I think a modern lspci would show "8.0GT/s".
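
The decoding above can be reproduced mechanically from the dump.  The
following is only a sketch: it hard-codes the PCIe capability offset (0x40)
and the four dump rows that cover the capability, and the helper names
(parse_dump, read16, read32) are made up for illustration.  The register
offsets and bit fields are the ones given in the PCIe spec (LnkCap2 at
capability offset 0x2c, LnkCtl2 at 0x30).

```python
# Rows 0x40-0x7f of the "lspci -xxx" dump for 00:00.0, copied from above.
DUMP = """\
40: 10 80 42 01 02 8f 00 00 36 28 21 00 83 fc 7b 00
50: 40 00 83 70 00 05 08 00 c0 03 00 01 00 00 01 00
60: 00 00 00 00 10 00 00 00 00 00 00 00 0e 01 00 00
70: 43 00 1e 00 00 00 00 00 00 00 00 00 00 00 00 00
"""

PCIE_CAP = 0x40          # start of the PCIe capability in this dump
PCI_EXP_LNKCAP2 = 0x2c   # Link Capabilities 2, offset within the capability
PCI_EXP_LNKCTL2 = 0x30   # Link Control 2, offset within the capability

# Supported Link Speeds Vector: bit 0, bit 1, bit 2
SPEEDS = ["2.5GT/s", "5.0GT/s", "8.0GT/s"]


def parse_dump(text):
    """Turn 'addr: b0 b1 ...' rows into a {config offset: byte} map."""
    cfg = {}
    for line in text.strip().splitlines():
        addr, data = line.split(":")
        base = int(addr, 16)
        for i, byte in enumerate(data.split()):
            cfg[base + i] = int(byte, 16)
    return cfg


def read16(cfg, off):
    """Little-endian 16-bit read from the parsed dump."""
    return cfg[off] | (cfg[off + 1] << 8)


def read32(cfg, off):
    """Little-endian 32-bit read from the parsed dump."""
    return read16(cfg, off) | (read16(cfg, off + 2) << 16)


cfg = parse_dump(DUMP)

lnkctl2 = read16(cfg, PCIE_CAP + PCI_EXP_LNKCTL2)
target = lnkctl2 & 0xf               # Target Link Speed, bits 3:0

lnkcap2 = read32(cfg, PCIE_CAP + PCI_EXP_LNKCAP2)
vector = (lnkcap2 >> 1) & 0x7f       # Supported Link Speeds Vector, bits 7:1

# Keep only the speeds whose vector bit is set.
supported = [s for bit, s in enumerate(SPEEDS) if vector & (1 << bit)]

# A Target Link Speed of N refers to vector bit N-1.
target_speed = SPEEDS[target - 1]

print("LnkCtl2 = %#06x, Target Link Speed = %#x (%s)"
      % (lnkctl2, target, target_speed))
print("LnkCap2 = %#010x, supported speeds: %s"
      % (lnkcap2, ", ".join(supported)))
```

The sketch only knows the three speeds discussed here; a Target Link Speed
value outside 1..3 would need a longer table.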

> > 01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
> >         Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> 
> Mem and BusMaster are disabled. So this card is not functional?

I don't know whether it's functional; I haven't tried to use it yet.

I typically don't even load the mlx4 driver, so most of the failures I'm
seeing are when the driver isn't loaded.  User-space code is doing config
reads via /sys.
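
For reference, this is roughly the kind of access I mean, as a minimal
sketch rather than the actual code running here.  The call trace has
pci_user_read_config_byte under pci_read_config, which is consistent with a
one-byte read of a device's sysfs "config" file.  The function name and the
device address in the usage note are placeholders.

```python
def read_config_byte(config_path, offset):
    """Read one byte of PCI config space through a sysfs 'config' file.

    config_path is assumed to be something like
    /sys/bus/pci/devices/<domain:bus:dev.fn>/config.
    """
    with open(config_path, "rb") as f:
        f.seek(offset)
        data = f.read(1)
    if len(data) != 1:
        raise OSError("short read at offset %#x of %s" % (offset, config_path))
    return data[0]
```

For example, read_config_byte("/sys/bus/pci/devices/0000:01:00.0/config", 0)
would return the low byte of the Vendor ID, assuming the card sits in domain
0000 at 01:00.0.  Without root, only the first 64 bytes of config space are
readable this way.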

> >         Capabilities: [148 v1] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx
> 
> The serial number here seems invalid. I have a Mellanox card but
> different model (ConnectX-3 15b3:1003) that shows meaningful serial
> number:
> Capabilities: [148 v1] Device Serial Number f4-52-14-03-00-0b-c2-30.

My fault, lspci actually showed a meaningful serial number; I removed
it in a misguided attempt to avoid exposing anything proprietary.

> Do you have another PCIe card to try on the same reboot test on this board?

I've seen this on at least two Mellanox cards.  I'm running similar tests
on a different type of card now.

Bjorn

^ permalink raw reply	[flat|nested] 49+ messages in thread

* X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
@ 2015-07-29  1:22             ` Bjorn Helgaas
  0 siblings, 0 replies; 49+ messages in thread
From: Bjorn Helgaas @ 2015-07-29  1:22 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
> On Tue, Jul 28, 2015 at 2:29 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> > On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote:
> >> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> >> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@apm.com> wrote:
> >> >> Hi Bjorn,
> >> >>
> >> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> >> >>>
> >> >>> I regularly see faults like this on an APM X-Gene:
> >> >>>
> >> >>>   U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> >> >>>   CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> >> >>>        32 KB ICACHE, 32 KB DCACHE
> >> >>>        SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> >> >>>   ...
> >> >>>   Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
> >> >>>   Internal error: : 96000010 [#1] SMP
> >> >>>   Modules linked in:
> >> >>>   CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
> >> >>>   Hardware name: APM X-Gene Mustang board (DT)
> >> >>>   task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
> >> >>>   PC is at pci_generic_config_read32+0x4c/0xb8
> >> >>>   LR is at pci_generic_config_read32+0x40/0xb8
> >> >>>   pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
> >> >>>   ...
> >> >>>   Call trace:
> >> >>>   [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
> >> >>>   [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
> >> >>>   [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
> >> >>>   [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
> >> >>>   [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
> >> >>>   [<ffffffc0001c361c>] __vfs_read+0x44/0x128
> >> >>>   [<ffffffc0001c3e28>] vfs_read+0x84/0x144
> >> >>>   [<ffffffc0001c4764>] SyS_read+0x50/0xb0
> >> >>
> >> >> The log shows kernel gets an exception when trying to access Mellanox
> >> >> card configuration space. This is usually due to suboptimal PCIe
> >> >> SerDes parameters are using in your board, which will cause bad link
> >> >> quality.
> >> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
> >> >> U-Boot upgrade to our latest X-Gene U-Boot release.
> >> >
> >> > I installed U-Boot 1.15.12, which I thought was the latest.  I'm still
> >> > seeing this issue regularly, approx once/hour.
> >>
> >> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
> >> version to use. Are you running any PCIe traffic test when the error
> >> happens?
> >
> > Nope, the machine was either idle or running a reboot test; no PCIe stress
> > test or anything.
> >
> >> And it will be useful if you can share your "lspci -vvv" output when
> >> the board is running, we can check to see if there is any error status
> >> reported.
> >
> > Here's some lspci output and info about the firmware I'm running.
> > Obviously this lspci output was collected before a crash.  I have also
> > seen lspci output where "CESta: RxErr+" was set on the 00:00.0 Root Port.
> >
> > U-Boot 2013.04-mustang_sw_1.15.12 (May 20 2015 - 10:03:33)
> >
> > CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> >      32 KB ICACHE, 32 KB DCACHE
> >      SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> > Boot from SPI-NOR
> > Slimpro FW:
> >         Ver: 2.4 (build 01.15.12.00 2015/05/20)
> >         PMD: 970 mV
> >         SOC: 950 mV
> > Board: Mustang - AppliedMicro APM883208-xNA24SPT Reference Board
> > I2C:   ready
> > DRAM:  ECC 32 GiB @ 1600MHz
> > SF: Detected N25Q256 with page size 256 Bytes, total 32 MiB
> > MMC:   X-Gene SD/SDIO/eMMC: 0
> > PCIE0: (RC) X8 GEN-3 link up
> >   00:00.0     - 10e8:e004 - Bridge device
> >    01:00.0    - 15b3:1007 - Network controller
> >
> > # lspci -vvv
> > 00:00.0 PCI bridge: Applied Micro Circuits Corp. Device e004 (rev 04) (prog-if 00 [Normal decode])

> >                 LnkCtl2: Target Link Speed: Unknown, EnterCompliance- SpeedDis-, Selectable De-emphasis: -3.5dB
> 
> Target Link Speed unknown is really strange. I also saw the same "Link
> speed unknown" for Mellanox card below.

I think this is because I have a really old lspci.  Here's the -xxx output:

    00: e8 10 04 e0 07 00 10 00 04 00 04 06 00 00 01 00
    10: 00 00 00 00 00 00 00 00 00 01 01 00 f1 01 00 00
    20: 00 80 f0 82 01 83 01 83 00 00 00 00 00 00 00 00
    30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 00 00
    40: 10 80 42 01 02 8f 00 00 36 28 21 00 83 fc 7b 00
    50: 40 00 83 70 00 05 08 00 c0 03 00 01 00 00 01 00
    60: 00 00 00 00 10 00 00 00 00 00 00 00 0e 01 00 00
    70: 43 00 1e 00 00 00 00 00 00 00 00 00 00 00 00 00
    80: 01 00 03 06 08 00 00 00 00 00 00 00 00 00 00 00
    90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

LnkCtl2 is at offset 0x30 in the PCIe capability, which starts at 0x40,
so LnkCtl2 = 0x0043.  I think that means Target Link Speed is 0x3, or
"Supported Link Speeds Vector field bit 2".  The Supported Link Speeds
Vector in LnkCap2 (which isn't decoded even by current upstream lspci)
is 0x7, so 2.5GT/s, 5.0GT/s, and 8.0GT/s are all supported, with bit 2
being 8.0GT/s.  So I think a modern lspci would show "8.0GT/s".
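
The decoding above can be checked mechanically.  The following is only a
sketch: it parses the hex dump quoted in this message, walks the capability
list to find the PCIe capability, and extracts the two fields discussed.
The helper names (parse_dump, find_cap, read16, read32) are made up for
this illustration; the offsets and bit positions are the ones described in
the text.

```python
# First 0x90 bytes of the 00:00.0 config space, copied from the -xxx dump
# above (everything after 0x8f in that dump is zero).
DUMP = """
00: e8 10 04 e0 07 00 10 00 04 00 04 06 00 00 01 00
10: 00 00 00 00 00 00 00 00 00 01 01 00 f1 01 00 00
20: 00 80 f0 82 01 83 01 83 00 00 00 00 00 00 00 00
30: 00 00 00 00 40 00 00 00 00 00 00 00 00 01 00 00
40: 10 80 42 01 02 8f 00 00 36 28 21 00 83 fc 7b 00
50: 40 00 83 70 00 05 08 00 c0 03 00 01 00 00 01 00
60: 00 00 00 00 10 00 00 00 00 00 00 00 0e 01 00 00
70: 43 00 1e 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 01 00 03 06 08 00 00 00 00 00 00 00 00 00 00 00
"""

PCI_CAPABILITY_LIST = 0x34  # config offset of the first capability pointer
PCI_CAP_ID_EXP = 0x10       # PCI Express capability ID
PCI_EXP_LNKCAP2 = 0x2c      # Link Capabilities 2, relative to the capability
PCI_EXP_LNKCTL2 = 0x30      # Link Control 2, relative to the capability
SPEEDS = {1: "2.5GT/s", 2: "5.0GT/s", 3: "8.0GT/s"}


def parse_dump(text):
    """Turn 'offset: xx xx ...' lines into a bytes object."""
    cfg = bytearray()
    for line in text.strip().splitlines():
        _, _, data = line.partition(":")
        cfg += bytes.fromhex(data.replace(" ", ""))
    return bytes(cfg)


def find_cap(cfg, cap_id):
    """Walk the standard capability list; return the offset or None."""
    pos = cfg[PCI_CAPABILITY_LIST] & 0xfc
    while pos:
        if cfg[pos] == cap_id:
            return pos
        pos = cfg[pos + 1] & 0xfc
    return None


def read16(cfg, off):
    return int.from_bytes(cfg[off:off + 2], "little")


def read32(cfg, off):
    return int.from_bytes(cfg[off:off + 4], "little")


cfg = parse_dump(DUMP)
exp = find_cap(cfg, PCI_CAP_ID_EXP)
lnkctl2 = read16(cfg, exp + PCI_EXP_LNKCTL2)
lnkcap2 = read32(cfg, exp + PCI_EXP_LNKCAP2)

target = lnkctl2 & 0xf           # Target Link Speed, bits 3:0
vector = (lnkcap2 >> 1) & 0x7f   # Supported Link Speeds Vector, bits 7:1
supported = [SPEEDS[b + 1] for b in range(3) if vector & (1 << b)]

print(f"PCIe capability at {exp:#x}")
print(f"LnkCtl2 = {lnkctl2:#06x}, target speed = {SPEEDS[target]}")
print(f"LnkCap2 = {lnkcap2:#010x}, supported = {supported}")
```

Run against this dump, the sketch finds the capability at 0x40, reads
LnkCtl2 as 0x0043 and reports a target speed of 8.0GT/s, which agrees with
the hand decoding.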

> > 01:00.0 Ethernet controller: Mellanox Technologies MT27520 Family
> >         Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
> 
> Mem and BusMaster are disabled. So this card is not functional?

I don't know whether it's functional; I haven't tried to use it yet.

I typically don't even load the mlx4 driver, so most of the failures I'm
seeing are when the driver isn't loaded.  User-space code is doing config
reads via /sys.
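
For anyone trying to reproduce this, a user-space config read through sysfs
can be as small as the sketch below.  It assumes the usual sysfs layout
(/sys/bus/pci/devices/<domain:bus:dev.fn>/config); the function names and
the sysfs_root parameter are invented for this example, and this is not the
actual tool running on my machines.

```python
import os


def read_config(bdf, offset, length, sysfs_root="/sys/bus/pci/devices"):
    """Read `length` bytes of PCI config space at `offset` via sysfs.

    `bdf` is the full device name, e.g. "0000:01:00.0".  `sysfs_root` is
    only a parameter so the function can be pointed at a test directory.
    """
    path = os.path.join(sysfs_root, bdf, "config")
    # Unbuffered, so exactly the requested bytes are read from the device.
    with open(path, "rb", buffering=0) as f:
        f.seek(offset)
        return f.read(length)


def read_config_word(bdf, offset, sysfs_root="/sys/bus/pci/devices"):
    """Read one little-endian 16-bit config register."""
    data = read_config(bdf, offset, 2, sysfs_root)
    return int.from_bytes(data, "little")
```

Calling read_config("0000:01:00.0", 0, 1) issues a one-byte read of the
config file, which matches the sysfs read path in the call trace from the
original report (pci_read_config -> pci_user_read_config_byte ->
pci_generic_config_read32).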

> >         Capabilities: [148 v1] Device Serial Number xx-xx-xx-xx-xx-xx-xx-xx
> 
> The serial number here seems invalid. I have a Mellanox card but
> different model (ConnectX-3 15b3:1003) that shows meaningful serial
> number:
> Capabilities: [148 v1] Device Serial Number f4-52-14-03-00-0b-c2-30.

My fault, lspci actually showed a meaningful serial number; I removed
it in a misguided attempt to avoid exposing anything proprietary.

> Do you have another PCIe card to try on the same reboot test on this board?

I've seen this on at least two Mellanox cards.  I'm running similar tests
on a different type of card now.

Bjorn

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-29  1:22             ` Bjorn Helgaas
@ 2015-07-29 15:55               ` Bjorn Helgaas
  -1 siblings, 0 replies; 49+ messages in thread
From: Bjorn Helgaas @ 2015-07-29 15:55 UTC (permalink / raw)
  To: Duc Dang; +Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:

> > Do you have another PCIe card to try on the same reboot test on this board?
> 
> I've seen this on at least two Mellanox cards.  I'm running similar tests
> on a different type of card now.

FWIW, reboot tests on two machines with Mellanox cards failed, while the
same test on a machine with a different proprietary card succeeded.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-29 15:55               ` Bjorn Helgaas
@ 2015-07-31 17:00                 ` Duc Dang
  -1 siblings, 0 replies; 49+ messages in thread
From: Duc Dang @ 2015-07-31 17:00 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>
>> > Do you have another PCIe card to try on the same reboot test on this board?
>>
>> I've seen this on at least two Mellanox cards.  I'm running similar tests
>> on a different type of card now.
>
> FWIW, reboot tests on two machines with Mellanox cards failed, while the
> same test on a machine with a different proprietary card succeeded.

Thanks, Bjorn.

I don't have the same Mellanox card as yours, but I will also run a
similar reboot test to see if I hit the same issue with my card.

-- 
Regards,
Duc Dang.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-31 17:00                 ` Duc Dang
@ 2015-08-10 16:18                   ` Bjorn Helgaas
  -1 siblings, 0 replies; 49+ messages in thread
From: Bjorn Helgaas @ 2015-08-10 16:18 UTC (permalink / raw)
  To: Duc Dang; +Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <dhdang@apm.com> wrote:
> On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>>
>>> > Do you have another PCIe card to try on the same reboot test on this board?
>>>
>>> I've seen this on at least two Mellanox cards.  I'm running similar tests
>>> on a different type of card now.
>>
>> FWIW, reboot tests on two machines with Mellanox cards failed, while the
>> same test on a machine with a different proprietary card succeeded.
>
> Thanks, Bjorn.
>
> I don't have the same Mellanox card as yours, but I will also run
> similar reboot test to see if I hit the same issue with my card.

Any more hints on this?  Nothing has changed on my end, so of course
I'm still seeing this, always on machines with Mellanox, and never on
other machines.  Could this be a hardware issue like a signal
integrity or margin issue?  I don't know where to go from here because
I'm not a hardware person, and I don't know anything to do in
software.
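
One software-visible hint of a marginal link is the Receiver Error bit in
the AER Correctable Error Status register, which is what lspci prints as
"CESta: RxErr+".  A rough sketch of checking it from a raw config-space
image follows.  The function names are made up for illustration; the
register layout (AER extended capability ID 0x0001, Correctable Error
Status at offset 0x10, Receiver Error in bit 0) is taken from the PCIe
spec.  Reading extended config space through sysfs typically needs root.

```python
import struct

PCI_EXT_CAP_ID_ERR = 0x0001      # Advanced Error Reporting
PCI_ERR_COR_STATUS = 0x10        # Correctable Error Status, relative to AER
PCI_ERR_COR_RCVR = 0x00000001    # Receiver Error Status bit


def find_ext_cap(cfg, cap_id):
    """Walk the extended capability list, which starts at offset 0x100."""
    pos = 0x100
    seen = set()                  # guard against a looping list
    while 0x100 <= pos <= len(cfg) - 4 and pos not in seen:
        seen.add(pos)
        (header,) = struct.unpack_from("<I", cfg, pos)
        if header in (0, 0xffffffff):
            return None           # no (more) extended capabilities
        if header & 0xffff == cap_id:
            return pos
        pos = (header >> 20) & 0xffc   # next capability offset
    return None


def receiver_error_logged(cfg):
    """Return True/False for RxErr, or None if AER is not visible."""
    aer = find_ext_cap(cfg, PCI_EXT_CAP_ID_ERR)
    if aer is None or aer + PCI_ERR_COR_STATUS + 4 > len(cfg):
        return None
    (status,) = struct.unpack_from("<I", cfg, aer + PCI_ERR_COR_STATUS)
    return bool(status & PCI_ERR_COR_RCVR)
```

A config image shorter than 0x104 bytes makes receiver_error_logged()
return None, so the caller can tell "bit clear" apart from "could not see
AER at all".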

Bjorn

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-08-10 16:18                   ` Bjorn Helgaas
@ 2015-08-10 17:38                     ` Catalin Marinas
  -1 siblings, 0 replies; 49+ messages in thread
From: Catalin Marinas @ 2015-08-10 17:38 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Duc Dang, linux-pci, Tanmay Inamdar, linux-arm, linux-kernel

On Mon, Aug 10, 2015 at 11:18:23AM -0500, Bjorn Helgaas wrote:
> On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <dhdang@apm.com> wrote:
> > On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> >> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
> >>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
> >>
> >>> > Do you have another PCIe card to try on the same reboot test on this board?
> >>>
> >>> I've seen this on at least two Mellanox cards.  I'm running similar tests
> >>> on a different type of card now.
> >>
> >> FWIW, reboot tests on two machines with Mellanox cards failed, while the
> >> same test on a machine with a different proprietary card succeeded.
> >
> > Thanks, Bjorn.
> >
> > I don't have the same Mellanox card as yours, but I will also run
> > similar reboot test to see if I hit the same issue with my card.
> 
> Any more hints on this?  Nothing has changed on my end, so of course
> I'm still seeing this, always on machines with Mellanox, and never on
> other machines.  Could this be a hardware issue like a signal
> integrity or margin issue?  I don't know where to go from here because
> I'm not a hardware person, and I don't know anything to do in
> software.

Silly hack below, not actually a solution (and it may not even work):

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 94d98cd1aad8..e895e96b3d13 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -369,6 +369,14 @@ static int do_bad(unsigned long addr, unsigned int esr, struct pt_regs *regs)
 	return 1;
 }
 
+/*
+ * Retry the faulty access.
+ */
+static int do_good(unsigned long addr, unsigned int esr, struct pt_regs *regs)
+{
+	return 0;
+}
+
 static struct fault_info {
 	int	(*fn)(unsigned long addr, unsigned int esr, struct pt_regs *regs);
 	int	sig;
@@ -391,7 +399,7 @@ static struct fault_info {
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 1 permission fault"	},
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 2 permission fault"	},
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 3 permission fault"	},
-	{ do_bad,		SIGBUS,  0,		"synchronous external abort"	},
+	{ do_good,		SIGBUS,  0,		"synchronous external abort"	},
 	{ do_bad,		SIGBUS,  0,		"asynchronous external abort"	},
 	{ do_bad,		SIGBUS,  0,		"unknown 18"			},
 	{ do_bad,		SIGBUS,  0,		"unknown 19"			},
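
For context on why the hack touches that particular table entry: the fault
code in the original report was 0x96000010, and as far as I can tell
arch/arm64/mm/fault.c picks the fault_info entry using the low six bits of
the ESR.  The sketch below only decodes that value; decode_esr() is a
made-up helper for illustration, not a kernel function.

```python
ESR = 0x96000010  # from the "Unhandled fault" line in the original report


def decode_esr(esr):
    """Split an ESR_ELx value into the fields relevant here."""
    return {
        "ec": (esr >> 26) & 0x3f,    # exception class, bits 31:26
        "il": (esr >> 25) & 0x1,     # instruction length bit
        "fault_index": esr & 0x3f,   # fault status code, bits 5:0
    }


fields = decode_esr(ESR)
print(f"EC = {fields['ec']:#x}, IL = {fields['il']}, "
      f"fault_info index = {fields['fault_index']}")
```

EC 0x25 is a data abort taken without a change in exception level, and
index 16 is the "synchronous external abort" entry, so the do_good handler
in the diff would be the one invoked for this fault.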

-- 
Catalin

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
       [not found]                   ` <CADaLNDkUQHzGACfFmYDeJWnaNrKmJUDx4Rby60OWr4FzOjx3rA@mail.gmail.com>
@ 2015-08-10 17:42                       ` Bjorn Helgaas
  0 siblings, 0 replies; 49+ messages in thread
From: Bjorn Helgaas @ 2015-08-10 17:42 UTC (permalink / raw)
  To: Duc Dang; +Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

On Mon, Aug 10, 2015 at 12:16 PM, Duc Dang <dhdang@apm.com> wrote:
> On Monday, August 10, 2015, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>
>> On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <dhdang@apm.com> wrote:
>> > On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <bhelgaas@google.com>
>> > wrote:
>> >> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>> >>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>> >>
>> >>> > Do you have another PCIe card to try on the same reboot test on this
>> >>> > board?
>> >>>
>> >>> I've seen this on at least two Mellanox cards.  I'm running similar
>> >>> tests
>> >>> on a different type of card now.
>> >>
>> >> FWIW, reboot tests on two machines with Mellanox cards failed, while
>> >> the
>> >> same test on a machine with a different proprietary card succeeded.
>> >
>> > Thanks, Bjorn.
>> >
>> > I don't have the same Mellanox card as yours, but I will also run
>> > similar reboot test to see if I hit the same issue with my card.
>>
>> Any more hints on this?  Nothing has changed on my end, so of course
>> I'm still seeing this, always on machines with Mellanox, and never on
>> other machines.  Could this be a hardware issue like a signal
>> integrity or margin issue?  I don't know where to go from here because
>> I'm not a hardware person, and I don't know anything to do in
>> software.
>
>
> Hi Bjorn,
>
> I tried to run similar reboot tests on 2 different Mellanox cards (Connect-X
> family, one card has 2 10G interfaces, the other one has 1 port that
> supports InfiniBand) with U-Boot 1.15.12 and linux 4.2-rc5 and I did not see
> the crash that you encountered.
>
> Did you check if your Mellanox cards have latest firmware? I did see some
> link issues on my Mellanox cards with its old firmware before.

Good idea; I'll check that, too.  Also, I just learned that these
cards are installed with an extender card because of some space issues,
so we're going to test again without the extender.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-08-10 17:42                       ` Bjorn Helgaas
@ 2015-08-10 19:07                         ` Duc Dang
  -1 siblings, 0 replies; 49+ messages in thread
From: Duc Dang @ 2015-08-10 19:07 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

On Mon, Aug 10, 2015 at 10:42 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Mon, Aug 10, 2015 at 12:16 PM, Duc Dang <dhdang@apm.com> wrote:
>> On Monday, August 10, 2015, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>>
>>> On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <dhdang@apm.com> wrote:
>>> > On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <bhelgaas@google.com>
>>> > wrote:
>>> >> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>>> >>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>>> >>
>>> >>> > Do you have another PCIe card to try on the same reboot test on this
>>> >>> > board?
>>> >>>
>>> >>> I've seen this on at least two Mellanox cards.  I'm running similar
>>> >>> tests
>>> >>> on a different type of card now.
>>> >>
>>> >> FWIW, reboot tests on two machines with Mellanox cards failed, while
>>> >> the
>>> >> same test on a machine with a different proprietary card succeeded.
>>> >
>>> > Thanks, Bjorn.
>>> >
>>> > I don't have the same Mellanox card as yours, but I will also run
>>> > similar reboot test to see if I hit the same issue with my card.
>>>
>>> Any more hints on this?  Nothing has changed on my end, so of course
>>> I'm still seeing this, always on machines with Mellanox, and never on
>>> other machines.  Could this be a hardware issue like a signal
>>> integrity or margin issue?  I don't know where to go from here because
>>> I'm not a hardware person, and I don't know anything to do in
>>> software.
>>
>>
>> Hi Bjorn,
>>
>> I tried to run similar reboot tests on 2 different Mellanox cards (Connect-X
>> family, one card has 2 10G interfaces, the other one has 1 port that
>> supports InfiniBand) with U-Boot 1.15.12 and linux 4.2-rc5 and I did not see
>> the crash that you encountered.
>>
>> Did you check if your Mellanox cards have latest firmware? I did see some
>> link issues on my Mellanox cards with its old firmware before.
>
> Good idea; I'll check that, too.  Also, I just learned that these
> cards are installed with an extender card because of some space issues,
> so we're going to test again without the extender.

Hi Bjorn,

Are the other cards that passed your test installed directly in the
on-board PCIe slot?  If yes, then this is a good data point, and it will
be useful to test the case where your Mellanox cards are installed
directly in the on-board PCIe slot.

-- 
Regards,
Duc Dang.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-08-10 19:07                         ` Duc Dang
@ 2015-08-11 19:28                           ` Bjorn Helgaas
  -1 siblings, 0 replies; 49+ messages in thread
From: Bjorn Helgaas @ 2015-08-11 19:28 UTC (permalink / raw)
  To: Duc Dang; +Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

On Mon, Aug 10, 2015 at 2:07 PM, Duc Dang <dhdang@apm.com> wrote:
> On Mon, Aug 10, 2015 at 10:42 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> On Mon, Aug 10, 2015 at 12:16 PM, Duc Dang <dhdang@apm.com> wrote:
>>> On Monday, August 10, 2015, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>>>
>>>> On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <dhdang@apm.com> wrote:
>>>> > On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <bhelgaas@google.com>
>>>> > wrote:
>>>> >> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>>>> >>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>>>> >>
>>>> >>> > Do you have another PCIe card to try on the same reboot test on this
>>>> >>> > board?
>>>> >>>
>>>> >>> I've seen this on at least two Mellanox cards.  I'm running similar
>>>> >>> tests
>>>> >>> on a different type of card now.
>>>> >>
>>>> >> FWIW, reboot tests on two machines with Mellanox cards failed, while
>>>> >> the
>>>> >> same test on a machine with a different proprietary card succeeded.
>>>> >
>>>> > Thanks, Bjorn.
>>>> >
>>>> > I don't have the same Mellanox card as yours, but I will also run
>>>> > similar reboot test to see if I hit the same issue with my card.
>>>>
>>>> Any more hints on this?  Nothing has changed on my end, so of course
>>>> I'm still seeing this, always on machines with Mellanox, and never on
>>>> other machines.  Could this be a hardware issue like a signal
>>>> integrity or margin issue?  I don't know where to go from here because
>>>> I'm not a hardware person, and I don't know anything to do in
>>>> software.
>>>
>>>
>>> Hi Bjorn,
>>>
>>> I tried to run similar reboot tests on 2 different Mellanox cards (Connect-X
>>> family, one card has 2 10G interfaces, the other one has 1 port that
>>> supports InfiniBand) with U-Boot 1.15.12 and linux 4.2-rc5 and I did not see
>>> the crash that you encountered.
>>>
>>> Did you check if your Mellanox cards have latest firmware? I did see some
>>> link issues on my Mellanox cards with its old firmware before.
>>
>> Good idea; I'll check that, too.  Also, I just learned that these
>> cards are installed with an extender card because of some space issues,
>> so we're going to test again without the extender.
>
> Hi Bjorn,
>
> Are other cards that passed your test installed directly to the
> on-board PCIe slot?
> If yes, then this is a good data point and it will be useful to test
> the case where
> your Mellanox cards are directly installed into the on-board PCIe slot.

The cards that passed the test were installed directly, with no
extender.  We removed the extender from one of the machines with the
Mellanox card and have not seen this issue since then.  I think it's
very likely that the problem is related to using the extender.

Bjorn

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-08-11 19:28                           ` Bjorn Helgaas
@ 2015-09-05 20:13                             ` Jon Masters
  -1 siblings, 0 replies; 49+ messages in thread
From: Jon Masters @ 2015-09-05 20:13 UTC (permalink / raw)
  To: Bjorn Helgaas, Duc Dang
  Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

On 08/11/2015 03:28 PM, Bjorn Helgaas wrote:
> On Mon, Aug 10, 2015 at 2:07 PM, Duc Dang <dhdang@apm.com> wrote:
>> On Mon, Aug 10, 2015 at 10:42 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>> On Mon, Aug 10, 2015 at 12:16 PM, Duc Dang <dhdang@apm.com> wrote:
>>>> On Monday, August 10, 2015, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>>>>
>>>>> On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <dhdang@apm.com> wrote:
>>>>>> On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <bhelgaas@google.com>
>>>>>> wrote:
>>>>>>> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>>>>>>>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>>>>>>>
>>>>>>>>> Do you have another PCIe card to try on the same reboot test on this
>>>>>>>>> board?
>>>>>>>>
>>>>>>>> I've seen this on at least two Mellanox cards.  I'm running similar
>>>>>>>> tests
>>>>>>>> on a different type of card now.
>>>>>>>
>>>>>>> FWIW, reboot tests on two machines with Mellanox cards failed, while
>>>>>>> the
>>>>>>> same test on a machine with a different proprietary card succeeded.
>>>>>>
>>>>>> Thanks, Bjorn.
>>>>>>
>>>>>> I don't have the same Mellanox card as yours, but I will also run
>>>>>> similar reboot test to see if I hit the same issue with my card.
>>>>>
>>>>> Any more hints on this?  Nothing has changed on my end, so of course
>>>>> I'm still seeing this, always on machines with Mellanox, and never on
>>>>> other machines.  Could this be a hardware issue like a signal
>>>>> integrity or margin issue?  I don't know where to go from here because
>>>>> I'm not a hardware person, and I don't know anything to do in
>>>>> software.
>>>>
>>>>
>>>> Hi Bjorn,
>>>>
>>>> I tried to run similar reboot tests on 2 different Mellanox cards (Connect-X
>>>> family, one card has 2 10G interfaces, the other one has 1 port that
>>>> supports InfiniBand) with U-Boot 1.15.12 and linux 4.2-rc5 and I did not see
>>>> the crash that you encountered.
>>>>
>>>> Did you check if your Mellanox cards have latest firmware? I did see some
>>>> link issues on my Mellanox cards with its old firmware before.
>>>
>>> Good idea; I'll check that, too.  Also, I just learned that these
>>> cards are installed with an extender card because of some space issues,
>>> so we're going to test again without the extender.
>>
>> Hi Bjorn,
>>
>> Are other cards that passed your test installed directly to the
>> on-board PCIe slot?
>> If yes, then this is a good data point and it will be useful to test
>> the case where
>> your Mellanox cards are directly installed into the on-board PCIe slot.
> 
> The cards that passed the test were installed directly, with no
> extender.  We removed the extender from one of the machines with the
> Mellanox card and have not seen this issue since then.  I think it's
> very likely that the problem is related to using the extender.

If you're trying to use Mellanox cards in (for example) an APM
Mustang-like system with a PCIe extender card (for example, a 90-degree angle
adjustment for a low profile server case), you might want to ping me
offline. I have procured a number of these over the past couple of years
for my home lab and have found one that works (almost) reliably on that
particular hardware platform and does 10G in my home lab.

Jon.



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-09-05 20:13                             ` Jon Masters
@ 2015-09-05 20:22                               ` Jon Masters
  -1 siblings, 0 replies; 49+ messages in thread
From: Jon Masters @ 2015-09-05 20:22 UTC (permalink / raw)
  To: Bjorn Helgaas, Duc Dang
  Cc: Tanmay Inamdar, linux-pci, linux-arm, linux-kernel

On 09/05/2015 04:13 PM, Jon Masters wrote:
> On 08/11/2015 03:28 PM, Bjorn Helgaas wrote:
>> On Mon, Aug 10, 2015 at 2:07 PM, Duc Dang <dhdang@apm.com> wrote:
>>> On Mon, Aug 10, 2015 at 10:42 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>>> On Mon, Aug 10, 2015 at 12:16 PM, Duc Dang <dhdang@apm.com> wrote:
>>>>> On Monday, August 10, 2015, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>>>>>
>>>>>> On Fri, Jul 31, 2015 at 12:00 PM, Duc Dang <dhdang@apm.com> wrote:
>>>>>>> On Wed, Jul 29, 2015 at 8:55 AM, Bjorn Helgaas <bhelgaas@google.com>
>>>>>>> wrote:
>>>>>>>> On Tue, Jul 28, 2015 at 08:22:55PM -0500, Bjorn Helgaas wrote:
>>>>>>>>> On Tue, Jul 28, 2015 at 02:50:39PM -0700, Duc Dang wrote:
>>>>>>>>
>>>>>>>>>> Do you have another PCIe card to try on the same reboot test on this
>>>>>>>>>> board?
>>>>>>>>>
>>>>>>>>> I've seen this on at least two Mellanox cards.  I'm running similar
>>>>>>>>> tests
>>>>>>>>> on a different type of card now.
>>>>>>>>
>>>>>>>> FWIW, reboot tests on two machines with Mellanox cards failed, while
>>>>>>>> the
>>>>>>>> same test on a machine with a different proprietary card succeeded.
>>>>>>>
>>>>>>> Thanks, Bjorn.
>>>>>>>
>>>>>>> I don't have the same Mellanox card as yours, but I will also run
>>>>>>> similar reboot test to see if I hit the same issue with my card.
>>>>>>
>>>>>> Any more hints on this?  Nothing has changed on my end, so of course
>>>>>> I'm still seeing this, always on machines with Mellanox, and never on
>>>>>> other machines.  Could this be a hardware issue like a signal
>>>>>> integrity or margin issue?  I don't know where to go from here because
>>>>>> I'm not a hardware person, and I don't know anything to do in
>>>>>> software.
>>>>>
>>>>>
>>>>> Hi Bjorn,
>>>>>
>>>>> I tried to run similar reboot tests on 2 different Mellanox cards (Connect-X
>>>>> family, one card has 2 10G interfaces, the other one has 1 port that
>>>>> supports InfiniBand) with U-Boot 1.15.12 and linux 4.2-rc5 and I did not see
>>>>> the crash that you encountered.
>>>>>
>>>>> Did you check if your Mellanox cards have latest firmware? I did see some
>>>>> link issues on my Mellanox cards with its old firmware before.
>>>>
>>>> Good idea; I'll check that, too.  Also, I just learned that these
>>>> cards are installed with an extender card because of some space issues,
>>>> so we're going to test again without the extender.
>>>
>>> Hi Bjorn,
>>>
>>> Are other cards that passed your test installed directly to the
>>> on-board PCIe slot?
>>> If yes, then this is a good data point and it will be useful to test
>>> the case where
>>> your Mellanox cards are directly installed into the on-board PCIe slot.
>>
>> The cards that passed the test were installed directly, with no
>> extender.  We removed the extender from one of the machines with the
>> Mellanox card and have not seen this issue since then.  I think it's
>> very likely that the problem is related to using the extender.
> 
> If you're trying to use Mellanox cards in (for example) an APM
> Mustang-like system with a PCIe extender card (for example, a 90-degree angle
> adjustment for a low profile server case), you might want to ping me
> offline. I have procured a number of these over the past couple of years
> for my home lab and have found one that works (almost) reliably on that
> particular hardware platform and does 10G in my home lab.

Traveling for the holiday, but I guess it doesn't need to be a secret. I
think I have found some success with this one (but I have ordered many
different ones over the past year so will confirm next week):

http://www.amazon.com/gp/product/B00H8VVD00?psc=1&redirect=true&ref_=oh_aui_search_detailpage

Specifically, the fixed angle adapter brackets generally DO NOT work.

Jon.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2015-07-28 21:29         ` Bjorn Helgaas
@ 2016-04-13  9:58           ` Sudeep Holla
  -1 siblings, 0 replies; 49+ messages in thread
From: Sudeep Holla @ 2016-04-13  9:58 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Duc Dang, Tanmay Inamdar, linux-pci, linux-arm, linux-kernel,
	Sudeep Holla

Hi,

(sorry for replying on the old thread, but I found it could be related
to the issue I have now)

On Tue, Jul 28, 2015 at 10:29 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote:
>> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@apm.com> wrote:
>> >> Hi Bjorn,
>> >>
>> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> >>>
>> >>> I regularly see faults like this on an APM X-Gene:
>> >>>
>> >>>   U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
>> >>>   CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
>> >>>        32 KB ICACHE, 32 KB DCACHE
>> >>>        SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
>> >>>   ...
>> >>>   Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
>> >>>   Internal error: : 96000010 [#1] SMP
>> >>>   Modules linked in:
>> >>>   CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
>> >>>   Hardware name: APM X-Gene Mustang board (DT)
>> >>>   task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
>> >>>   PC is at pci_generic_config_read32+0x4c/0xb8
>> >>>   LR is at pci_generic_config_read32+0x40/0xb8
>> >>>   pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
>> >>>   ...
>> >>>   Call trace:
>> >>>   [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
>> >>>   [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
>> >>>   [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
>> >>>   [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
>> >>>   [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
>> >>>   [<ffffffc0001c361c>] __vfs_read+0x44/0x128
>> >>>   [<ffffffc0001c3e28>] vfs_read+0x84/0x144
>> >>>   [<ffffffc0001c4764>] SyS_read+0x50/0xb0
>> >>
>> >> The log shows the kernel gets an exception when trying to access the
>> >> Mellanox card's configuration space. This is usually due to suboptimal
>> >> PCIe SerDes parameters being used on your board, which will cause bad
>> >> link quality.
>> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
>> >> U-Boot upgrade to our latest X-Gene U-Boot release.
>> >
>> > I installed U-Boot 1.15.12, which I thought was the latest.  I'm still
>> > seeing this issue regularly, approx once/hour.
>>
>> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
>> version to use. Are you running any PCIe traffic test when the error
>> happens?
>
> Nope, the machine was either idle or running a reboot test; no PCIe stress
> test or anything.
>

Was there any conclusion on this?
I am having a similar issue [1] on my Juno with the sky2 PCIe driver during reboot.

Regards,
Sudeep

[1] http://marc.info/?l=linux-netdev&m=146046999701956&w=2

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2016-04-13  9:58           ` Sudeep Holla
@ 2016-04-13 13:21             ` Bjorn Helgaas
  -1 siblings, 0 replies; 49+ messages in thread
From: Bjorn Helgaas @ 2016-04-13 13:21 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Bjorn Helgaas, Duc Dang, Tanmay Inamdar, linux-pci, linux-arm,
	linux-kernel

On Wed, Apr 13, 2016 at 10:58:18AM +0100, Sudeep Holla wrote:
> Hi,
> 
> (sorry for replying on the old thread, but I found it could be related
> to the issue I have now)
> 
> On Tue, Jul 28, 2015 at 10:29 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> > On Tue, Jul 28, 2015 at 10:45:26AM -0700, Duc Dang wrote:
> >> On Tue, Jul 28, 2015 at 9:43 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> >> > On Fri, Jul 24, 2015 at 7:05 PM, Duc Dang <dhdang@apm.com> wrote:
> >> >> Hi Bjorn,
> >> >>
> >> >> On Fri, Jul 24, 2015 at 3:42 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> >> >>>
> >> >>> I regularly see faults like this on an APM X-Gene:
> >> >>>
> >> >>>   U-Boot 2013.04-mustang_sw_1.14.14 (Dec 16 2014 - 15:59:33)
> >> >>>   CPU0: APM ARM 64-bit Potenza Rev B0 2400MHz PCP 2400MHz
> >> >>>        32 KB ICACHE, 32 KB DCACHE
> >> >>>        SOC 2000MHz IOBAXI 400MHz AXI 250MHz AHB 200MHz GFC 125MHz
> >> >>>   ...
> >> >>>   Unhandled fault: synchronous external abort (0x96000010) at 0xffffff8000110034
> >> >>>   Internal error: : 96000010 [#1] SMP
> >> >>>   Modules linked in:
> >> >>>   CPU: 0 PID: 3723 Comm: ... 4.1.0-smp-DEV #3
> >> >>>   Hardware name: APM X-Gene Mustang board (DT)
> >> >>>   task: ffffffc7dc1a4140 ti: ffffffc7dc118000 task.ti: ffffffc7dc118000
> >> >>>   PC is at pci_generic_config_read32+0x4c/0xb8
> >> >>>   LR is at pci_generic_config_read32+0x40/0xb8
> >> >>>   pc : [<ffffffc00033b90c>] lr : [<ffffffc00033b900>] pstate: 600001c5
> >> >>>   ...
> >> >>>   Call trace:
> >> >>>   [<ffffffc00033b90c>] pci_generic_config_read32+0x4c/0xb8
> >> >>>   [<ffffffc00033bf58>] pci_user_read_config_byte+0x60/0xc4
> >> >>>   [<ffffffc0003496a8>] pci_read_config+0x15c/0x238
> >> >>>   [<ffffffc0002393b4>] sysfs_kf_bin_read+0x68/0xa0
> >> >>>   [<ffffffc00023896c>] kernfs_fop_read+0x9c/0x1ac
> >> >>>   [<ffffffc0001c361c>] __vfs_read+0x44/0x128
> >> >>>   [<ffffffc0001c3e28>] vfs_read+0x84/0x144
> >> >>>   [<ffffffc0001c4764>] SyS_read+0x50/0xb0
> >> >>
> >> >> The log shows the kernel gets an exception when trying to access the
> >> >> Mellanox card's configuration space. This is usually due to suboptimal
> >> >> PCIe SerDes parameters being used on your board, which will cause bad
> >> >> link quality.
> >> >> The PCIe SerDes programming is done in U-Boot, so I suggest you do a
> >> >> U-Boot upgrade to our latest X-Gene U-Boot release.
> >> >
> >> > I installed U-Boot 1.15.12, which I thought was the latest.  I'm still
> >> > seeing this issue regularly, approx once/hour.
> >>
> >> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
> >> version to use. Are you running any PCIe traffic test when the error
> >> happens?
> >
> > Nope, the machine was either idle or running a reboot test; no PCIe stress
> > test or anything.
> >
> 
> Was there any conclusion on this?
> I am having a similar issue [1] on my Juno with the sky2 PCIe driver during reboot.

We found that the unhandled faults occurred when using an extender
card.  After removing the extender card, we didn't see the faults any
more.
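
For anyone reading the quoted oops, the value in parentheses (0x96000010)
is the arm64 exception syndrome. A minimal sketch of decoding it is below.
It assumes the standard ESR_ELx field layout from the Arm architecture
(EC in bits 31:26, IL in bit 25, and the data fault status code in bits 5:0
of the ISS for data aborts). The function and table names are made up for
illustration, and only the codes relevant to this report are named.

```python
# Decode an arm64 ESR_ELx value such as the 0x96000010 in the quoted oops.
# Only the few codes relevant to this thread are named; anything else is
# reported as "not decoded in this sketch".

EXCEPTION_CLASSES = {
    0x24: "data abort from a lower exception level",
    0x25: "data abort taken without a change in exception level",
}

DATA_FAULT_STATUS = {
    0x10: "synchronous external abort, not on a translation table walk",
}


def decode_esr(esr):
    """Split an ESR_ELx value into its EC, IL and DFSC fields."""
    ec = (esr >> 26) & 0x3F   # bits [31:26]: exception class
    il = (esr >> 25) & 0x1    # bit 25: set when a 32-bit instruction trapped
    dfsc = esr & 0x3F         # bits [5:0]: fault status code (data aborts)
    return {
        "ec": ec,
        "ec_name": EXCEPTION_CLASSES.get(ec, "not decoded in this sketch"),
        "il": il,
        "dfsc": dfsc,
        "dfsc_name": DATA_FAULT_STATUS.get(dfsc, "not decoded in this sketch"),
    }


if __name__ == "__main__":
    for field, value in decode_esr(0x96000010).items():
        print(field, value)
```

Read this way, the syndrome says the load in pci_generic_config_read32
itself was aborted by something outside the CPU, which fits a config read
that the link or device did not complete. It does not say why.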

> [1] http://marc.info/?l=linux-netdev&m=146046999701956&w=2

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2016-04-13 13:21             ` Bjorn Helgaas
@ 2016-04-13 13:29               ` Sudeep Holla
  -1 siblings, 0 replies; 49+ messages in thread
From: Sudeep Holla @ 2016-04-13 13:29 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Sudeep Holla, Bjorn Helgaas, Duc Dang, Tanmay Inamdar, linux-pci,
	linux-arm, linux-kernel



On 13/04/16 14:21, Bjorn Helgaas wrote:
> On Wed, Apr 13, 2016 at 10:58:18AM +0100, Sudeep Holla wrote:
>> Hi,
>>
>> (sorry for replying on the old thread, but I found it could be related
>> to the issue I have now)
>>
>> On Tue, Jul 28, 2015 at 10:29 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:

[...]

>>>>
>>>> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
>>>> version to use. Are you running any PCIe traffic test when the error
>>>> happens?
>>>
>>> Nope, the machine was either idle or running a reboot test; no PCIe stress
>>> test or anything.
>>>
>>
>> Was there any conclusion on this?
>> I am having a similar issue [1] on my Juno with the sky2 PCIe driver during reboot.
>
> We found that the unhandled faults occurred when using an extender
> card.  After removing the extender card, we didn't see the faults any
> more.
>

Thanks for the response. It's not related, then. I saw the report referencing
reboot tests and hence linked them together. Sorry for the noise.

-- 
Regards,
Sudeep

>> [1] http://marc.info/?l=linux-netdev&m=146046999701956&w=2

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: X-Gene: Unhandled fault: synchronous external abort in pci_generic_config_read32
  2016-04-13 13:29               ` Sudeep Holla
@ 2016-04-13 22:17                 ` Jon Masters
  -1 siblings, 0 replies; 49+ messages in thread
From: Jon Masters @ 2016-04-13 22:17 UTC (permalink / raw)
  To: Sudeep Holla, Bjorn Helgaas
  Cc: Bjorn Helgaas, Duc Dang, Tanmay Inamdar, linux-pci, linux-arm,
	linux-kernel

On 04/13/2016 09:29 AM, Sudeep Holla wrote:
> 
> 
> On 13/04/16 14:21, Bjorn Helgaas wrote:
>> On Wed, Apr 13, 2016 at 10:58:18AM +0100, Sudeep Holla wrote:
>>> Hi,
>>>
>>> (sorry for replying on the old thread, but I found it could be related
>>> to the issue I have now)
>>>
>>> On Tue, Jul 28, 2015 at 10:29 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> 
> [...]
> 
>>>>>
>>>>> Our latest U-Boot is 1.15.15, but U-Boot 1.15.12 is already a good
>>>>> version to use. Are you running any PCIe traffic test when the error
>>>>> happens?
>>>>
>>>> Nope, the machine was either idle or running a reboot test; no PCIe stress
>>>> test or anything.
>>>>
>>>
>>> Was there any conclusion on this?
>>> I am having a similar issue[1] on my Juno with the sky2 PCIe driver during
>>> reboot.
>>
>> We found that the unhandled faults occurred when using an extender
>> card.  After removing the extender card, we didn't see the faults any
>> more.
>>
> 
> Thanks for the response. It's not related then; I saw a report referencing
> reboot tests and hence linked them together. Sorry for the noise.

For the record, I've had success with this cable on X-Gene:

http://www.amazon.com/PCI-E-Riser-Flexible-Ribbon-Extension/dp/B00H8VVD00?ie=UTF8&psc=1&redirect=true&ref_=oh_aui_search_detailpage

But it's hit or miss. The only public platform where I've been reliably
able to use an extender cable so far is AMD Seattle. On that platform,
the PCIe IP is so rock solid that I can talk to very funky PCIe IP I've
implemented myself in an FPGA (and I can see link quality is fine too).

There's one other non-public platform so far where PCIe extenders work
without a single hitch as well, and a number where more work is needed.

Jon.

-- 
Computer Architect


