linux-pci.vger.kernel.org archive mirror
From: Keith Busch <kbusch@kernel.org>
To: Hinko Kocevar <hinko.kocevar@ess.eu>
Cc: "Kelley, Sean V" <sean.v.kelley@intel.com>,
	Linux PCI <linux-pci@vger.kernel.org>,
	Bjorn Helgaas <helgaas@kernel.org>
Subject: Re: [PATCHv2 0/5] aer handling fixups
Date: Tue, 12 Jan 2021 15:17:44 -0800
Message-ID: <20210112231744.GB1508433@dhcp-10-100-145-180.wdc.com>
In-Reply-To: <8650281b-4430-1938-5d45-53f09010497b@ess.eu>

On Tue, Jan 12, 2021 at 11:19:37PM +0100, Hinko Kocevar wrote:
> I feel inclined to provide a little bit more info about the system I'm
> running this on as it is not a regular PC/server/laptop. It is a modular
> micro TCA system with a single CPU and MCH. MCH and CPU are separate cards,
> as are the other processing cards (AMCs) that link up to CPU through the MCH
> PEX8748 switch. I can power each card individually, or perform complete
> system power cycle. The normal power up sequence is: MCH, AMCs, CPU. The CPU
> is powered 30 sec after all other cards so that their PCIe links are up and
> ready for Linux.
> 
> All buses below CPU side 02:01.0 are on MCH PEX8748 switch:
> 
> [dev@bd-cpu18 ~]$ sudo /usr/local/bin/pcicrawler -t
> 00:01.0 root_port, "J6B2", slot 1, device present, speed 8GT/s, width x8
>  ├─01:00.0 upstream_port, PLX Technology, Inc. (10b5), device 8725
>  │  ├─02:01.0 downstream_port, slot 1, device present, power: Off, speed 8GT/s, width x4
>  │  │  └─03:00.0 upstream_port, PLX Technology, Inc. (10b5) PEX 8748 48-Lane, 12-Port PCI Express Gen 3 (8 GT/s) Switch, 27 x 27mm FCBGA (8748)
>  │  │     ├─04:01.0 downstream_port, slot 4, power: Off
>  │  │     ├─04:03.0 downstream_port, slot 3, power: Off
>  │  │     ├─04:08.0 downstream_port, slot 5, power: Off
>  │  │     ├─04:0a.0 downstream_port, slot 6, device present, power: Off, speed 8GT/s, width x4
>  │  │     │  └─08:00.0 endpoint, Xilinx Corporation (10ee), device 8034
>  │  │     └─04:12.0 downstream_port, slot 1, power: Off
>  │  ├─02:02.0 downstream_port, slot 2
>  │  ├─02:08.0 downstream_port, slot 8
>  │  ├─02:09.0 downstream_port, slot 9, power: Off
>  │  └─02:0a.0 downstream_port, slot 10
>  ├─01:00.1 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  ├─01:00.2 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  ├─01:00.3 endpoint, PLX Technology, Inc. (10b5), device 87d0
>  └─01:00.4 endpoint, PLX Technology, Inc. (10b5), device 87d0
> 
> 
> The lockups most frequently appear after the cold boot of the system. If I
> restart the CPU card only, and leave the MCH (where the PEX8748 switch
> resides) powered, the lockups do *not* happen. I'm injecting the same error
> into the root port and the system card configuration/location/count is
> always the same.
> 
> Nevertheless, on rare occasions while booting the same kernel image after
> a complete system power cycle, no lockup is observed.
> 
> So far I observed that the lockups seem to always happen when recovery is
> dealing with the 02:01.0 device/bus.
> 
> If the system recovers from a first injected error, I can repeat the
> injection and the system recovers always. If the first recovery fails I have
> to either reboot the CPU or power cycle the complete system.
> 
> To me it looks like this behavior is somehow related to the system/setup I
> have, and for some reason is triggered by VC restoration (VC is not in use
> by my system at all, AFAIK).
 
> Are you able to tell which part of the code the CPU is actually spinning in
> when the lockup is detected? I added many printk()s in the
> pci_restore_vc_state(), in the AER IRQ handler, and around to see something
> being continuously printed, but nothing appeared..

It sounds like your setup is having difficulty completing config
cycles in a timely manner after a secondary bus reset. Right now I
don't see how anything I've provided in this series is causing that.
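
One way to check whether config cycles really are slow is to time a single
config read from user space. The following is only a minimal sketch, not
something from this series: the helper name timed_config_read() is
hypothetical, and it assumes the device's config space is readable through
the sysfs "config" file (for example /sys/bus/pci/devices/<BDF>/config).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one 32-bit dword at `offset` from a config
 * space file and report how long the read took, in nanoseconds.
 * Returns 0 on success, -1 on open failure or short read.
 * Assumption: `path` is a sysfs "config" file the caller may read.
 */
int timed_config_read(const char *path, off_t offset,
                      uint32_t *value, int64_t *elapsed_ns)
{
    uint8_t buf[4];
    struct timespec t0, t1;
    ssize_t n;
    int fd;

    assert(path && value && elapsed_ns);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Time only the read itself, not the open/close. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    n = pread(fd, buf, sizeof(buf), offset);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    if (n != (ssize_t)sizeof(buf))
        return -1;

    /* PCI config space is little-endian. */
    *value = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
             ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    *elapsed_ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
                  (int64_t)(t1.tv_nsec - t0.tv_nsec);
    return 0;
}
```

Calling this in a loop on the bridge or endpoint config file right after an
injected error, and comparing elapsed_ns against a run where recovery
succeeds, would show whether individual reads are taking unusually long.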

The stack traces you've provided so far are all within virtual channel
restoration. Subsequent stack traces are never the same though, so it
does not appear to be permanently stuck; it's just incredibly slow.
This particular capability restoration happens to require more config
cycles than most other capabilities, so I'm guessing it shows up in
your observation because of that rather than anything specific about
VC.

With delays this long, a Completion Timeout (CTO) should have kicked in,
but maybe your hardware isn't doing it right. Your lspci says Completion
Timeout configuration is not supported, so it should default to 50msec
maximum, but since it's taking long enough to trigger a stuck CPU
watchdog, and you appear to be getting valid data back, it doesn't look
like CTO is happening.
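
For reference, the Completion Timeout state that lspci reports comes from
the Device Capabilities 2 and Device Control 2 registers. The sketch below
decodes those fields from raw register values; the struct and function
names are made up for illustration, and the masks follow the PCIe register
layout (Ranges Supported in DevCap2 bits 3:0, Timeout Value in DevCtl2
bits 3:0, Timeout Disable in DevCtl2 bit 4).

```c
#include <stdbool.h>
#include <stdint.h>

/* Field masks for PCIe Device Capabilities 2 / Device Control 2. */
#define DEVCAP2_CTO_RANGES   0x0000000fu /* Completion Timeout Ranges Supported */
#define DEVCAP2_CTO_DIS_SUPP 0x00000010u /* Completion Timeout Disable Supported */
#define DEVCTL2_CTO_VALUE    0x000fu     /* Completion Timeout Value */
#define DEVCTL2_CTO_DISABLE  0x0010u     /* Completion Timeout Disable */

/* Hypothetical summary of the Completion Timeout configuration. */
struct cto_state {
    bool programmable;     /* at least one timeout range is advertised */
    bool disable_supported;
    bool disabled;         /* timeout mechanism switched off */
    bool default_range;    /* value 0000b: 50 us to 50 ms */
};

struct cto_state decode_cto(uint32_t devcap2, uint16_t devctl2)
{
    struct cto_state s;

    /* Ranges Supported == 0000b means the timeout is not programmable. */
    s.programmable = (devcap2 & DEVCAP2_CTO_RANGES) != 0;
    s.disable_supported = (devcap2 & DEVCAP2_CTO_DIS_SUPP) != 0;
    s.disabled = (devctl2 & DEVCTL2_CTO_DISABLE) != 0;
    s.default_range = (devctl2 & DEVCTL2_CTO_VALUE) == 0;
    return s;
}
```

A port that reports "not supported" corresponds to programmable == false
with default_range == true, which is the 50 msec maximum case described
above.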
