From: Bjorn Helgaas <helgaas@kernel.org>
To: Jerry Hoemann <jerry.hoemann@hpe.com>
Cc: Kairui Song <kasong@redhat.com>, Baoquan He <bhe@redhat.com>,
	Deepa Dinamani <deepa.kernel@gmail.com>,
	jroedel@suse.de, Myron Stowe <myron.stowe@redhat.com>,
	linux-pci@vger.kernel.org, kexec@lists.infradead.org,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Randy Wright <rwright@hpe.com>, Dave Young <dyoung@redhat.com>,
	Khalid Aziz <khalid@gonehiking.org>
Subject: Re: [RFC PATCH] PCI, kdump: Clear bus master bit upon shutdown in kdump kernel
Date: Wed, 22 Jul 2020 19:00:12 -0500
Message-ID: <20200723000012.GA1325359@bjorn-Precision-5520>
In-Reply-To: <20200722215048.GL220876@anatevka.americas.hpqcorp.net>

On Wed, Jul 22, 2020 at 03:50:48PM -0600, Jerry Hoemann wrote:
> On Wed, Jul 22, 2020 at 10:21:23AM -0500, Bjorn Helgaas wrote:
> > On Wed, Jul 22, 2020 at 10:52:26PM +0800, Kairui Song wrote:

> > > I think I didn't make one thing clear: the PCI UR error never arrives
> > > in the kernel. The iLO BMC on that HPE machine caught the error and
> > > sent the kernel an NMI. The kernel panicked on the NMI. I'm still
> > > trying to figure out why the NMI hung the kernel, even with panic=-1,
> > > panic_on_io_nmi, and panic_on_unknown_nmi all set. But if we can avoid
> > > the NMI by shutting down the devices in the right order, that's also a
> > > solution.

ACPI v6.3, chapter 18, does mention NMIs several times, e.g., Table
18-394 and sec 18.4.  I'm not familiar enough with APEI to know
whether Linux correctly supports all those cases.  Maybe this is a
symptom that we don't?

> > I'm not sure how much sympathy to have for this situation.  A PCIe UR
> > is fatal for the transaction and maybe even the device, but from the
> > overall system point of view, it *should* be a recoverable error and
> > we shouldn't panic.
> > 
> > Errors like that should be reported via the normal AER or ACPI/APEI
> > mechanisms.  It sounds like in this case, the platform has decided
> > these aren't enough and it is trying to force a reboot?  If this is
> > "special" platform behavior, I'm not sure how much we need to cater
> > for it.
> 
> Are these AER errors the type processed by the GHES code?

My understanding from ACPI v6.3, sec 18.3.2, is that the Hardware
Error Source Table may contain Error Source Descriptors of types like:

  IA-32 Machine Check Exception
  IA-32 Corrected Machine Check
  IA-32 Non-Maskable Interrupt
  PCIe Root Port AER
  PCIe Device AER
  Generic Hardware Error Source (GHES)
  Hardware Error Notification
  IA-32 Deferred Machine Check

I would naively expect PCIe UR errors to be reported via one of the
PCIe Error Sources, not GHES, but maybe there's some reason to use
GHES.
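
To make the PCIe-vs-GHES distinction concrete, here is a rough sketch of
how the descriptor types could be grouped. The numeric type values are
my reading of ACPI v6.3, sec 18.3.2; the dictionary, the set names, and
classify_hest_source() are hypothetical and are not kernel identifiers.
The spec also defines PCIe Bridge AER and GHESv2 types, which I've
included. Hardware Error Notification is left out because, as far as I
can tell, it is a structure embedded in other descriptors rather than a
type value of its own.

```python
# Sketch only: HEST Error Source Descriptor type values as read from
# ACPI v6.3, sec 18.3.2. All names here are illustrative shorthand.
HEST_SOURCE_TYPES = {
    0: "IA-32 Machine Check Exception",
    1: "IA-32 Corrected Machine Check",
    2: "IA-32 Non-Maskable Interrupt",
    6: "PCIe Root Port AER",
    7: "PCIe Device AER",
    8: "PCIe Bridge AER",
    9: "Generic Hardware Error Source (GHES)",
    10: "Generic Hardware Error Source v2 (GHESv2)",
    11: "IA-32 Deferred Machine Check",
}

# Hypothetical groupings used only for this illustration.
PCIE_AER_TYPES = {6, 7, 8}
GHES_TYPES = {9, 10}


def classify_hest_source(type_id):
    """Return 'pcie-aer', 'ghes', 'other', or 'unknown' for a type value."""
    if type_id not in HEST_SOURCE_TYPES:
        return "unknown"
    if type_id in PCIE_AER_TYPES:
        return "pcie-aer"
    if type_id in GHES_TYPES:
        return "ghes"
    return "other"
```

With this grouping, a UR reported through a Root Port AER source would
classify as "pcie-aer", while the same event routed through a generic
source would classify as "ghes".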

The kernel should already know how to deal with the PCIe AER errors,
but we'd have to add new device-specific code to handle things
reported via GHES, along the lines of what Shiju is doing here:

  https://lore.kernel.org/r/20200722104245.1060-1-shiju.jose@huawei.com

> I'll note that RedHat runs their crash kernel with:  hest_disable.
> So, the ghes code is disabled in the crash kernel.

That would disable all the HEST error sources, including the PCIe AER
ones as well as GHES ones.  If we turn off some of the normal error
handling mechanisms, I guess we have to expect that some errors won't
be handled correctly.
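
For anyone who wants to confirm whether a given crash kernel was booted
this way, a minimal sketch follows. It only looks for the hest_disable
token on the kernel command line; the helper names are hypothetical, and
it does an exact token match rather than mirroring how the kernel itself
parses parameters.

```python
def cmdline_has_param(cmdline, name):
    """Return True if `name` appears as a bare flag or as name=value."""
    for token in cmdline.split():
        # Split off any "=value" suffix and compare the key exactly.
        key = token.split("=", 1)[0]
        if key == name:
            return True
    return False


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    if cmdline_has_param(read_cmdline(), "hest_disable"):
        print("hest_disable is set; HEST error sources are not in use")
    else:
        print("hest_disable is not set")
```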

Bjorn
