From: Christoph Hellwig <hch@infradead.org>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Kai-Heng Feng <kai.heng.feng@canonical.com>,
	Joerg Roedel <jroedel@suse.de>,
	"open list:PCI ENHANCED ERROR HANDLING (EEH) FOR POWERPC" 
	<linuxppc-dev@lists.ozlabs.org>,
	"open list:PCI SUBSYSTEM" <linux-pci@vger.kernel.org>,
	open list <linux-kernel@vger.kernel.org>,
	Lalithambika Krishnakumar <lalithambika.krishnakumar@intel.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Oliver O'Halloran <oohall@gmail.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Mika Westerberg <mika.westerberg@linux.intel.com>,
	Lu Baolu <baolu.lu@linux.intel.com>
Subject: Re: [PATCH 1/2] PCI/AER: Disable AER interrupt during suspend
Date: Fri, 23 Jul 2021 06:24:22 +0100
Message-ID: <YPpShrTa448OpGjA@infradead.org>
In-Reply-To: <20210722222351.GA354095@bjorn-Precision-5520>

On Thu, Jul 22, 2021 at 05:23:51PM -0500, Bjorn Helgaas wrote:
> Marking both of these as "not applicable" for now because I don't
> think we really understand what's going on.
> 
> Apparently a DMA occurs during suspend or resume and triggers an ACS
> violation.  I don't think such a DMA should occur in the first
> place.
> 
> Or maybe, since you say the problem happens right after ACS is enabled
> during resume, we're doing the ACS enable incorrectly?  Although I
> would think we should not be doing DMA at the same time we're enabling
> ACS, either.
> 
> If this really is a system firmware issue, both HP and Dell should
> have the knowledge and equipment to figure out what's going on.

DMA on resume sounds really odd.  OTOH the below-mentioned case of
a DMA during suspend seems very likely in some setups.  NVMe has the
concept of a host memory buffer (HMB) that allows the PCIe device
to use arbitrary host memory for internal purposes.  Combine this
with the "Storage D3" misfeature in modern x86 platforms that forces
a slot into d3cold without consulting the driver first and you'd see
symptoms like this.  Another case would be the NVMe equivalent of the
AER, which could lead to a completion without host activity.

We now have quirks in the ACPI layer and NVMe to fully shut down the
NVMe controllers on these messed-up systems with the "Storage D3"
misfeature, which should avoid such "spurious" DMAs at the cost of
wearing out the device much faster.
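
To make the reasoning above concrete, here is a minimal sketch of the
decision being described.  It is an illustrative model only: the struct,
the enum and both function names are hypothetical and are not the kernel's
actual data structures or APIs.  It assumes a device can leave host memory
in use across suspend only when it has an HMB and stays in a low-power
state, and that the quirk turns "Storage D3" platforms into a full
shutdown.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only: every name here is hypothetical, not kernel code. */
enum suspend_action {
	SUSPEND_LOW_POWER,	/* keep the controller alive in a low-power state */
	SUSPEND_FULL_SHUTDOWN,	/* shut the controller down completely */
};

struct nvme_model {
	bool hmb_enabled;		/* device is using a host memory buffer */
	bool platform_storage_d3;	/* platform may force the slot into d3cold */
};

/* With the quirk enabled, "Storage D3" platforms always get a full shutdown. */
enum suspend_action choose_suspend_action(const struct nvme_model *m,
					  bool quirk_enabled)
{
	if (quirk_enabled && m->platform_storage_d3)
		return SUSPEND_FULL_SHUTDOWN;
	return SUSPEND_LOW_POWER;
}

/*
 * A "spurious" DMA is possible in this model when the device still owns
 * host memory while the platform removes power without asking the driver.
 */
bool spurious_dma_possible(const struct nvme_model *m,
			   enum suspend_action action)
{
	return m->hmb_enabled && m->platform_storage_d3 &&
	       action == SUSPEND_LOW_POWER;
}

/* Small self-check of the model; call it from a test harness or main(). */
void self_check(void)
{
	struct nvme_model m = { .hmb_enabled = true, .platform_storage_d3 = true };

	assert(spurious_dma_possible(&m, choose_suspend_action(&m, false)));
	assert(!spurious_dma_possible(&m, choose_suspend_action(&m, true)));
}
```

In this model the quirk removes the spurious-DMA window at the price of a
full controller shutdown on every suspend, which is the trade-off
mentioned above.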

Thread overview: 18+ messages
2021-01-27 17:31 [PATCH 1/2] PCI/AER: Disable AER interrupt during suspend Kai-Heng Feng
2021-01-27 17:31 ` Kai-Heng Feng
2021-01-27 17:31 ` [PATCH 2/2] PCI/DPC: Disable DPC " Kai-Heng Feng
2021-01-27 17:31   ` Kai-Heng Feng
2021-01-27 20:50 ` [PATCH 1/2] PCI/AER: Disable AER " Bjorn Helgaas
2021-01-27 20:50   ` Bjorn Helgaas
2021-01-28  4:09   ` Kai-Heng Feng
2021-01-28  4:09     ` Kai-Heng Feng
2021-02-04 23:27     ` Bjorn Helgaas
2021-02-04 23:27       ` Bjorn Helgaas
2021-02-05 15:17       ` Kai-Heng Feng
2021-02-05 15:17         ` Kai-Heng Feng
2021-07-22 22:23         ` Bjorn Helgaas
2021-07-22 22:23           ` Bjorn Helgaas
2021-07-23  5:24           ` Christoph Hellwig [this message]
2021-07-23  5:24             ` Christoph Hellwig
2021-07-23  7:05             ` Kai-Heng Feng
2021-07-23  7:05               ` Kai-Heng Feng
