From: Bjorn Helgaas <helgaas@kernel.org>
To: Kai-Heng Feng <kai.heng.feng@canonical.com>
Cc: bhelgaas@google.com, Russell Currey <ruscur@russell.cc>,
	Oliver O'Halloran <oohall@gmail.com>,
	Mika Westerberg <mika.westerberg@linux.intel.com>,
	Lalithambika Krishnakumar <lalithambika.krishnakumar@intel.com>,
	Lu Baolu <baolu.lu@linux.intel.com>,
	Joerg Roedel <jroedel@suse.de>,
	"open list:PCI ENHANCED ERROR HANDLING (EEH) FOR POWERPC" 
	<linuxppc-dev@lists.ozlabs.org>,
	"open list:PCI SUBSYSTEM" <linux-pci@vger.kernel.org>,
	open list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/2] PCI/AER: Disable AER interrupt during suspend
Date: Wed, 27 Jan 2021 14:50:53 -0600
Message-ID: <20210127205053.GA3049358@bjorn-Precision-5520>
In-Reply-To: <20210127173101.446940-1-kai.heng.feng@canonical.com>

On Thu, Jan 28, 2021 at 01:31:00AM +0800, Kai-Heng Feng wrote:
> Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
> hint") enables ACS, and some platforms lose their NVMe after resume from
> firmware:
> [   50.947816] pcieport 0000:00:1b.0: DPC: containment event, status:0x1f01 source:0x0000
> [   50.947817] pcieport 0000:00:1b.0: DPC: unmasked uncorrectable error detected
> [   50.947829] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> [   50.947830] pcieport 0000:00:1b.0:   device [8086:06ac] error status/mask=00200000/00010000
> [   50.947831] pcieport 0000:00:1b.0:    [21] ACSViol                (First)
> [   50.947841] pcieport 0000:00:1b.0: AER: broadcast error_detected message
> [   50.947843] nvme nvme0: frozen state error detected, reset controller
> 
> It happens right after ACS gets enabled during resume.
> 
> To prevent that from happening, disable the AER interrupt on system
> suspend and re-enable it on resume.

Lots of questions here.  Maybe this is what we'll end up doing, but I
am curious about why the error is reported in the first place.

Is this a consequence of the link going down and back up?

Is it a consequence of the device doing a DMA when it shouldn't?

Are we doing something in the wrong order during suspend?  Or maybe
resume, since I assume the error is reported during resume?

If we *do* take the error, why doesn't DPC recovery work?

> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209149
> Fixes: 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in hint")
> Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
> ---
>  drivers/pci/pcie/aer.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 77b0f2c45bc0..0e9a85530ae6 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1365,6 +1365,22 @@ static int aer_probe(struct pcie_device *dev)
>  	return 0;
>  }
>  
> +static int aer_suspend(struct pcie_device *dev)
> +{
> +	struct aer_rpc *rpc = get_service_data(dev);
> +
> +	aer_disable_rootport(rpc);
> +	return 0;
> +}
> +
> +static int aer_resume(struct pcie_device *dev)
> +{
> +	struct aer_rpc *rpc = get_service_data(dev);
> +
> +	aer_enable_rootport(rpc);
> +	return 0;
> +}
> +
>  /**
>   * aer_root_reset - reset Root Port hierarchy, RCEC, or RCiEP
>   * @dev: pointer to Root Port, RCEC, or RCiEP
> @@ -1437,6 +1453,8 @@ static struct pcie_port_service_driver aerdriver = {
>  	.service	= PCIE_PORT_SERVICE_AER,
>  
>  	.probe		= aer_probe,
> +	.suspend	= aer_suspend,
> +	.resume		= aer_resume,
>  	.remove		= aer_remove,
>  };
>  
> -- 
> 2.29.2
> 
