From: Bjorn Helgaas <helgaas@kernel.org>
To: Sean V Kelley <seanvk.dev@oregontracks.org>, Jonathan.Cameron@huawei.com
Cc: bhelgaas@google.com, rafael.j.wysocki@intel.com,
	ashok.raj@intel.com, tony.luck@intel.com,
	sathyanarayanan.kuppuswamy@intel.com, qiuxu.zhuo@intel.com,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	Sean V Kelley <sean.v.kelley@intel.com>
Subject: Re: [PATCH v9 12/15] PCI/RCEC: Add RCiEP's linked RCEC to AER/ERR
Date: Fri, 16 Oct 2020 15:30:37 -0500
Message-ID: <20201016203037.GA90074@bjorn-Precision-5520>
In-Reply-To: <20201016001113.2301761-13-seanvk.dev@oregontracks.org>

[+to Jonathan]

On Thu, Oct 15, 2020 at 05:11:10PM -0700, Sean V Kelley wrote:
> From: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> 
> When attempting error recovery for an RCiEP associated with an RCEC device,
> there needs to be a way to update the Root Error Status, the Uncorrectable
> Error Status and the Uncorrectable Error Severity of the parent RCEC.  In
> some non-native cases in which there is no OS-visible device associated
> with the RCiEP, there is nothing to act upon as the firmware is acting
> before the OS.
> 
> Add handling for the linked RCEC in AER/ERR while taking into account
> non-native cases.
> 
> Co-developed-by: Sean V Kelley <sean.v.kelley@intel.com>
> Link: https://lore.kernel.org/r/20201002184735.1229220-12-seanvk.dev@oregontracks.org
> Signed-off-by: Sean V Kelley <sean.v.kelley@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  drivers/pci/pcie/aer.c | 53 ++++++++++++++++++++++++++++++------------
>  drivers/pci/pcie/err.c | 20 ++++++++--------
>  2 files changed, 48 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 65dff5f3457a..083f69b67bfd 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1357,27 +1357,50 @@ static int aer_probe(struct pcie_device *dev)
>   */
>  static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>  {
> -	int aer = dev->aer_cap;
> +	int type = pci_pcie_type(dev);
> +	struct pci_dev *root;
> +	int aer = 0;
> +	int rc = 0;
>  	u32 reg32;
> -	int rc;
>  
> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_END)

"type == PCI_EXP_TYPE_RC_END", since "type" is already computed above.

> +		/*
> +		 * The reset should only clear the Root Error Status
> +		 * of the RCEC. Only perform this for the
> +		 * native case, i.e., an RCEC is present.
> +		 */
> +		root = dev->rcec;
> +	else
> +		root = dev;
>  
> -	/* Disable Root's interrupt in response to error messages */
> -	pci_read_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, &reg32);
> -	reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
> -	pci_write_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, reg32);
> +	if (root)
> +		aer = dev->aer_cap;
>  
> -	rc = pci_bus_error_reset(dev);
> -	pci_info(dev, "Root Port link has been reset\n");
> +	if (aer) {
> +		/* Disable Root's interrupt in response to error messages */
> +		pci_read_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, &reg32);
> +		reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
> +		pci_write_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, reg32);

Not directly related to *this* patch, but my assumption was that in
the APEI case, the firmware should retain ownership of the AER
Capability, so the OS should not touch PCI_ERR_ROOT_COMMAND and
PCI_ERR_ROOT_STATUS.

But this code appears to ignore that ownership.  Jonathan, you must
have looked at this recently for 068c29a248b6 ("PCI/ERR: Clear PCIe
Device Status errors only if OS owns AER").  Do you have any insight
about this?

> -	/* Clear Root Error Status */
> -	pci_read_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, &reg32);
> -	pci_write_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, reg32);
> +		/* Clear Root Error Status */
> +		pci_read_config_dword(root, aer + PCI_ERR_ROOT_STATUS, &reg32);
> +		pci_write_config_dword(root, aer + PCI_ERR_ROOT_STATUS, reg32);
>  
> -	/* Enable Root Port's interrupt in response to error messages */
> -	pci_read_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, &reg32);
> -	reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> -	pci_write_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, reg32);
> +		/* Enable Root Port's interrupt in response to error messages */
> +		pci_read_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, &reg32);
> +		reg32 |= ROOT_PORT_INTR_ON_MESG_MASK;
> +		pci_write_config_dword(root, aer + PCI_ERR_ROOT_COMMAND, reg32);
> +	}
> +
> +	if ((type == PCI_EXP_TYPE_RC_EC) || (type == PCI_EXP_TYPE_RC_END)) {
> +		if (pcie_has_flr(root)) {
> +			rc = pcie_flr(root);
> +			pci_info(dev, "has been reset (%d)\n", rc);
> +		}
> +	} else {
> +		rc = pci_bus_error_reset(root);

Don't we want "dev" for both the FLR and pci_bus_error_reset()?  I
think "root == dev" except when dev is an RCiEP.  When dev is an
RCiEP, "root" is the RCEC (if present), and we want to reset the
RCiEP, not the RCEC.
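
To make the distinction concrete, here is a rough userspace sketch, not kernel code.  The names (sketch_dev, sketch_root_for(), sketch_reset_target()) are hypothetical stand-ins and do not exist in the tree.  It only shows that the device whose Root Error registers we touch can differ from the device we reset:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real struct pci_dev or PCI_EXP_TYPE_*. */
enum sketch_type { SKETCH_ROOT_PORT, SKETCH_RC_EC, SKETCH_RC_END };

struct sketch_dev {
	enum sketch_type type;
	struct sketch_dev *rcec;	/* linked RCEC, NULL if none is visible */
};

/* Device whose Root Error registers are touched: the RCEC for an RCiEP. */
static struct sketch_dev *sketch_root_for(struct sketch_dev *dev)
{
	assert(dev != NULL);
	return dev->type == SKETCH_RC_END ? dev->rcec : dev;
}

/* Device that gets reset: always the device passed in, never its RCEC. */
static struct sketch_dev *sketch_reset_target(struct sketch_dev *dev)
{
	assert(dev != NULL);
	return dev;
}
```

With this split, an RCiEP with no visible RCEC yields a NULL "root" (nothing to clear) but still has a valid reset target.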

> +		pci_info(dev, "Root Port link has been reset (%d)\n", rc);
> +	}

There are a couple changes here that I think should be split out.

Based on my theory that when firmware retains control of AER, the OS
should not touch PCI_ERR_ROOT_COMMAND and PCI_ERR_ROOT_STATUS, and any
updates to them would have to be done by firmware before we get here,
I suggested reordering this:

  - clear PCI_ERR_ROOT_COMMAND ROOT_PORT_INTR_ON_MESG_MASK
  - do reset
  - clear PCI_ERR_ROOT_STATUS (for APEI, presumably done by firmware?)
  - enable PCI_ERR_ROOT_COMMAND ROOT_PORT_INTR_ON_MESG_MASK

to this:

  - clear PCI_ERR_ROOT_COMMAND ROOT_PORT_INTR_ON_MESG_MASK
  - clear PCI_ERR_ROOT_STATUS
  - enable PCI_ERR_ROOT_COMMAND ROOT_PORT_INTR_ON_MESG_MASK
  - do reset

If my theory is correct, I think we should still reorder this, but:

  - It's a significant behavior change that deserves its own patch so
    we can document/bisect/revert.

  - I'm not sure why we clear the PCI_ERR_ROOT_COMMAND error reporting
    bits.  In the new "clear COMMAND, clear STATUS, enable COMMAND"
    order, it looks superfluous.  There's no reason to disable error
    reporting while clearing the status bits.

    The current "clear, reset, enable" order suggests that the reset
    might cause errors that we should ignore.  I don't know whether
    that's the case or not.  It dates from 6c2b374d7485 ("PCI-Express
    AER implemetation: AER core and aerdriver"), which doesn't
    elaborate.

  - Should we also test for OS ownership of AER before touching
    PCI_ERR_ROOT_STATUS?

  - If we remove the PCI_ERR_ROOT_COMMAND fiddling (and I tentatively
    think we *should* unless we can justify it), that would also
    deserve its own patch.  Possibly (1) remove PCI_ERR_ROOT_COMMAND
    fiddling, (2) reorder PCI_ERR_ROOT_STATUS clearing and reset, (3)
    test for OS ownership of AER (?), (4) the rest of this patch.
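
As a sketch of where steps (1) through (3) might end up, under the assumption that my theory about firmware ownership holds: the model below is plain userspace C with hypothetical names (sketch_root, os_owns_aer, sketch_recover()), not the kernel API.  It models PCI_ERR_ROOT_STATUS as a write-1-to-clear register, drops the PCI_ERR_ROOT_COMMAND fiddling, clears status before the reset, and skips the clear when the OS does not own AER:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a Root Port or RCEC; not a kernel structure. */
struct sketch_root {
	uint32_t root_status;	/* models PCI_ERR_ROOT_STATUS (RW1C bits) */
	bool os_owns_aer;	/* assumption: true if OS, not firmware, owns AER */
	int resets;		/* how many resets were performed */
};

/* Write-1-to-clear: every bit set in val is cleared in the register. */
static void sketch_rw1c(uint32_t *reg, uint32_t val)
{
	*reg &= ~val;
}

/* Proposed order: clear status (only if the OS owns AER), then reset. */
static int sketch_recover(struct sketch_root *root)
{
	assert(root != NULL);

	if (root->os_owns_aer)
		sketch_rw1c(&root->root_status, root->root_status);

	root->resets++;		/* stands in for the FLR or bus reset */
	return 0;
}
```

In the firmware-owned case the status bits are left exactly as firmware set them, and the reset still happens.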

>  	return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
>  }
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 7883c9791562..cbc5abfe767b 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -148,10 +148,10 @@ static int report_resume(struct pci_dev *dev, void *data)
>  
>  /**
>   * pci_walk_bridge - walk bridges potentially AER affected
> - * @bridge:	bridge which may be a Port, an RCEC with associated RCiEPs,
> - *		or an RCiEP associated with an RCEC
> - * @cb:		callback to be called for each device found
> - * @userdata:	arbitrary pointer to be passed to callback
> + * @bridge   bridge which may be an RCEC with associated RCiEPs,
> + *           or a Port.
> + * @cb       callback to be called for each device found
> + * @userdata arbitrary pointer to be passed to callback.
>   *
>   * If the device provided is a bridge, walk the subordinate bus, including
>   * any bridged devices on buses under this bus.  Call the provided callback
> @@ -164,8 +164,14 @@ static void pci_walk_bridge(struct pci_dev *bridge,
>  			    int (*cb)(struct pci_dev *, void *),
>  			    void *userdata)
>  {
> +	/*
> +	 * In a non-native case where there is no OS-visible reporting
> +	 * device the bridge will be NULL, i.e., no RCEC, no Downstream Port.
> +	 */
>  	if (bridge->subordinate)
>  		pci_walk_bus(bridge->subordinate, cb, userdata);
> +	else if (bridge->rcec)
> +		cb(bridge->rcec, userdata);
>  	else
>  		cb(bridge, userdata);
>  }
> @@ -194,12 +200,6 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  	pci_dbg(bridge, "broadcast error_detected message\n");
>  	if (state == pci_channel_io_frozen) {
>  		pci_walk_bridge(bridge, report_frozen_detected, &status);
> -		if (type == PCI_EXP_TYPE_RC_END) {
> -			pci_warn(dev, "subordinate device reset not possible for RCiEP\n");
> -			status = PCI_ERS_RESULT_NONE;
> -			goto failed;
> -		}
> -
>  		status = reset_subordinates(bridge);
>  		if (status != PCI_ERS_RESULT_RECOVERED) {
>  			pci_warn(bridge, "subordinate device reset failed\n");
> -- 
> 2.28.0
> 
