All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Sean V Kelley" <sean.v.kelley@intel.com>
To: "Bjorn Helgaas" <helgaas@kernel.org>
Cc: "Sean V Kelley" <seanvk.dev@oregontracks.org>,
	bhelgaas@google.com, Jonathan.Cameron@huawei.com,
	rafael.j.wysocki@intel.com, ashok.raj@intel.com,
	tony.luck@intel.com, sathyanarayanan.kuppuswamy@intel.com,
	qiuxu.zhuo@intel.com, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v8 11/14] PCI/RCEC: Add RCiEP's linked RCEC to AER/ERR
Date: Fri, 09 Oct 2020 12:48:07 -0700	[thread overview]
Message-ID: <5D41DFBA-CF5D-4DFD-A6B2-A28B621CB228@intel.com> (raw)
In-Reply-To: <B3F01188-01A5-4807-B2C8-33AD5AD4A506@intel.com>

On 9 Oct 2020, at 11:53, Sean V Kelley wrote:

> On 9 Oct 2020, at 11:34, Sean V Kelley wrote:
>
>> On 9 Oct 2020, at 11:26, Sean V Kelley wrote:
>>
>>> Hi Bjorn,
>>>
>>> On 9 Oct 2020, at 10:57, Bjorn Helgaas wrote:
>>>
>>>> On Fri, Oct 02, 2020 at 11:47:32AM -0700, Sean V Kelley wrote:
>>>>> From: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
>>>>>
>>>>> When attempting error recovery for an RCiEP associated with an 
>>>>> RCEC device,
>>>>> there needs to be a way to update the Root Error Status, the 
>>>>> Uncorrectable
>>>>> Error Status and the Uncorrectable Error Severity of the parent 
>>>>> RCEC.
>>>>> In some non-native cases in which there is no OS visible device
>>>>> associated with the RCiEP, there is nothing to act upon as the 
>>>>> firmware
>>>>> is acting before the OS. So add handling for the linked 'rcec' in 
>>>>> AER/ERR
>>>>> while taking into account non-native cases.
>>>>>
>>>>> Co-developed-by: Sean V Kelley <sean.v.kelley@intel.com>
>>>>> Signed-off-by: Sean V Kelley <sean.v.kelley@intel.com>
>>>>> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
>>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>>> ---
>>>>>  drivers/pci/pcie/aer.c |  9 +++++----
>>>>>  drivers/pci/pcie/err.c | 39 
>>>>> ++++++++++++++++++++++++++++-----------
>>>>>  2 files changed, 33 insertions(+), 15 deletions(-)
>>>>>
>>>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>>>> index 65dff5f3457a..dccdba60b5d9 100644
>>>>> --- a/drivers/pci/pcie/aer.c
>>>>> +++ b/drivers/pci/pcie/aer.c
>>>>> @@ -1358,17 +1358,18 @@ static int aer_probe(struct pcie_device 
>>>>> *dev)
>>>>>  static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>>>>>  {
>>>>>  	int aer = dev->aer_cap;
>>>>> +	int rc = 0;
>>>>>  	u32 reg32;
>>>>> -	int rc;
>>>>> -
>>>>>
>>>>>  	/* Disable Root's interrupt in response to error messages */
>>>>>  	pci_read_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, &reg32);
>>>>>  	reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
>>>>>  	pci_write_config_dword(dev, aer + PCI_ERR_ROOT_COMMAND, reg32);
>>>>>
>>>>> -	rc = pci_bus_error_reset(dev);
>>>>> -	pci_info(dev, "Root Port link has been reset\n");
>>>>> +	if (pci_pcie_type(dev) != PCI_EXP_TYPE_RC_EC) {
>>>>> +		rc = pci_bus_error_reset(dev);
>>>>> +		pci_info(dev, "Root Port link has been reset\n");
>>>>> +	}
>>>>>
>>>>>  	/* Clear Root Error Status */
>>>>>  	pci_read_config_dword(dev, aer + PCI_ERR_ROOT_STATUS, &reg32);
>>>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>>>>> index 38abd7984996..956ad4c86d53 100644
>>>>> --- a/drivers/pci/pcie/err.c
>>>>> +++ b/drivers/pci/pcie/err.c
>>>>> @@ -149,7 +149,8 @@ static int report_resume(struct pci_dev *dev, 
>>>>> void *data)
>>>>>  /**
>>>>>   * pci_walk_bridge - walk bridges potentially AER affected
>>>>>   * @bridge   bridge which may be an RCEC with associated RCiEPs,
>>>>> - *           an RCiEP associated with an RCEC, or a Port.
>>>>> + *           or a Port.
>>>>> + * @dev      an RCiEP lacking an associated RCEC.
>>>>>   * @cb       callback to be called for each device found
>>>>>   * @userdata arbitrary pointer to be passed to callback.
>>>>>   *
>>>>> @@ -160,13 +161,20 @@ static int report_resume(struct pci_dev 
>>>>> *dev, void *data)
>>>>>   * If the device provided has no subordinate bus, call the 
>>>>> provided
>>>>>   * callback on the device itself.
>>>>>   */
>>>>> -static void pci_walk_bridge(struct pci_dev *bridge, int 
>>>>> (*cb)(struct pci_dev *, void *),
>>>>> +static void pci_walk_bridge(struct pci_dev *bridge, struct 
>>>>> pci_dev *dev,
>>>>> +			    int (*cb)(struct pci_dev *, void *),
>>>>>  			    void *userdata)
>>>>>  {
>>>>> -	if (bridge->subordinate)
>>>>> +	/*
>>>>> +	 * In a non-native case where there is no OS-visible reporting
>>>>> +	 * device the bridge will be NULL, i.e., no RCEC, no PORT.
>>>>> +	 */
>>>>> +	if (bridge && bridge->subordinate)
>>>>>  		pci_walk_bus(bridge->subordinate, cb, userdata);
>>>>> -	else
>>>>> +	else if (bridge)
>>>>>  		cb(bridge, userdata);
>>>>> +	else
>>>>> +		cb(dev, userdata);
>>>>>  }
>>>>>
>>>>>  static pci_ers_result_t flr_on_rciep(struct pci_dev *dev)
>>>>> @@ -196,16 +204,25 @@ pci_ers_result_t pcie_do_recovery(struct 
>>>>> pci_dev *dev,
>>>>>  	type = pci_pcie_type(dev);
>>>>>  	if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>>>>  	    type == PCI_EXP_TYPE_DOWNSTREAM ||
>>>>> -	    type == PCI_EXP_TYPE_RC_EC ||
>>>>> -	    type == PCI_EXP_TYPE_RC_END)
>>>>> +	    type == PCI_EXP_TYPE_RC_EC)
>>>>>  		bridge = dev;
>>>>> +	else if (type == PCI_EXP_TYPE_RC_END)
>>>>> +		bridge = dev->rcec;
>>>>>  	else
>>>>>  		bridge = pci_upstream_bridge(dev);
>>>>>
>>>>>  	pci_dbg(dev, "broadcast error_detected message\n");
>>>>>  	if (state == pci_channel_io_frozen) {
>>>>> -		pci_walk_bridge(bridge, report_frozen_detected, &status);
>>>>> +		pci_walk_bridge(bridge, dev, report_frozen_detected, &status);
>>>>>  		if (type == PCI_EXP_TYPE_RC_END) {
>>>>> +			/*
>>>>> +			 * The callback only clears the Root Error Status
>>>>> +			 * of the RCEC (see aer.c). Only perform this for the
>>>>> +			 * native case, i.e., an RCEC is present.
>>>>> +			 */
>>>>> +			if (bridge)
>>>>> +				reset_subordinate_devices(bridge);
>>>>
>>>> Help me understand this.  There are lots of callbacks in this 
>>>> picture,
>>>> but I guess this "callback only clears Root Error Status" must 
>>>> refer
>>>> to aer_root_reset(), i.e., the reset_subordinate_devices pointer?
>>>
>>> The ‘bridge’ in this case will always be dev->rcec, the event 
>>> collector for the associated RC_END. And that’s what’s being 
>>> cleared in aer.c via aer_root_reset() as the callback. It’s also 
>>> being checked for native/non-native here.
>>>
>>>>
>>>> Of course, the caller of pcie_do_recovery() supplied that pointer.
>>>> And we can infer that it must be aer_root_reset(), not
>>>> dpc_reset_link(), because RCECs and RCiEPs are not allowed to
>>>> implement DPC.
>>>
>>> Correct.
>>>
>>>>
>>>> I wish we didn't have either this assumption about what
>>>> reset_subordinate_devices points to, or the assumption about what
>>>> aer_root_reset() does.  They both seem a little bit tenuous.
>>>
>>> Agree. It’s the relationship between the RC_END and the RC_EC.
>>>
>>>>
>>>> We already made aer_root_reset() smart enough to check for RCECs.  
>>>> Can
>>>> we put the FLR there, too?  Then we wouldn't have this weird 
>>>> situation
>>>> where reset_subordinate_devices() does a reset and clears error
>>>> status, EXCEPT for this case where it only clears error status and 
>>>> we
>>>> do the reset here?
>>>
>>> We could add the smarts to aer_root_reset() to check for an RC_END 
>>> in that callback and perform the clear there on its RC_EC. We just 
>>> wouldn’t map ‘bridge’ to dev->rcec for RC_END in 
>>> pcie_do_recovery() which would simplify things.
>>>
>>> Further, the FLR in the case of flr_on_rciep() below is specific to 
>>> the RCiEP itself. So it could be performed either in 
>>> aer_root_reset() or remain in the pcie_do_recovery().
>>>
>>> That should work.
>>>
>>> Sean
>>
>> Thinking more on this, you could still pass dev->rcec to the callback 
>> (eventually aer_root_reset()), but you won’t have the ability to 
>> handle the FLR there without the pointer to the RC_END. That’s why 
>> I suggested passing the RC_END and checking for its RC_EC in 
>> aer_root_reset() if you want to also handle FLR there too from below.
>
> Where you get into trouble is that you don’t want to do anything in 
> the non-native case and shouldn’t be going to the callback. So this 
> would add complexity to aer_root_reset() for checking on dev->rcec == 
> NULL…where before you never would have invoked 
> reset_subordinate_devices() in the non-native case.
>
>

Unmapping of bridge to dev->rcec in pcie_do_recovery() also causes 
problems for pci_walk_bridge() which pays attention to the non-native 
case, even making available the dev for when there is no ‘bridge’ 
like device.

I think we may be just shifting things elsewhere despite the constraints 
remaining the same. So while it should work, I believe that I prefer the 
current approach.

Sean



>>
>> Sean
>>
>>>
>>>
>>>>
>>>>>  			status = flr_on_rciep(dev);
>>>>>  			if (status != PCI_ERS_RESULT_RECOVERED) {
>>>>>  				pci_warn(dev, "function level reset failed\n");
>>>>> @@ -219,13 +236,13 @@ pci_ers_result_t pcie_do_recovery(struct 
>>>>> pci_dev *dev,
>>>>>  			}
>>>>>  		}
>>>>>  	} else {
>>>>> -		pci_walk_bridge(bridge, report_normal_detected, &status);
>>>>> +		pci_walk_bridge(bridge, dev, report_normal_detected, &status);
>>>>>  	}
>>>>>
>>>>>  	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
>>>>>  		status = PCI_ERS_RESULT_RECOVERED;
>>>>>  		pci_dbg(dev, "broadcast mmio_enabled message\n");
>>>>> -		pci_walk_bridge(bridge, report_mmio_enabled, &status);
>>>>> +		pci_walk_bridge(bridge, dev, report_mmio_enabled, &status);
>>>>>  	}
>>>>>
>>>>>  	if (status == PCI_ERS_RESULT_NEED_RESET) {
>>>>> @@ -236,14 +253,14 @@ pci_ers_result_t pcie_do_recovery(struct 
>>>>> pci_dev *dev,
>>>>>  		 */
>>>>>  		status = PCI_ERS_RESULT_RECOVERED;
>>>>>  		pci_dbg(dev, "broadcast slot_reset message\n");
>>>>> -		pci_walk_bridge(bridge, report_slot_reset, &status);
>>>>> +		pci_walk_bridge(bridge, dev, report_slot_reset, &status);
>>>>>  	}
>>>>>
>>>>>  	if (status != PCI_ERS_RESULT_RECOVERED)
>>>>>  		goto failed;
>>>>>
>>>>>  	pci_dbg(dev, "broadcast resume message\n");
>>>>> -	pci_walk_bridge(bridge, report_resume, &status);
>>>>> +	pci_walk_bridge(bridge, dev, report_resume, &status);
>>>>>
>>>>>  	if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>>>>  	    type == PCI_EXP_TYPE_DOWNSTREAM ||
>>>>> -- 
>>>>> 2.28.0
>>>>>

  reply	other threads:[~2020-10-09 19:48 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-02 18:47 [PATCH v8 00/14] Add RCEC handling to PCI/AER Sean V Kelley
2020-10-02 18:47 ` [PATCH v8 01/14] PCI/RCEC: Add RCEC class code and extended capability Sean V Kelley
2020-10-02 18:47 ` [PATCH v8 02/14] PCI/RCEC: Bind RCEC devices to the Root Port driver Sean V Kelley
2020-10-02 18:47 ` [PATCH v8 03/14] PCI/RCEC: Cache RCEC capabilities in pci_init_capabilities() Sean V Kelley
2020-10-02 18:47 ` [PATCH v8 04/14] PCI/ERR: Rename reset_link() to reset_subordinate_device() Sean V Kelley
2020-10-02 18:47 ` [PATCH v8 05/14] PCI/ERR: Use "bridge" for clarity in pcie_do_recovery() Sean V Kelley
2020-10-02 18:47 ` [PATCH v8 06/14] PCI/ERR: Add pci_walk_bridge() to pcie_do_recovery() Sean V Kelley
2020-10-02 18:47 ` [PATCH v8 07/14] PCI/ERR: Limit AER resets in pcie_do_recovery() Sean V Kelley
2020-10-02 18:47 ` [PATCH v8 08/14] PCI/AER: Extend AER error handling to RCECs Sean V Kelley
2020-10-09 21:27   ` Bjorn Helgaas
2020-10-09 21:54     ` Sean V Kelley
2020-10-02 18:47 ` [PATCH v8 09/14] PCI/AER: Apply function level reset to RCiEP on fatal error Sean V Kelley
2020-10-02 18:47 ` [PATCH v8 10/14] PCI/RCEC: Add pcie_link_rcec() to associate RCiEPs Sean V Kelley
2020-10-02 18:47 ` [PATCH v8 11/14] PCI/RCEC: Add RCiEP's linked RCEC to AER/ERR Sean V Kelley
2020-10-09 17:57   ` Bjorn Helgaas
2020-10-09 18:26     ` Sean V Kelley
2020-10-09 18:34       ` Sean V Kelley
2020-10-09 18:53         ` Sean V Kelley
2020-10-09 19:48           ` Sean V Kelley [this message]
2020-10-09 21:30     ` Bjorn Helgaas
2020-10-09 22:07       ` Sean V Kelley
2020-10-09 23:51         ` Kelley, Sean V
2020-10-12 22:58           ` Bjorn Helgaas
2020-10-13 16:55             ` Sean V Kelley
2020-10-02 18:47 ` [PATCH v8 12/14] PCI/AER: Add pcie_walk_rcec() to RCEC AER handling Sean V Kelley
2020-10-02 18:47 ` [PATCH v8 13/14] PCI/PME: Add pcie_walk_rcec() to RCEC PME handling Sean V Kelley
2020-10-02 18:47 ` [PATCH v8 14/14] PCI/AER: Add RCEC AER error injection support Sean V Kelley
2020-10-09 15:53 ` [PATCH v8 00/14] Add RCEC handling to PCI/AER Bjorn Helgaas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5D41DFBA-CF5D-4DFD-A6B2-A28B621CB228@intel.com \
    --to=sean.v.kelley@intel.com \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=ashok.raj@intel.com \
    --cc=bhelgaas@google.com \
    --cc=helgaas@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=qiuxu.zhuo@intel.com \
    --cc=rafael.j.wysocki@intel.com \
    --cc=sathyanarayanan.kuppuswamy@intel.com \
    --cc=seanvk.dev@oregontracks.org \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.