All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Sean V Kelley" <sean.v.kelley@intel.com>
To: "Jonathan Cameron" <Jonathan.Cameron@Huawei.com>
Cc: bhelgaas@google.com, rjw@rjwysocki.net, ashok.raj@kernel.org,
	tony.luck@intel.com, sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 4/9] PCI/AER: Extend AER error handling to RCECs
Date: Mon, 27 Jul 2020 07:58:59 -0700	[thread overview]
Message-ID: <763CB36C-4D47-4A59-BBEF-577E48FD1A18@intel.com> (raw)
In-Reply-To: <20200727120019.000030d2@Huawei.com>

On 27 Jul 2020, at 4:00, Jonathan Cameron wrote:

> On Fri, 24 Jul 2020 10:22:18 -0700
> Sean V Kelley <sean.v.kelley@intel.com> wrote:
>
>> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>
>> Currently the kernel does not handle AER errors for Root Complex 
>> integrated
>> End Points (RCiEPs)[0]. These devices sit on a root bus within the 
>> Root Complex
>> (RC). AER handling is performed by a Root Complex Event Collector 
>> (RCEC) [1]
>> which is a effectively a type of RCiEP on the same root bus.
>>
>> For an RCEC (technically not a Bridge), error messages "received" 
>> from
>> associated RCiEPs must be enabled for "transmission" in order to 
>> cause a
>> System Error via the Root Control register or (when the Advanced 
>> Error
>> Reporting Capability is present) reporting via the Root Error Command
>> register and logging in the Root Error Status register and Error 
>> Source
>> Identification register.
>>
>> In addition to the defined OS level handling of the reset flow for 
>> the
>> associated RCiEPs of an RCEC, it is possible to also have a firmware 
>> first
>> model. In that case there is no need to take any actions on the RCEC 
>> because
>> the firmware is responsible for them. This is true where APEI [2] is 
>> used
>> to report the AER errors via a GHES[v2] HEST entry [3] and relevant
>> AER CPER record [4] and Firmware First handling is in use.
>>
>> We effectively end up with two different types of discovery for
>> purposes of handling AER errors:
>>
>> 1) Normal bus walk - we pass the downstream port above a bus to which
>> the device is attached and it walks everything below that point.
>>
>> 2) An RCiEP with no visible association with an RCEC as there is no 
>> need to
>> walk devices. In that case, the flow is to just call the callbacks 
>> for the actual
>> device.
>>
>> A new walk function, similar to pci_bus_walk is provided that takes a 
>> pci_dev
>> instead of a bus. If that dev corresponds to a downstream port it 
>> will walk
>> the subordinate bus of that downstream port. If the dev does not then 
>> it
>> will call the function on that device alone.
>>
>> [0] ACPI PCI Express Base Specification 5.0-1 1.3.2.3 Root Complex 
>> Integrated
>>     Endpoint Rules.
>> [1] ACPI PCI Express Base Specification 5.0-1 6.2 Error Signalling 
>> and Logging
>> [2] ACPI Specification 6.3 Chapter 18 ACPI Platform Error Interface 
>> (APEI)
>> [3] ACPI Specification 6.3 18.2.3.7 Generic Hardware Error Source
>> [4] UEFI Specification 2.8, N.2.7 PCI Express Error Section
>>
>> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Signed-off-by: Sean V Kelley <sean.v.kelley@intel.com>
>> ---
>> Changes since v2 [1]:
>>
>> - Renamed to pci_walk_dev_affected() to reflect the aer affected 
>> devices
> Make sense.
>
>> - Localized to err.c and made static
>
> Makes sense.
>
>> - Added check for RCEC to reflect
> That comment probably needs a bit more...

Will add to the details.

>
>> - Tightened up commit log from earlier inquiry focused RFC
> Cool.
>
>
> Looks good to me and I like the new naming.
>
> A few really trivial tidy ups suggested for things that were less than 
> neat in my patch.
>
> Jonathan
>
>>
>> [1] 
>> https://lore.kernel.org/linux-pci/20200622114402.892798-1-Jonathan.Cameron@huawei.com/
>> ---
>>  drivers/pci/pcie/err.c | 55 
>> ++++++++++++++++++++++++++++++++++--------
>>  1 file changed, 45 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 14bb8f54723e..044df004f20b 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -146,38 +146,69 @@ static int report_resume(struct pci_dev *dev, 
>> void *data)
>>  	return 0;
>>  }
>>
>> +/** pci_walk_dev_affected - walk devices potentially AER affected
> /**
>  * pci_walk_dev_affected
>
> There is a bit of a mixture in pci files between the two styles, but
> I'm fairly sure kernel-doc is supposed to be as I'm suggesting
> (I had this wrong due to cut and paste in earlier version!)

Will fix.

>
>> + *  @dev      device which may be an RCEC with associated RCiEPs,
>> + *            an RCiEP associated with an RCEC, or a Port.
>> + *  @cb       callback to be called for each device found
>> + *  @userdata arbitrary pointer to be passed to callback.
>> + *
>> + *  If the device provided is a port, walk the subordinate bus,
>> + *  including any bridged devices on buses under this bus.
>> + *  Call the provided callback on each device found.
>> + *
>> + *  If the device provided has no subordinate bus, call the provided
>> + *  callback on the device itself.
>> + *
>
> I also had an ugly pointless newline here. oops :)

Will fix.

Thanks,

Sean

>
>> + */
>> +static void pci_walk_dev_affected(struct pci_dev *dev, int 
>> (*cb)(struct pci_dev *, void *),
>> +				  void *userdata)
>> +{
>> +	if (dev->subordinate) {
>> +		pci_walk_bus(dev->subordinate, cb, userdata);
>> +	} else {
>> +		cb(dev, userdata);
>> +	}
>> +}
>> +
>>  pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>  			enum pci_channel_state state,
>>  			pci_ers_result_t (*reset_link)(struct pci_dev *pdev))
>>  {
>>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>> -	struct pci_bus *bus;
>>
>>  	/*
>>  	 * Error recovery runs on all subordinates of the first downstream 
>> port.
>>  	 * If the downstream port detected the error, it is cleared at the 
>> end.
>> +	 * For RCiEPs we should reset just the RCiEP itself.
>>  	 */
>>  	if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
>> -	      pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM))
>> +	      pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM ||
>> +	      pci_pcie_type(dev) == PCI_EXP_TYPE_RC_END ||
>> +	      pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC))
>>  		dev = dev->bus->self;
>> -	bus = dev->subordinate;
>>
>>  	pci_dbg(dev, "broadcast error_detected message\n");
>>  	if (state == pci_channel_io_frozen) {
>> -		pci_walk_bus(bus, report_frozen_detected, &status);
>> +		pci_walk_dev_affected(dev, report_frozen_detected, &status);
>> +		if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_END) {
>> +			pci_warn(dev, "link reset not possible for RCiEP\n");
>> +			status = PCI_ERS_RESULT_NONE;
>> +			goto failed;
>> +		}
>> +
>>  		status = reset_link(dev);
>>  		if (status != PCI_ERS_RESULT_RECOVERED) {
>>  			pci_warn(dev, "link reset failed\n");
>>  			goto failed;
>>  		}
>>  	} else {
>> -		pci_walk_bus(bus, report_normal_detected, &status);
>> +		pci_walk_dev_affected(dev, report_normal_detected, &status);
>>  	}
>>
>>  	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
>>  		status = PCI_ERS_RESULT_RECOVERED;
>>  		pci_dbg(dev, "broadcast mmio_enabled message\n");
>> -		pci_walk_bus(bus, report_mmio_enabled, &status);
>> +		pci_walk_dev_affected(dev, report_mmio_enabled, &status);
>>  	}
>>
>>  	if (status == PCI_ERS_RESULT_NEED_RESET) {
>> @@ -188,17 +219,21 @@ pci_ers_result_t pcie_do_recovery(struct 
>> pci_dev *dev,
>>  		 */
>>  		status = PCI_ERS_RESULT_RECOVERED;
>>  		pci_dbg(dev, "broadcast slot_reset message\n");
>> -		pci_walk_bus(bus, report_slot_reset, &status);
>> +		pci_walk_dev_affected(dev, report_slot_reset, &status);
>>  	}
>>
>>  	if (status != PCI_ERS_RESULT_RECOVERED)
>>  		goto failed;
>>
>>  	pci_dbg(dev, "broadcast resume message\n");
>> -	pci_walk_bus(bus, report_resume, &status);
>> +	pci_walk_dev_affected(dev, report_resume, &status);
>>
>> -	pci_aer_clear_device_status(dev);
>> -	pci_aer_clear_nonfatal_status(dev);
>> +	if ((pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
>> +	     pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM ||
>> +	     pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)) {
>> +		pci_aer_clear_device_status(dev);
>> +		pci_aer_clear_nonfatal_status(dev);
>> +	}
>>  	pci_info(dev, "device recovery successful\n");
>>  	return status;
>>

  reply	other threads:[~2020-07-27 14:59 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-07-24 17:22 [RFC PATCH 0/9] Add RCEC handling to PCI/AER Sean V Kelley
2020-07-24 17:22 ` [RFC PATCH 1/9] pci_ids: Add class code and extended capability for RCEC Sean V Kelley
2020-07-27 10:00   ` Jonathan Cameron
2020-07-27 10:21     ` Jonathan Cameron
2020-07-27 15:22       ` Sean V Kelley
2020-07-24 17:22 ` [RFC PATCH 2/9] PCI: Extend Root Port Driver to support RCEC Sean V Kelley
2020-07-27 12:30   ` Jonathan Cameron
2020-07-27 15:05     ` Sean V Kelley
2020-07-24 17:22 ` [RFC PATCH 3/9] PCI/portdrv: Add pcie_walk_rcec() to walk RCiEPs associated with RCEC Sean V Kelley
2020-07-27 10:49   ` Jonathan Cameron
2020-07-27 15:21     ` Sean V Kelley
2020-07-24 17:22 ` [RFC PATCH 4/9] PCI/AER: Extend AER error handling to RCECs Sean V Kelley
2020-07-27 11:00   ` Jonathan Cameron
2020-07-27 14:58     ` Sean V Kelley [this message]
2020-07-27 14:04   ` Jonathan Cameron
2020-07-27 15:00     ` Sean V Kelley
2020-07-24 17:22 ` [RFC PATCH 5/9] PCI/AER: Apply function level reset to RCiEP on fatal error Sean V Kelley
2020-07-25 10:05   ` kernel test robot
2020-07-27 11:17   ` Jonathan Cameron
2020-07-28 13:27     ` Zhuo, Qiuxu
2020-07-28 16:14       ` Sean V Kelley
2020-07-28 17:02         ` Jonathan Cameron
2020-07-28 17:42           ` Sean V Kelley
2020-07-24 17:22 ` [RFC PATCH 6/9] PCI: Add 'rcec' field to pci_dev for associated RCiEPs Sean V Kelley
2020-07-27 11:23   ` Jonathan Cameron
2020-07-27 15:39     ` Sean V Kelley
2020-07-27 16:11     ` Jonathan Cameron
2020-07-27 16:28       ` Sean V Kelley
2020-07-24 17:22 ` [RFC PATCH 7/9] PCI/AER: Add RCEC AER handling Sean V Kelley
2020-07-27 12:22   ` Jonathan Cameron
2020-07-27 15:19     ` Sean V Kelley
2020-07-27 17:14       ` Jonathan Cameron
2020-07-24 17:22 ` [RFC PATCH 8/9] PCI/PME: Add RCEC PME handling Sean V Kelley
2020-08-04  8:35   ` Jay Fang
2020-08-04  9:47     ` Jonathan Cameron
2020-07-24 17:22 ` [RFC PATCH 9/9] PCI/AER: Add RCEC AER error injection support Sean V Kelley
2020-07-27 12:37 ` [RFC PATCH 0/9] Add RCEC handling to PCI/AER Jonathan Cameron
2020-07-27 14:56   ` Sean V Kelley

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=763CB36C-4D47-4A59-BBEF-577E48FD1A18@intel.com \
    --to=sean.v.kelley@intel.com \
    --cc=Jonathan.Cameron@Huawei.com \
    --cc=ashok.raj@kernel.org \
    --cc=bhelgaas@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=rjw@rjwysocki.net \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.