linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported
@ 2017-12-19 21:06 Keith Busch
  2017-12-19 21:06 ` [PATCH 2/4] PCI/AER: Provide API for getting AER information Keith Busch
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Keith Busch @ 2017-12-19 21:06 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Maik Broemme; +Cc: Keith Busch

Getting the AER information is documented to return 0 when it failed to
get the information.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/aer/aerdrv_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 744805232155..ea0dc1cc7fc7 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -660,7 +660,7 @@ static int get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 
 	/* The device might not support AER */
 	if (!pos)
-		return 1;
+		return 0;
 
 	if (info->severity == AER_CORRECTABLE) {
 		pci_read_config_dword(dev, pos + PCI_ERR_COR_STATUS,
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/4] PCI/AER: Provide API for getting AER information
  2017-12-19 21:06 [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported Keith Busch
@ 2017-12-19 21:06 ` Keith Busch
  2017-12-19 21:06 ` [PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER Keith Busch
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2017-12-19 21:06 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Maik Broemme; +Cc: Keith Busch

Since this is making the function externally visible, appending "aer_"
prefix to the function name.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/aer/aerdrv.h      | 1 +
 drivers/pci/pcie/aer/aerdrv_core.c | 8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index 5449e5ce139d..7c833a1d897e 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -107,6 +107,7 @@ static inline pci_ers_result_t merge_result(enum pci_ers_result orig,
 }
 
 extern struct bus_type pcie_port_bus_type;
+int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
 void aer_isr(struct work_struct *work);
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
 void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info);
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index ea0dc1cc7fc7..4ba1fbd4700b 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -640,7 +640,7 @@ static void aer_recover_work_func(struct work_struct *work)
 #endif
 
 /**
- * get_device_error_info - read error status from dev and store it to info
+ * aer_get_device_error_info - read error status from dev and store it to info
  * @dev: pointer to the device expected to have a error record
  * @info: pointer to structure to store the error record
  *
@@ -648,7 +648,7 @@ static void aer_recover_work_func(struct work_struct *work)
  *
  * Note that @info is reused among all error devices. Clear fields properly.
  */
-static int get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
+int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 {
 	int pos, temp;
 
@@ -707,11 +707,11 @@ static inline void aer_process_err_devices(struct pcie_device *p_device,
 
 	/* Report all before handle them, not to lost records by reset etc. */
 	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
-		if (get_device_error_info(e_info->dev[i], e_info))
+		if (aer_get_device_error_info(e_info->dev[i], e_info))
 			aer_print_error(e_info->dev[i], e_info);
 	}
 	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
-		if (get_device_error_info(e_info->dev[i], e_info))
+		if (aer_get_device_error_info(e_info->dev[i], e_info))
 			handle_error_source(p_device, e_info->dev[i], e_info);
 	}
 }
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER
  2017-12-19 21:06 [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported Keith Busch
  2017-12-19 21:06 ` [PATCH 2/4] PCI/AER: Provide API for getting AER information Keith Busch
@ 2017-12-19 21:06 ` Keith Busch
  2018-01-15 14:43   ` Sinan Kaya
  2017-12-19 21:06 ` [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling Keith Busch
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2017-12-19 21:06 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Maik Broemme; +Cc: Keith Busch

The PCI Express Base Specification's implementation note on "Determination
of DPC Control" recommends the operating system always link DPC control to
the control of AER, as the two functionalities are strongly connected. To
avoid conflicts over whether platform firmware or the OS control DPC,
this patch enables DPC only if AER is enabled in the OS, and the device's
error handling does not have a firmware-first AER handling.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/Kconfig        | 2 +-
 drivers/pci/pcie/pcie-dpc.c     | 4 ++++
 drivers/pci/pcie/portdrv_core.c | 4 ++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
index ac53edbc9613..d658dfa53b87 100644
--- a/drivers/pci/pcie/Kconfig
+++ b/drivers/pci/pcie/Kconfig
@@ -92,7 +92,7 @@ config PCIE_PME
 
 config PCIE_DPC
 	bool "PCIe Downstream Port Containment support"
-	depends on PCIEPORTBUS
+	depends on PCIEPORTBUS && PCIEAER
 	default n
 	help
 	  This enables PCI Express Downstream Port Containment (DPC)
diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
index 2d976a623ddc..ef71a472592c 100644
--- a/drivers/pci/pcie/pcie-dpc.c
+++ b/drivers/pci/pcie/pcie-dpc.c
@@ -15,6 +15,7 @@
 #include <linux/pci.h>
 #include <linux/pcieport_if.h>
 #include "../pci.h"
+#include "aer/aerdrv.h"
 
 struct rp_pio_header_log_regs {
 	u32 dw0;
@@ -289,6 +290,9 @@ static int dpc_probe(struct pcie_device *dev)
 	int status;
 	u16 ctl, cap;
 
+	if (pcie_aer_get_firmware_first(pdev))
+		return -ENOTSUPP;
+
 	dpc = devm_kzalloc(device, sizeof(*dpc), GFP_KERNEL);
 	if (!dpc)
 		return -ENOMEM;
diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
index a59210350c44..ef3bad4ad010 100644
--- a/drivers/pci/pcie/portdrv_core.c
+++ b/drivers/pci/pcie/portdrv_core.c
@@ -216,9 +216,9 @@ static int get_port_device_capability(struct pci_dev *dev)
 		return 0;
 
 	cap_mask = PCIE_PORT_SERVICE_PME | PCIE_PORT_SERVICE_HP
-			| PCIE_PORT_SERVICE_VC | PCIE_PORT_SERVICE_DPC;
+			| PCIE_PORT_SERVICE_VC;
 	if (pci_aer_available())
-		cap_mask |= PCIE_PORT_SERVICE_AER;
+		cap_mask |= PCIE_PORT_SERVICE_AER | PCIE_PORT_SERVICE_DPC;
 
 	if (pcie_ports_auto)
 		pcie_port_platform_notify(dev, &cap_mask);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
  2017-12-19 21:06 [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported Keith Busch
  2017-12-19 21:06 ` [PATCH 2/4] PCI/AER: Provide API for getting AER information Keith Busch
  2017-12-19 21:06 ` [PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER Keith Busch
@ 2017-12-19 21:06 ` Keith Busch
  2017-12-21  4:43   ` Sinan Kaya
  2018-01-12 23:03   ` Bjorn Helgaas
  2017-12-21  1:04 ` [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported Bjorn Helgaas
  2018-03-20 23:02 ` Bjorn Helgaas
  4 siblings, 2 replies; 24+ messages in thread
From: Keith Busch @ 2017-12-19 21:06 UTC (permalink / raw)
  To: linux-pci, Bjorn Helgaas, Maik Broemme; +Cc: Keith Busch

A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
which prevents the AER handler from reporting the details of the
error. This patch will have the DPC driver get the AER status registers
from the downstream port that detected the uncorrectable error, and
print out additional information.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/pcie-dpc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
index ef71a472592c..aefc4fbdcef0 100644
--- a/drivers/pci/pcie/pcie-dpc.c
+++ b/drivers/pci/pcie/pcie-dpc.c
@@ -44,6 +44,7 @@ struct dpc_dev {
 	int			cap_pos;
 	bool			rp;
 	u32			rp_pio_status;
+	struct aer_err_info	info;
 };
 
 static const char * const rp_pio_error_string[] = {
@@ -111,6 +112,9 @@ static void interrupt_event_handler(struct work_struct *work)
 	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
 	struct pci_bus *parent = pdev->subordinate;
 
+	if (aer_get_device_error_info(pdev, &dpc->info))
+		aer_print_error(pdev, &dpc->info);
+
 	pci_lock_rescan_remove();
 	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
 					 bus_list) {
@@ -275,6 +279,10 @@ static irqreturn_t dpc_irq(int irq, void *context)
 		/* show RP PIO error detail information */
 		if (reason == 3 && ext_reason == 0)
 			dpc_process_rp_pio_error(dpc);
+		if (reason == 2)
+			dpc->info.severity = AER_FATAL;
+		else
+			dpc->info.severity = AER_NONFATAL;
 
 		schedule_work(&dpc->work);
 	}
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported
  2017-12-19 21:06 [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported Keith Busch
                   ` (2 preceding siblings ...)
  2017-12-19 21:06 ` [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling Keith Busch
@ 2017-12-21  1:04 ` Bjorn Helgaas
  2017-12-21  3:53   ` Dongdong Liu
  2017-12-21 14:59   ` Keith Busch
  2018-03-20 23:02 ` Bjorn Helgaas
  4 siblings, 2 replies; 24+ messages in thread
From: Bjorn Helgaas @ 2017-12-21  1:04 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Maik Broemme

s/approrpiate/appropriate/

(But maybe should be restated altogether, see below.)

On Tue, Dec 19, 2017 at 02:06:40PM -0700, Keith Busch wrote:
> Getting the AER information is documented to return 0 when it failed to
> get the information.

I think this case is either impossible (if we only call this function
for devices known to support AER), or it fixes an actual bug (the
caller would call aer_print_error() when it shouldn't, and potentially
print garbage).  Right?

If the former, I vote for removing the test.  If the latter, the
changelog should mention that it fixes a bug.

> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/pci/pcie/aer/aerdrv_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index 744805232155..ea0dc1cc7fc7 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -660,7 +660,7 @@ static int get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>  
>  	/* The device might not support AER */
>  	if (!pos)
> -		return 1;
> +		return 0;
>  
>  	if (info->severity == AER_CORRECTABLE) {
>  		pci_read_config_dword(dev, pos + PCI_ERR_COR_STATUS,
> -- 
> 2.13.6
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported
  2017-12-21  1:04 ` [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported Bjorn Helgaas
@ 2017-12-21  3:53   ` Dongdong Liu
  2017-12-21 14:59   ` Keith Busch
  1 sibling, 0 replies; 24+ messages in thread
From: Dongdong Liu @ 2017-12-21  3:53 UTC (permalink / raw)
  To: Bjorn Helgaas, Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Maik Broemme

Current code APEI AER and native AER seems a little different.
APEI AER ,the device must have AER capability, then can do recovery.
Native AER seems the device under the ROOT PORT can do not have the
AER capability. Current patch seems can fix the Inconsistency.
But as PCIe 4.0r1.0 6.2.5 Sequence of Device Error Signaling and Logging
Operations describe,It seems the device can implement the Device Status reg,
but do not have AER capability.So for such devices, can we use AER driver to
call device error handler callback to do recovery?

Thanks,
Dongdong
在 2017/12/21 9:04, Bjorn Helgaas 写道:
> s/approrpiate/appropriate/
>
> (But maybe should be restated altogether, see below.)
>
> On Tue, Dec 19, 2017 at 02:06:40PM -0700, Keith Busch wrote:
>> Getting the AER information is documented to return 0 when it failed to
>> get the information.
>
> I think this case is either impossible (if we only call this function
> for devices known to support AER), or it fixes an actual bug (the
> caller would call aer_print_error() when it shouldn't, and potentially
> print garbage).  Right?
>
> If the former, I vote for removing the test.  If the latter, the
> changelog should mention that it fixes a bug.
>
>> Signed-off-by: Keith Busch <keith.busch@intel.com>
>> ---
>>  drivers/pci/pcie/aer/aerdrv_core.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
>> index 744805232155..ea0dc1cc7fc7 100644
>> --- a/drivers/pci/pcie/aer/aerdrv_core.c
>> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
>> @@ -660,7 +660,7 @@ static int get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>>
>>  	/* The device might not support AER */
>>  	if (!pos)
>> -		return 1;
>> +		return 0;
>>
>>  	if (info->severity == AER_CORRECTABLE) {
>>  		pci_read_config_dword(dev, pos + PCI_ERR_COR_STATUS,
>> --
>> 2.13.6
>>
>
> .
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
  2017-12-19 21:06 ` [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling Keith Busch
@ 2017-12-21  4:43   ` Sinan Kaya
  2017-12-21  5:12     ` Keith Busch
  2018-01-12 23:03   ` Bjorn Helgaas
  1 sibling, 1 reply; 24+ messages in thread
From: Sinan Kaya @ 2017-12-21  4:43 UTC (permalink / raw)
  To: Keith Busch, linux-pci, Bjorn Helgaas, Maik Broemme, Pawandeep Oza

+Oza

On 12/19/2017 4:06 PM, Keith Busch wrote:
> A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
> which prevents the AER handler from reporting the details of the
> error. This patch will have the DPC driver get the AER status registers
> from the downstream port that detected the uncorrectable error, and
> print out additional information.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---

Oza is doing some restructuring to unify DPC and AER error handling path per
feedback from Bjorn. It is almost done. He is testing it. 

Can this patch wait until you review his version? I'm thinking this could
be something that can be added to his series instead.

Coming back to this patch. The interrupt number for DPC and AER could be the
same or different. According to the spec, AER errors are always reported
regardless of DPC driver presence (see the famous flow chart).

If the interrupt IDs are the same for AER and DPC, your patch would introduce
double printing for AER errors.

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
  2017-12-21  4:43   ` Sinan Kaya
@ 2017-12-21  5:12     ` Keith Busch
  2018-01-10 15:52       ` Sinan Kaya
  0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2017-12-21  5:12 UTC (permalink / raw)
  To: Sinan Kaya; +Cc: linux-pci, Bjorn Helgaas, Maik Broemme, Pawandeep Oza

On Wed, Dec 20, 2017 at 11:43:07PM -0500, Sinan Kaya wrote:
> +Oza
> 
> On 12/19/2017 4:06 PM, Keith Busch wrote:
> > A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
> > which prevents the AER handler from reporting the details of the
> > error. This patch will have the DPC driver get the AER status registers
> > from the downstream port that detected the uncorrectable error, and
> > print out additional information.
> > 
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > ---
> 
> Oza is doing some restructuring to unify DPC and AER error handling path per
> feedback from Bjorn. It is almost done. He is testing it. 
> 
> Can this patch wait until you review his version? I'm thinking this could
> be something that can be added to his series instead.

No problem.
 
> Coming back to this patch. The interrupt number for DPC and AER could be the
> same or different. 

Only if you're talking about root ports. DPC is also a feature of
switch downstream ports, which don't generate interrupts on AER events
(they don't have a Root Error Command register).

> According to the spec, AER errors are always reported
> regardless of DPC driver presence (see the famous flow chart).

> If the interrupt IDs are the same for AER and DPC, your patch would introduce
> double printing for AER errors.

The AER Uncorrectable Status Register of the detecting port would
indeed be set with the appropriate status if that's the type of error
that occured, but when DPC is enabled, the root port never observes an
ERR_FATAL/ERR_NONFATAL message required for it to get set the Root Error
Status Register. The Linux AER handler requires the Root Error Status
be set in order for it to print anything, so I don't think we're at risk
of double printing with this patch even if the root port is DPC capable.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported
  2017-12-21  1:04 ` [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported Bjorn Helgaas
  2017-12-21  3:53   ` Dongdong Liu
@ 2017-12-21 14:59   ` Keith Busch
  1 sibling, 0 replies; 24+ messages in thread
From: Keith Busch @ 2017-12-21 14:59 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Bjorn Helgaas, Maik Broemme

On Wed, Dec 20, 2017 at 05:04:52PM -0800, Bjorn Helgaas wrote:
> On Tue, Dec 19, 2017 at 02:06:40PM -0700, Keith Busch wrote:
> > Getting the AER information is documented to return 0 when it failed to
> > get the information.
> 
> I think this case is either impossible (if we only call this function
> for devices known to support AER), or it fixes an actual bug (the
> caller would call aer_print_error() when it shouldn't, and potentially
> print garbage).  Right?
> 
> If the former, I vote for removing the test.  If the latter, the
> changelog should mention that it fixes a bug.

I just spotted this mistake when I was changing the function to
non-static. I've never observed bogus data printed in real life.

In the current AER handling, the only way we could incorrectly get there
is if the root port's Error Source ID Register somehow contains the ID
of a device that doesn't have AER capabilities, which would of course
be broken.

So if we trust hardware, we should are safe to remove the check.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
  2017-12-21  5:12     ` Keith Busch
@ 2018-01-10 15:52       ` Sinan Kaya
  2018-01-16  2:47         ` Keith Busch
  0 siblings, 1 reply; 24+ messages in thread
From: Sinan Kaya @ 2018-01-10 15:52 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Maik Broemme, Pawandeep Oza

Hi Keith,

On 12/21/2017 12:12 AM, Keith Busch wrote:
> On Wed, Dec 20, 2017 at 11:43:07PM -0500, Sinan Kaya wrote:
>> +Oza
>>
>> On 12/19/2017 4:06 PM, Keith Busch wrote:
>>> A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
>>> which prevents the AER handler from reporting the details of the
>>> error. This patch will have the DPC driver get the AER status registers
>>> from the downstream port that detected the uncorrectable error, and
>>> print out additional information.
>>>
>>> Signed-off-by: Keith Busch <keith.busch@intel.com>
>>> ---
>>
>> Oza is doing some restructuring to unify DPC and AER error handling path per
>> feedback from Bjorn. It is almost done. He is testing it. 
>>
>> Can this patch wait until you review his version? I'm thinking this could
>> be something that can be added to his series instead.
> 
> No problem.
>  
>> Coming back to this patch. The interrupt number for DPC and AER could be the
>> same or different. 
> 
> Only if you're talking about root ports. DPC is also a feature of
> switch downstream ports, which don't generate interrupts on AER events
> (they don't have a Root Error Command register).
> 
>> According to the spec, AER errors are always reported
>> regardless of DPC driver presence (see the famous flow chart).
> 
>> If the interrupt IDs are the same for AER and DPC, your patch would introduce
>> double printing for AER errors.
> 
> The AER Uncorrectable Status Register of the detecting port would
> indeed be set with the appropriate status if that's the type of error
> that occured, but when DPC is enabled, the root port never observes an
> ERR_FATAL/ERR_NONFATAL message required for it to get set the Root Error
> Status Register. The Linux AER handler requires the Root Error Status
> be set in order for it to print anything, so I don't think we're at risk
> of double printing with this patch even if the root port is DPC capable.
> 

Coming back to this. Oza posted his patch that integrates DPC into PORTDRV similar
to AER driver. 

"[PATCH v3 0/4] Address error and recovery for AER and DPC"

We are looking for feedback on the series.

The idea in a nut shell is to collect all endpoint error recovery code into a
new file called pcie-err.c and then invoke callbacks before DPC driver shuts down
the currently active driver.

There is however, still one more problem that has not been tacked.

Since the AER status is set when we observe DPC event and nobody is clearing these
we won't observe another DPC event until somebody clears these. We can say that
we are resetting the endpoints as part of the DPC but we are not touching the
switch downstream port or the root port registers.

Somebody still needs to clear these in addition to printing whatever information
is available in the AER registers.

Do you agree?

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
  2017-12-19 21:06 ` [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling Keith Busch
  2017-12-21  4:43   ` Sinan Kaya
@ 2018-01-12 23:03   ` Bjorn Helgaas
  2018-01-14  1:35     ` Keith Busch
  1 sibling, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2018-01-12 23:03 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Maik Broemme

Hi Keith,

On Tue, Dec 19, 2017 at 02:06:43PM -0700, Keith Busch wrote:
> A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
> which prevents the AER handler from reporting the details of the
> error. This patch will have the DPC driver get the AER status registers
> from the downstream port that detected the uncorrectable error, and
> print out additional information.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/pci/pcie/pcie-dpc.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
> index ef71a472592c..aefc4fbdcef0 100644
> --- a/drivers/pci/pcie/pcie-dpc.c
> +++ b/drivers/pci/pcie/pcie-dpc.c
> @@ -44,6 +44,7 @@ struct dpc_dev {
>  	int			cap_pos;
>  	bool			rp;
>  	u32			rp_pio_status;
> +	struct aer_err_info	info;
>  };
>  
>  static const char * const rp_pio_error_string[] = {
> @@ -111,6 +112,9 @@ static void interrupt_event_handler(struct work_struct *work)
>  	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
>  	struct pci_bus *parent = pdev->subordinate;
>  
> +	if (aer_get_device_error_info(pdev, &dpc->info))
> +		aer_print_error(pdev, &dpc->info);

I'm confused about this.  "pdev" is the DPC-enabled Port, i.e., the
Port that received an ERR_FATAL or ERR_NONFATAL Message.  The Message
was generated by something below "pdev", and that's where the
interesting AER logging would have been done.

This patch suggests that the DPC port itself would have something
interesting logged in its AER capability.  Is that really true?  I see
in sec 6.2.10 (PCIe r4.0) that the DPC port should log the Trigger
Reason and the Error Source ID from the Message, but I don't see
anything about logging anything in its AER registers.

>  	pci_lock_rescan_remove();
>  	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
>  					 bus_list) {
> @@ -275,6 +279,10 @@ static irqreturn_t dpc_irq(int irq, void *context)
>  		/* show RP PIO error detail information */
>  		if (reason == 3 && ext_reason == 0)
>  			dpc_process_rp_pio_error(dpc);
> +		if (reason == 2)
> +			dpc->info.severity = AER_FATAL;
> +		else
> +			dpc->info.severity = AER_NONFATAL;
>  
>  		schedule_work(&dpc->work);
>  	}
> -- 
> 2.13.6
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
  2018-01-12 23:03   ` Bjorn Helgaas
@ 2018-01-14  1:35     ` Keith Busch
  2018-01-15 14:32       ` Sinan Kaya
  2018-01-17  0:14       ` Bjorn Helgaas
  0 siblings, 2 replies; 24+ messages in thread
From: Keith Busch @ 2018-01-14  1:35 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Bjorn Helgaas, Maik Broemme

On Fri, Jan 12, 2018 at 05:03:03PM -0600, Bjorn Helgaas wrote:
> Hi Keith,
> 
> On Tue, Dec 19, 2017 at 02:06:43PM -0700, Keith Busch wrote:
> > A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
> > which prevents the AER handler from reporting the details of the
> > error. This patch will have the DPC driver get the AER status registers
> > from the downstream port that detected the uncorrectable error, and
> > print out additional information.
> > 
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > ---
> >  drivers/pci/pcie/pcie-dpc.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
> > index ef71a472592c..aefc4fbdcef0 100644
> > --- a/drivers/pci/pcie/pcie-dpc.c
> > +++ b/drivers/pci/pcie/pcie-dpc.c
> > @@ -44,6 +44,7 @@ struct dpc_dev {
> >  	int			cap_pos;
> >  	bool			rp;
> >  	u32			rp_pio_status;
> > +	struct aer_err_info	info;
> >  };
> >  
> >  static const char * const rp_pio_error_string[] = {
> > @@ -111,6 +112,9 @@ static void interrupt_event_handler(struct work_struct *work)
> >  	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
> >  	struct pci_bus *parent = pdev->subordinate;
> >  
> > +	if (aer_get_device_error_info(pdev, &dpc->info))
> > +		aer_print_error(pdev, &dpc->info);
> 
> I'm confused about this.  "pdev" is the DPC-enabled Port, i.e., the
> Port that received an ERR_FATAL or ERR_NONFATAL Message.  The Message
> was generated by something below "pdev", and that's where the
> interesting AER logging would have been done.
> 
> This patch suggests that the DPC port itself would have something
> interesting logged in its AER capability.  Is that really true?  I see
> in sec 6.2.10 (PCIe r4.0) that the DPC port should log the Trigger
> Reason and the Error Source ID from the Message, but I don't see
> anything about logging anything in its AER registers.

If the trigger reason is "unmasked uncorrectable error", then the DPC
AER register has useful information. All other reasons would not provide
useful data in the AER register.


I would assume that aer_get_device_error_info would return false if
it's one of other reasons, but I can fix this up to check the trigger
reasoning rather than unconditionally reading the AER status

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
  2018-01-14  1:35     ` Keith Busch
@ 2018-01-15 14:32       ` Sinan Kaya
  2018-01-17  0:36         ` Bjorn Helgaas
  2018-01-17  0:14       ` Bjorn Helgaas
  1 sibling, 1 reply; 24+ messages in thread
From: Sinan Kaya @ 2018-01-15 14:32 UTC (permalink / raw)
  To: Keith Busch, Bjorn Helgaas; +Cc: linux-pci, Bjorn Helgaas, Maik Broemme

On 1/13/2018 8:35 PM, Keith Busch wrote:
> On Fri, Jan 12, 2018 at 05:03:03PM -0600, Bjorn Helgaas wrote:
>> Hi Keith,
>>
>> On Tue, Dec 19, 2017 at 02:06:43PM -0700, Keith Busch wrote:
>>> A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
>>> which prevents the AER handler from reporting the details of the
>>> error. This patch will have the DPC driver get the AER status registers
>>> from the downstream port that detected the uncorrectable error, and
>>> print out additional information.
>>>
>>> Signed-off-by: Keith Busch <keith.busch@intel.com>
>>> ---
>>>  drivers/pci/pcie/pcie-dpc.c | 8 ++++++++
>>>  1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
>>> index ef71a472592c..aefc4fbdcef0 100644
>>> --- a/drivers/pci/pcie/pcie-dpc.c
>>> +++ b/drivers/pci/pcie/pcie-dpc.c
>>> @@ -44,6 +44,7 @@ struct dpc_dev {
>>>  	int			cap_pos;
>>>  	bool			rp;
>>>  	u32			rp_pio_status;
>>> +	struct aer_err_info	info;
>>>  };
>>>  
>>>  static const char * const rp_pio_error_string[] = {
>>> @@ -111,6 +112,9 @@ static void interrupt_event_handler(struct work_struct *work)
>>>  	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
>>>  	struct pci_bus *parent = pdev->subordinate;
>>>  
>>> +	if (aer_get_device_error_info(pdev, &dpc->info))
>>> +		aer_print_error(pdev, &dpc->info);
>>
>> I'm confused about this.  "pdev" is the DPC-enabled Port, i.e., the
>> Port that received an ERR_FATAL or ERR_NONFATAL Message.  The Message
>> was generated by something below "pdev", and that's where the
>> interesting AER logging would have been done.
>>
>> This patch suggests that the DPC port itself would have something
>> interesting logged in its AER capability.  Is that really true?  I see
>> in sec 6.2.10 (PCIe r4.0) that the DPC port should log the Trigger
>> Reason and the Error Source ID from the Message, but I don't see
>> anything about logging anything in its AER registers.
> 
> If the trigger reason is "unmasked uncorrectable error", then the DPC
> AER register has useful information. All other reasons would not provide
> useful data in the AER register.
> 
> 
> I would assume that aer_get_device_error_info would return false if
> it's one of other reasons, but I can fix this up to check the trigger
> reasoning rather than unconditionally reading the AER status
> 

https://pcisig.com/sites/default/files/specification_documents/ECN_DPC_2012-02-09_finalized.pdf

If you look at figure 6-2, AER errors are always logged regardless of the DPC
presence.

That's why, I was suggesting Keith that code should also clear the AER information
when such an event happens. Otherwise, same event won't trigger again.


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER
  2017-12-19 21:06 ` [PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER Keith Busch
@ 2018-01-15 14:43   ` Sinan Kaya
  2018-01-16  1:33     ` Keith Busch
  0 siblings, 1 reply; 24+ messages in thread
From: Sinan Kaya @ 2018-01-15 14:43 UTC (permalink / raw)
  To: Keith Busch, linux-pci, Bjorn Helgaas, Maik Broemme

On 12/19/2017 4:06 PM, Keith Busch wrote:
> @@ -289,6 +290,9 @@ static int dpc_probe(struct pcie_device *dev)
>  	int status;
>  	u16 ctl, cap;
>  
> +	if (pcie_aer_get_firmware_first(pdev))
> +		return -ENOTSUPP;
> +

There are two ways to support firmware first handling along with DPC.

The first one is to tie DPC handling to the firmware first enable.

The second one is to enable DPC ERR_COR signalling so that firmware
gets notified on each DPC event occurrence.

While the first one gives more control to the firmware, I think it beats
the purpose of the DPC. The first approach requires firmware to do some
"pre-processing" before notifying operating system of a failure.

The goal of the DPC is to put hardware into safe state when a PCIe error
happens. The best software recovery following this is to notify endpoint
drivers of failures and shutdown threads/processes accessing the hardware
as quick as possible.

The firmware-first event notification is through ACPI GHES and firmware injects
an artifical uncorrected AER error to the operating system. Once OS gets
notified, it tries to recover drivers through AER fatal error recovery mechanism.

While the semantics of this path is clearly defined in ACPI, it is also known
to be slow as well. During the time firmware does its business, operating
system still could be trying to access the endpoint address space.

My suggestion is to enable ERR_COR signalling so firmware gets a notification
on each DPC event for logging purposes. 

OS handles DPC natively and tries to recover hardware without any external
influence.

Sinan
-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER
  2018-01-15 14:43   ` Sinan Kaya
@ 2018-01-16  1:33     ` Keith Busch
  2018-01-16  3:04       ` Sinan Kaya
  0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2018-01-16  1:33 UTC (permalink / raw)
  To: Sinan Kaya; +Cc: linux-pci, Bjorn Helgaas, Maik Broemme

On Mon, Jan 15, 2018 at 09:43:22AM -0500, Sinan Kaya wrote:
> On 12/19/2017 4:06 PM, Keith Busch wrote:
> > @@ -289,6 +290,9 @@ static int dpc_probe(struct pcie_device *dev)
> >  	int status;
> >  	u16 ctl, cap;
> >  
> > +	if (pcie_aer_get_firmware_first(pdev))
> > +		return -ENOTSUPP;
> > +
> 
> There are two ways to support firmware first handling along with DPC.
> 
> The first one is to tie DPC handling to the firmware first enable.
> 
> The second one is to enable DPC ERR_COR signalling so that firmware
> gets notified on each DPC event occurrence.
> 
> While the first one gives more control to the firmware, I think it beats
> the purpose of the DPC. The first approach requires firmware to do some
> "pre-processing" before notifying operating system of a failure.
> 
> The goal of the DPC is to put hardware into safe state when a PCIe error
> happens. The best software recovery following this is to notify endpoint
> drivers of failures and shutdown threads/processes accessing the hardware
> as quick as possible.
> 
> The firmware-first event notification is through ACPI GHES and firmware injects
> an artifical uncorrected AER error to the operating system. Once OS gets
> notified, it tries to recover drivers through AER fatal error recovery mechanism.
> 
> While the semantics of this path is clearly defined in ACPI, it is also known
> to be slow as well. During the time firmware does its business, operating
> system still could be trying to access the endpoint address space.
> 
> My suggestion is to enable ERR_COR signalling so firmware gets a notification
> on each DPC event for logging purposes. 
> 
> OS handles DPC natively and tries to recover hardware without any external
> influence.

I see what you're saying, but if a device has a firmware first policy,
doesn't that mean firmware owns the DPC Control register? The OS shouldn't
be mucking with it in that case, right?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
  2018-01-10 15:52       ` Sinan Kaya
@ 2018-01-16  2:47         ` Keith Busch
  2018-01-17  0:56           ` Bjorn Helgaas
  0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2018-01-16  2:47 UTC (permalink / raw)
  To: Sinan Kaya; +Cc: linux-pci, Bjorn Helgaas, Maik Broemme, Pawandeep Oza

On Wed, Jan 10, 2018 at 10:52:23AM -0500, Sinan Kaya wrote:
> Since the AER status is set when we observe DPC event and nobody is clearing these
> we won't observe another DPC event until somebody clears these. We can say that
> we are resetting the endpoints as part of the DPC but we are not touching the
> switch downstream port or the root port registers.
> 
> Somebody still needs to clear these in addition to printing whatever information
> is available in the AER registers.
> 
> Do you agree?

We should clear the downstream port's AER uncorrectable status register if
the trigger reason is of that type, but DPC definitely does not require
we clear it in order to observe a new event.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER
  2018-01-16  1:33     ` Keith Busch
@ 2018-01-16  3:04       ` Sinan Kaya
  0 siblings, 0 replies; 24+ messages in thread
From: Sinan Kaya @ 2018-01-16  3:04 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Maik Broemme

On 1/15/2018 8:33 PM, Keith Busch wrote:
> On Mon, Jan 15, 2018 at 09:43:22AM -0500, Sinan Kaya wrote:
>> On 12/19/2017 4:06 PM, Keith Busch wrote:
>>> @@ -289,6 +290,9 @@ static int dpc_probe(struct pcie_device *dev)
>>>  	int status;
>>>  	u16 ctl, cap;
>>>  
>>> +	if (pcie_aer_get_firmware_first(pdev))
>>> +		return -ENOTSUPP;
>>> +
>>
>> There are two ways to support firmware first handling along with DPC.
>>
>> The first one is to tie DPC handling to the firmware first enable.
>>
>> The second one is to enable DPC ERR_COR signalling so that firmware
>> gets notified on each DPC event occurrence.
>>
>> While the first one gives more control to the firmware, I think it beats
>> the purpose of the DPC. The first approach requires firmware to do some
>> "pre-processing" before notifying operating system of a failure.
>>
>> The goal of the DPC is to put hardware into safe state when a PCIe error
>> happens. The best software recovery following this is to notify endpoint
>> drivers of failures and shutdown threads/processes accessing the hardware
>> as quick as possible.
>>
>> The firmware-first event notification is through ACPI GHES and firmware injects
>> an artifical uncorrected AER error to the operating system. Once OS gets
>> notified, it tries to recover drivers through AER fatal error recovery mechanism.
>>
>> While the semantics of this path is clearly defined in ACPI, it is also known
>> to be slow as well. During the time firmware does its business, operating
>> system still could be trying to access the endpoint address space.
>>
>> My suggestion is to enable ERR_COR signalling so firmware gets a notification
>> on each DPC event for logging purposes. 
>>
>> OS handles DPC natively and tries to recover hardware without any external
>> influence.
> 
> I see what you're saying, but if a device has a firmware first policy,
> doesn't that mean firmware owns the DPC Control register? The OS shouldn't
> be mucking with it in that case, right?
> 

I agree. I looked at the spec one more time. These are the two paragraphs mentioning
firmware first. Unfortunately, it will come down to the quality of firmware implementation
to make something useful out of DPC functionality.

There should have been a DPC control request as well as a firmware-first control request
instead of tying these together.

"Determination of DPC Control
DPC may be controlled in some configurations by platform firmware and in other configurations by
the operating system. DPC functionality is strongly linked with the functionality in Advanced Error
Reporting. To avoid conflicts over whether platform firmware or the operating system have control
of DPC, it is recommended that platform firmware and operating systems always link the control of
DPC to the control of Advanced Error Reporting."

"Use of DPC ERR_COR Signaling
It is recommended that operating systems use DPC interrupts for signaling when DPC has been
triggered. While DPC ERR_COR signaling indicates the same event, DPC ERR_COR signaling is
primarily intended for use by platform firmware, when it needs to be notified in order to do its own
logging of the event or provide “firmware first” services"




-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
  2018-01-14  1:35     ` Keith Busch
  2018-01-15 14:32       ` Sinan Kaya
@ 2018-01-17  0:14       ` Bjorn Helgaas
  1 sibling, 0 replies; 24+ messages in thread
From: Bjorn Helgaas @ 2018-01-17  0:14 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Maik Broemme

On Sat, Jan 13, 2018 at 06:35:18PM -0700, Keith Busch wrote:
> On Fri, Jan 12, 2018 at 05:03:03PM -0600, Bjorn Helgaas wrote:
> > Hi Keith,
> > 
> > On Tue, Dec 19, 2017 at 02:06:43PM -0700, Keith Busch wrote:
> > > A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
> > > which prevents the AER handler from reporting the details of the
> > > error. This patch will have the DPC driver get the AER status registers
> > > from the downstream port that detected the uncorrectable error, and
> > > print out additional information.
> > > 
> > > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > > ---
> > >  drivers/pci/pcie/pcie-dpc.c | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > > 
> > > diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
> > > index ef71a472592c..aefc4fbdcef0 100644
> > > --- a/drivers/pci/pcie/pcie-dpc.c
> > > +++ b/drivers/pci/pcie/pcie-dpc.c
> > > @@ -44,6 +44,7 @@ struct dpc_dev {
> > >  	int			cap_pos;
> > >  	bool			rp;
> > >  	u32			rp_pio_status;
> > > +	struct aer_err_info	info;
> > >  };
> > >  
> > >  static const char * const rp_pio_error_string[] = {
> > > @@ -111,6 +112,9 @@ static void interrupt_event_handler(struct work_struct *work)
> > >  	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
> > >  	struct pci_bus *parent = pdev->subordinate;
> > >  
> > > +	if (aer_get_device_error_info(pdev, &dpc->info))
> > > +		aer_print_error(pdev, &dpc->info);
> > 
> > I'm confused about this.  "pdev" is the DPC-enabled Port, i.e., the
> > Port that received an ERR_FATAL or ERR_NONFATAL Message.  The Message
> > was generated by something below "pdev", and that's where the
> > interesting AER logging would have been done.
> > 
> > This patch suggests that the DPC port itself would have something
> > interesting logged in its AER capability.  Is that really true?  I see
> > in sec 6.2.10 (PCIe r4.0) that the DPC port should log the Trigger
> > Reason and the Error Source ID from the Message, but I don't see
> > anything about logging anything in its AER registers.
> 
> If the trigger reason is "unmasked uncorrectable error", then the DPC
> AER register has useful information. All other reasons would not provide
> useful data in the AER register.

OK, that makes sense.  This "unmasked uncorrectable error" case would
be things like the Port receiving an Unsupported Request completion, a
Completion Timeout, Completer Abort, etc.  In that case, the DPC Port
should do its own AER logging before triggering DPC.

> I would assume that aer_get_device_error_info would return false if
> it's one of other reasons, but I can fix this up to check the trigger
> reasoning rather than unconditionally reading the AER status

It might be overkill, but it does niggle at me that in these other
cases, the hardware isn't telling us to read the AER logging, but we
do it anyway.

We currently pass no information to interrupt_event_handler(), so it
has no way of testing the trigger reason.  I guess you could do the
test in dpc_irq() before scheduling interrupt_event_handler()?

That would move it from the work item to the ISR, but I guess it's
essentially the same sort of code as dpc_process_rp_pio_error(), which
we already do in the ISR.

Bjorn

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
  2018-01-15 14:32       ` Sinan Kaya
@ 2018-01-17  0:36         ` Bjorn Helgaas
  0 siblings, 0 replies; 24+ messages in thread
From: Bjorn Helgaas @ 2018-01-17  0:36 UTC (permalink / raw)
  To: Sinan Kaya; +Cc: Keith Busch, linux-pci, Bjorn Helgaas, Maik Broemme

On Mon, Jan 15, 2018 at 09:32:59AM -0500, Sinan Kaya wrote:
> On 1/13/2018 8:35 PM, Keith Busch wrote:
> > On Fri, Jan 12, 2018 at 05:03:03PM -0600, Bjorn Helgaas wrote:
> >> Hi Keith,
> >>
> >> On Tue, Dec 19, 2017 at 02:06:43PM -0700, Keith Busch wrote:
> >>> A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
> >>> which prevents the AER handler from reporting the details of the
> >>> error. This patch will have the DPC driver get the AER status registers
> >>> from the downstream port that detected the uncorrectable error, and
> >>> print out additional information.
> >>>
> >>> Signed-off-by: Keith Busch <keith.busch@intel.com>
> >>> ---
> >>>  drivers/pci/pcie/pcie-dpc.c | 8 ++++++++
> >>>  1 file changed, 8 insertions(+)
> >>>
> >>> diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
> >>> index ef71a472592c..aefc4fbdcef0 100644
> >>> --- a/drivers/pci/pcie/pcie-dpc.c
> >>> +++ b/drivers/pci/pcie/pcie-dpc.c
> >>> @@ -44,6 +44,7 @@ struct dpc_dev {
> >>>  	int			cap_pos;
> >>>  	bool			rp;
> >>>  	u32			rp_pio_status;
> >>> +	struct aer_err_info	info;
> >>>  };
> >>>  
> >>>  static const char * const rp_pio_error_string[] = {
> >>> @@ -111,6 +112,9 @@ static void interrupt_event_handler(struct work_struct *work)
> >>>  	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
> >>>  	struct pci_bus *parent = pdev->subordinate;
> >>>  
> >>> +	if (aer_get_device_error_info(pdev, &dpc->info))
> >>> +		aer_print_error(pdev, &dpc->info);
> >>
> >> I'm confused about this.  "pdev" is the DPC-enabled Port, i.e., the
> >> Port that received an ERR_FATAL or ERR_NONFATAL Message.  The Message
> >> was generated by something below "pdev", and that's where the
> >> interesting AER logging would have been done.
> >>
> >> This patch suggests that the DPC port itself would have something
> >> interesting logged in its AER capability.  Is that really true?  I see
> >> in sec 6.2.10 (PCIe r4.0) that the DPC port should log the Trigger
> >> Reason and the Error Source ID from the Message, but I don't see
> >> anything about logging anything in its AER registers.
> > 
> > If the trigger reason is "unmasked uncorrectable error", then the DPC
> > AER register has useful information. All other reasons would not provide
> > useful data in the AER register.
> > 
> > 
> > I would assume that aer_get_device_error_info would return false if
> > it's one of other reasons, but I can fix this up to check the trigger
> > reasoning rather than unconditionally reading the AER status
> > 
> 
> https://pcisig.com/sites/default/files/specification_documents/ECN_DPC_2012-02-09_finalized.pdf
> 
> If you look at figure 6-2, AER errors are always logged regardless
> of the DPC presence.

Figure 6-2 applies to "Errors from Table 6-2, 6-3, or 6-4".
(I suspect it should also include Table 6-5 because PCIe r4.0,
sec 6.2.4 says Table 6-5 corresponds to AER status bits).

But in any case, those tables don't include receipt of ERR_FATAL or
ERR_NONFATAL messages.  So I don't think a DPC Port should be logging
anything in its own AER registers when it receives those messages.

If a DPC Port receives ERR_FATAL or ERR_NONFATAL while DPC is enabled,
it will trigger DPC and drop the message.  If DPC were not enabled, I
think it would merely forward the message toward the Root Complex
without logging an error itself.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
  2018-01-16  2:47         ` Keith Busch
@ 2018-01-17  0:56           ` Bjorn Helgaas
  2018-01-17  1:34             ` Keith Busch
  2018-01-17 13:36             ` Sinan Kaya
  0 siblings, 2 replies; 24+ messages in thread
From: Bjorn Helgaas @ 2018-01-17  0:56 UTC (permalink / raw)
  To: Keith Busch
  Cc: Sinan Kaya, linux-pci, Bjorn Helgaas, Maik Broemme, Pawandeep Oza

On Mon, Jan 15, 2018 at 07:47:41PM -0700, Keith Busch wrote:
> On Wed, Jan 10, 2018 at 10:52:23AM -0500, Sinan Kaya wrote:
> > Since the AER status is set when we observe DPC event and nobody
> > is clearing these we won't observe another DPC event until
> > somebody clears these. We can say that we are resetting the
> > endpoints as part of the DPC but we are not touching the switch
> > downstream port or the root port registers.
> > 
> > Somebody still needs to clear these in addition to printing
> > whatever information is available in the AER registers.
> 
> We should clear the downstream port's AER uncorrectable status
> register if the trigger reason is of that type, but DPC definitely
> does not require we clear it in order to observe a new event.

I don't quite understand this.  We don't need to clear the AER status
in order to observe a new DPC event, but I think we *do* need to clear
the AER status in order to log future AER events.  Don't we?

I think Sinan is saying that if a DPC Port observes its own unmasked
uncorrectable error (i.e., not something it learned about by receiving
an ERR_* message), it will set a bit in an AER status register.

Since we do not set DPC ERR_COR Enable, the DPC Port does not generate
an ERR_COR Message, so the AER driver never learns about the error and
never clears the AER status register.  So we'll decode the AER status
(with your current patch), but we don't clear it, so if another error
occurs, the AER logging won't work correctly.

Maybe we should be setting DPC ERR_COR Enable and letting the AER
driver decode and clear the AER info?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
  2018-01-17  0:56           ` Bjorn Helgaas
@ 2018-01-17  1:34             ` Keith Busch
  2018-01-17 13:36             ` Sinan Kaya
  1 sibling, 0 replies; 24+ messages in thread
From: Keith Busch @ 2018-01-17  1:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Sinan Kaya, linux-pci, Bjorn Helgaas, Maik Broemme, Pawandeep Oza

On Tue, Jan 16, 2018 at 06:56:00PM -0600, Bjorn Helgaas wrote:
> On Mon, Jan 15, 2018 at 07:47:41PM -0700, Keith Busch wrote:
> > On Wed, Jan 10, 2018 at 10:52:23AM -0500, Sinan Kaya wrote:
> > > Since the AER status is set when we observe DPC event and nobody
> > > is clearing these we won't observe another DPC event until
> > > somebody clears these. We can say that we are resetting the
> > > endpoints as part of the DPC but we are not touching the switch
> > > downstream port or the root port registers.
> > > 
> > > Somebody still needs to clear these in addition to printing
> > > whatever information is available in the AER registers.
> > 
> > We should clear the downstream port's AER uncorrectable status
> > register if the trigger reason is of that type, but DPC definitely
> > does not require we clear it in order to observe a new event.
> 
> I don't quite understand this.  We don't need to clear the AER status
> in order to observe a new DPC event, but I think we *do* need to clear
> the AER status in order to log future AER events.  Don't we?

Absolutely, I totally agree on this. I have a v2 ready to send that calls
pci_cleanup_aer_uncorrect_error_status after printing the error. And
I've also made sure to do this only for the appropriate trigger reason
that you mentioned on the other thread.

> I think Sinan is saying that if a DPC Port observes its own unmasked
> uncorrectable error (i.e., not something it learned about by receiving
> an ERR_* message), it will set a bit in an AER status register.
> 
> Since we do not set DPC ERR_COR Enable, the DPC Port does not generate
> an ERR_COR Message, so the AER driver never learns about the error and
> never clears the AER status register.  So we'll decode the AER status
> (with your current patch), but we don't clear it, so if another error
> occurs, the AER logging won't work correctly.
> 
> Maybe we should be setting DPC ERR_COR Enable and letting the AER
> driver decode and clear the AER info?

Yes, enabling ERR_COR should be no problem and will provide those
platform notification benefits.

That alone will not clear the uncorrectable status, though. The AER
driver will only check correctable status registers on that message. The
spec doesn't appear to define what we should find in the downstream
port's correctable status, so I can't tell if that will actually log
new information or if it's simply to assist platform notification that
something happened.

I'll post the v2 for consideration, and inlcude the ERR_COR enabling. This
does clash a bit with Oza's approach, but it should lower churn for the
near term.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling
  2018-01-17  0:56           ` Bjorn Helgaas
  2018-01-17  1:34             ` Keith Busch
@ 2018-01-17 13:36             ` Sinan Kaya
  1 sibling, 0 replies; 24+ messages in thread
From: Sinan Kaya @ 2018-01-17 13:36 UTC (permalink / raw)
  To: Bjorn Helgaas, Keith Busch
  Cc: linux-pci, Bjorn Helgaas, Maik Broemme, Pawandeep Oza

On 1/16/2018 7:56 PM, Bjorn Helgaas wrote:
> I think Sinan is saying that if a DPC Port observes its own unmasked
> uncorrectable error (i.e., not something it learned about by receiving
> an ERR_* message), it will set a bit in an AER status register.
> 
> Since we do not set DPC ERR_COR Enable, the DPC Port does not generate
> an ERR_COR Message, so the AER driver never learns about the error and
> never clears the AER status register.  So we'll decode the AER status
> (with your current patch), but we don't clear it, so if another error
> occurs, the AER logging won't work correctly.

Yes, this is what I was meaning. I see that Keith took care of this in
the new series.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported
  2017-12-19 21:06 [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported Keith Busch
                   ` (3 preceding siblings ...)
  2017-12-21  1:04 ` [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported Bjorn Helgaas
@ 2018-03-20 23:02 ` Bjorn Helgaas
  2018-03-21 22:27   ` Keith Busch
  4 siblings, 1 reply; 24+ messages in thread
From: Bjorn Helgaas @ 2018-03-20 23:02 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-pci, Bjorn Helgaas, Maik Broemme

On Tue, Dec 19, 2017 at 02:06:40PM -0700, Keith Busch wrote:
> Getting the AER information is documented to return 0 when it failed to
> get the information.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>

Keith, I'm sorry, but I lost track of where we're at with this series.
Can you repost a fresh version if it's still applicable?

Don't worry about rebasing it to anything other than v4.16-rc1 (my
"master" branch).  I might have added some conflicting things, but
I'll take care of resolving any conflicts.

> ---
>  drivers/pci/pcie/aer/aerdrv_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index 744805232155..ea0dc1cc7fc7 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -660,7 +660,7 @@ static int get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>  
>  	/* The device might not support AER */
>  	if (!pos)
> -		return 1;
> +		return 0;
>  
>  	if (info->severity == AER_CORRECTABLE) {
>  		pci_read_config_dword(dev, pos + PCI_ERR_COR_STATUS,
> -- 
> 2.13.6
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported
  2018-03-20 23:02 ` Bjorn Helgaas
@ 2018-03-21 22:27   ` Keith Busch
  0 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2018-03-21 22:27 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: linux-pci, Bjorn Helgaas, Maik Broemme

On Tue, Mar 20, 2018 at 04:02:01PM -0700, Bjorn Helgaas wrote:
> On Tue, Dec 19, 2017 at 02:06:40PM -0700, Keith Busch wrote:
> > Getting the AER information is documented to return 0 when it failed to
> > get the information.
> > 
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> 
> Keith, I'm sorry, but I lost track of where we're at with this series.
> Can you repost a fresh version if it's still applicable?
> 
> Don't worry about rebasing it to anything other than v4.16-rc1 (my
> "master" branch).  I might have added some conflicting things, but
> I'll take care of resolving any conflicts.

No problem, I think this is still applicable, and can be done without
conflicting with what Oza and Sinan are working on (it took a few tries,
but it finally clicked for me on how that works).

I'll rebase a new series and should have it posted by end of week.

Thanks,
Keith

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2018-03-21 22:27 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-19 21:06 [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported Keith Busch
2017-12-19 21:06 ` [PATCH 2/4] PCI/AER: Provide API for getting AER information Keith Busch
2017-12-19 21:06 ` [PATCH 3/4] PCI/DPC: Enable DPC in conjuction with AER Keith Busch
2018-01-15 14:43   ` Sinan Kaya
2018-01-16  1:33     ` Keith Busch
2018-01-16  3:04       ` Sinan Kaya
2017-12-19 21:06 ` [PATCH 4/4] PCI/DPC: Print AER status in DPC event handling Keith Busch
2017-12-21  4:43   ` Sinan Kaya
2017-12-21  5:12     ` Keith Busch
2018-01-10 15:52       ` Sinan Kaya
2018-01-16  2:47         ` Keith Busch
2018-01-17  0:56           ` Bjorn Helgaas
2018-01-17  1:34             ` Keith Busch
2018-01-17 13:36             ` Sinan Kaya
2018-01-12 23:03   ` Bjorn Helgaas
2018-01-14  1:35     ` Keith Busch
2018-01-15 14:32       ` Sinan Kaya
2018-01-17  0:36         ` Bjorn Helgaas
2018-01-17  0:14       ` Bjorn Helgaas
2017-12-21  1:04 ` [PATCH 1/4] PCI/AER: Return approrpiate value when AER is not supported Bjorn Helgaas
2017-12-21  3:53   ` Dongdong Liu
2017-12-21 14:59   ` Keith Busch
2018-03-20 23:02 ` Bjorn Helgaas
2018-03-21 22:27   ` Keith Busch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).