linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/3] PCI/ACPI: Fix firmware first error recovery with root port in reset
@ 2013-05-30 14:39 Betty Dall
  2013-05-30 14:39 ` [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse() Betty Dall
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Betty Dall @ 2013-05-30 14:39 UTC (permalink / raw)
  To: rjw, bhelgaas; +Cc: ying.huang, linux-acpi, linux-kernel, linux-pci, Betty Dall

This patch set fixes a bug on platforms that use firmware first AER.
Firmware can leave the root port in Secondary Bus Reset (SBR) and
communicate this to the OS through the "reset" bit in the flags field
of the HEST table and associated CPER records. Firmware wants to do this
so that the error is contained and the hardware is in a known state.

Without these patches, the root port stays in SBR and the device drivers
cannot recover. These patches recognize when the firmware first root port
is in SBR and bring the root port out of SBR so the devices under the root
port can recover.

The changes have been tested on systems with firmware first that set the
"reset" bit by injecting various hardware errors. The errors successfully
recover.

Changes since v1:
Fixed a typo in the comment of patch 2.
Removed incorrect setting of reset bit in patch 3.

Signed-off-by: Betty Dall <betty.dall@hp.com>
---
Betty Dall (3):
  PCI/AER: Fix incorrect return from aer_hest_parse()
  ACPI/APEI: Force fatal AER severity when bus has been reset
  PCI/AER: Provide reset_link for firmware first root port
---
 drivers/acpi/apei/ghes.c           |   21 ++++++++++++++++++++-
 drivers/pci/pcie/aer/aerdrv_acpi.c |    3 +++
 drivers/pci/pcie/aer/aerdrv_core.c |   36 ++++++++++++++++++++++++++++++++++++
 3 files changed, 59 insertions(+), 1 deletions(-)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse()
  2013-05-30 14:39 [PATCH v2 0/3] PCI/ACPI: Fix firmware first error recovery with root port in reset Betty Dall
@ 2013-05-30 14:39 ` Betty Dall
  2013-06-02  0:38   ` Bjorn Helgaas
                     ` (2 more replies)
  2013-05-30 14:39 ` [PATCH v2 2/3] ACPI/APEI: Force fatal AER severity when bus has been reset Betty Dall
  2013-05-30 14:39 ` [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root port Betty Dall
  2 siblings, 3 replies; 21+ messages in thread
From: Betty Dall @ 2013-05-30 14:39 UTC (permalink / raw)
  To: rjw, bhelgaas; +Cc: ying.huang, linux-acpi, linux-kernel, linux-pci, Betty Dall

The function aer_hest_parse() is called to determine if the given
PCI device is firmware first or not. The code loops through each
section of the HEST table to look for a match. The bug is that
the function always returns whether the last HEST section is firmware
first. The fix stops the iteration once the info.firmware_first
variable is set.  This is similar to how the function aer_hest_parse_aff()
stops the iteration.

Signed-off-by: Betty Dall <betty.dall@hp.com>
---

 drivers/pci/pcie/aer/aerdrv_acpi.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)


diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
index 5194a7d..39b8671 100644
--- a/drivers/pci/pcie/aer/aerdrv_acpi.c
+++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
@@ -42,6 +42,9 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
 	u8 bridge = 0;
 	int ff = 0;
 
+	if (info->firmware_first)
+		return 0;
+
 	switch (hest_hdr->type) {
 	case ACPI_HEST_TYPE_AER_ROOT_PORT:
 		pcie_type = PCI_EXP_TYPE_ROOT_PORT;

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 2/3] ACPI/APEI: Force fatal AER severity when bus has been reset
  2013-05-30 14:39 [PATCH v2 0/3] PCI/ACPI: Fix firmware first error recovery with root port in reset Betty Dall
  2013-05-30 14:39 ` [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse() Betty Dall
@ 2013-05-30 14:39 ` Betty Dall
  2013-06-04  7:53   ` Chen Gong
  2013-05-30 14:39 ` [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root port Betty Dall
  2 siblings, 1 reply; 21+ messages in thread
From: Betty Dall @ 2013-05-30 14:39 UTC (permalink / raw)
  To: rjw, bhelgaas; +Cc: ying.huang, linux-acpi, linux-kernel, linux-pci, Betty Dall

The CPER error record has a reset bit that indicates that the platform
has reset the bus. The reset bit can be set for any severity error
including recoverable.  From the AER code path's perspective,
any error is fatal if the bus has been reset. This patch upgrades the
severity of the AER recovery to AER_FATAL whenever the CPER error record
indicates that the bus has been reset.

Changes since v1:
Fixed a typo in comment.

Signed-off-by: Betty Dall <betty.dall@hp.com>
---

 drivers/acpi/apei/ghes.c |   21 ++++++++++++++++++++-
 1 files changed, 20 insertions(+), 1 deletions(-)


diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index d668a8a..1c67d5a 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -451,7 +451,26 @@ static void ghes_do_proc(struct ghes *ghes,
 				int aer_severity;
 				devfn = PCI_DEVFN(pcie_err->device_id.device,
 						  pcie_err->device_id.function);
-				aer_severity = cper_severity_to_aer(sev);
+				/*
+				 * Some Firmware First implementations
+				 * put the device in SBR to contain
+				 * the error. This is indicated by the
+				 * CPER Section Descriptor Flags reset
+				 * bit which means the component must
+				 * be re-initialized or re-enabled
+				 * prior to use. Promoting the AER
+				 * serverity to FATAL will cause the
+				 * AER code to link_reset and allow
+				 * drivers to reprogram their cards.
+				 */
+				if (gdata->flags & CPER_SEC_RESET)
+					aer_severity = cper_severity_to_aer(
+							CPER_SEV_FATAL);
+				else
+					aer_severity =
+						cper_severity_to_aer(sev);
+
+
 				aer_recover_queue(pcie_err->device_id.segment,
 						  pcie_err->device_id.bus,
 						  devfn, aer_severity);

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root port
  2013-05-30 14:39 [PATCH v2 0/3] PCI/ACPI: Fix firmware first error recovery with root port in reset Betty Dall
  2013-05-30 14:39 ` [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse() Betty Dall
  2013-05-30 14:39 ` [PATCH v2 2/3] ACPI/APEI: Force fatal AER severity when bus has been reset Betty Dall
@ 2013-05-30 14:39 ` Betty Dall
  2013-06-04  8:36   ` Chen Gong
  2 siblings, 1 reply; 21+ messages in thread
From: Betty Dall @ 2013-05-30 14:39 UTC (permalink / raw)
  To: rjw, bhelgaas; +Cc: ying.huang, linux-acpi, linux-kernel, linux-pci, Betty Dall

The firmware first path does not install the aerdrv root port
service driver, so the firmware first root port does not have
a reset_link callback. When a firmware first root port does not have
a reset_link callback, use a new default reset_link similar to what
we already do for the default_downstream_reset_link(). The firmware
first default reset_link brings the root port out of SBR if firmware
left it in SBR.

Changes since v1:
Removed incorrect setting of p2p_ctrl after the read.

Signed-off-by: Betty Dall <betty.dall@hp.com>
---

 drivers/pci/pcie/aer/aerdrv_core.c |   36 ++++++++++++++++++++++++++++++++++++
 1 files changed, 36 insertions(+), 0 deletions(-)


diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 8ec8b4f..72f76cd 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -413,6 +413,39 @@ static pci_ers_result_t default_downstream_reset_link(struct pci_dev *dev)
 	return PCI_ERS_RESULT_RECOVERED;
 }
 
+/**
+ * default_ff_root_port_reset_link - default reset function for firmware
+ *		first Root Port
+ * @dev: pointer to root port's pci_dev data structure
+ *
+ * Invoked when performing link reset at Root Port w/ no aer driver.
+ * This happens through the firmware first path. Firmware may leave
+ * the Root Port in SBR and it is the OS responsiblity to bring it out
+ * of SBR.
+ */
+static pci_ers_result_t default_ff_root_port_reset_link(struct pci_dev *dev)
+{
+	u16 p2p_ctrl;
+
+	/* Read Secondary Bus Reset */
+	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &p2p_ctrl);
+
+	/* De-assert Secondary Bus Reset, if it is set */
+	if (p2p_ctrl & PCI_BRIDGE_CTL_BUS_RESET) {
+		p2p_ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
+		pci_write_config_word(dev, PCI_BRIDGE_CONTROL, p2p_ctrl);
+
+		/*
+		 * System software must wait for at least 100ms from the end
+		 * of a reset of one or more device before it is permitted
+		 * to issue Configuration Requests to those devices.
+		 */
+		msleep(200);
+		dev_dbg(&dev->dev, "Root Port link has been reset\n");
+	}
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
 static int find_aer_service_iter(struct device *device, void *data)
 {
 	struct pcie_port_service_driver *service_driver, **drv;
@@ -460,6 +493,9 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
 		status = driver->reset_link(udev);
 	} else if (pci_pcie_type(udev) == PCI_EXP_TYPE_DOWNSTREAM) {
 		status = default_downstream_reset_link(udev);
+	} else if (pci_pcie_type(udev) == PCI_EXP_TYPE_ROOT_PORT &&
+		pcie_aer_get_firmware_first(udev)) {
+		status = default_ff_root_port_reset_link(udev);
 	} else {
 		dev_printk(KERN_DEBUG, &dev->dev,
 			"no link-reset support at upstream device %s\n",

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse()
  2013-05-30 14:39 ` [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse() Betty Dall
@ 2013-06-02  0:38   ` Bjorn Helgaas
  2013-06-03 21:18     ` Betty Dall
  2013-06-04  7:42   ` Chen Gong
  2013-06-04 13:13   ` Bjorn Helgaas
  2 siblings, 1 reply; 21+ messages in thread
From: Bjorn Helgaas @ 2013-06-02  0:38 UTC (permalink / raw)
  To: Betty Dall
  Cc: Rafael J. Wysocki, Huang Ying, linux-acpi, linux-kernel,
	linux-pci, Moore, Robert

[+cc Bob for ACPI HEST spec questions]

On Thu, May 30, 2013 at 8:39 AM, Betty Dall <betty.dall@hp.com> wrote:
> The function aer_hest_parse() is called to determine if the given
> PCI device is firmware first or not. The code loops through each
> section of the HEST table to look for a match. The bug is that
> the function always returns whether the last HEST section is firmware
> first. The fix stops the iteration once the info.firmware_first
> variable is set.  This is similar to how the function aer_hest_parse_aff()
> stops the iteration.
>
> Signed-off-by: Betty Dall <betty.dall@hp.com>
> ---
>
>  drivers/pci/pcie/aer/aerdrv_acpi.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
>
>
> diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
> index 5194a7d..39b8671 100644
> --- a/drivers/pci/pcie/aer/aerdrv_acpi.c
> +++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
> @@ -42,6 +42,9 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
>         u8 bridge = 0;
>         int ff = 0;
>
> +       if (info->firmware_first)
> +               return 0;
> +
>         switch (hest_hdr->type) {
>         case ACPI_HEST_TYPE_AER_ROOT_PORT:
>                 pcie_type = PCI_EXP_TYPE_ROOT_PORT;

Not related directly to your patch, Betty, but I can't figure out why
the ACPI spec defines the HEST structures for PCIe as it does.  I'm
looking at ACPI 5.0, sec 18.3.2.3 - 18.3.2.5.

1) The PCIe Root Port, PCIe Device, and PCIe/PCI-X Bridge structures
all include Bus, Device, and Function fields.  But there's no Segment.
 The current Linux code (hest_match_pci()) assumes HEST records can
only apply to PCI domain 0.  Is Linux missing something, or is the
HEST really this limited?

2) Doesn't the fact that the HEST structures include a bus number mean
that the OS cannot reassign bus numbers?  I guess maybe we could still
reassign them if we kept track of the mapping back to the original
values.

3) It's interesting that the only choices seem to be "global" or "list
every device."  There's no "apply to this device and the subtree under
it," so I guess if you want HEST to apply to hot-added devices,
"global" is the only thing that makes sense?

Bjorn

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse()
  2013-06-02  0:38   ` Bjorn Helgaas
@ 2013-06-03 21:18     ` Betty Dall
  2013-06-04 17:39       ` Bjorn Helgaas
  0 siblings, 1 reply; 21+ messages in thread
From: Betty Dall @ 2013-06-03 21:18 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Huang Ying, linux-acpi, linux-kernel,
	linux-pci, Moore, Robert

On Sat, 2013-06-01 at 18:38 -0600, Bjorn Helgaas wrote:
> [+cc Bob for ACPI HEST spec questions]
> 
> On Thu, May 30, 2013 at 8:39 AM, Betty Dall <betty.dall@hp.com> wrote:
> > The function aer_hest_parse() is called to determine if the given
> > PCI device is firmware first or not. The code loops through each
> > section of the HEST table to look for a match. The bug is that
> > the function always returns whether the last HEST section is firmware
> > first. The fix stops the iteration once the info.firmware_first
> > variable is set.  This is similar to how the function aer_hest_parse_aff()
> > stops the iteration.
> >
> > Signed-off-by: Betty Dall <betty.dall@hp.com>
> > ---
> >
> >  drivers/pci/pcie/aer/aerdrv_acpi.c |    3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> >
> >
> > diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
> > index 5194a7d..39b8671 100644
> > --- a/drivers/pci/pcie/aer/aerdrv_acpi.c
> > +++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
> > @@ -42,6 +42,9 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
> >         u8 bridge = 0;
> >         int ff = 0;
> >
> > +       if (info->firmware_first)
> > +               return 0;
> > +
> >         switch (hest_hdr->type) {
> >         case ACPI_HEST_TYPE_AER_ROOT_PORT:
> >                 pcie_type = PCI_EXP_TYPE_ROOT_PORT;
> 
> Not related directly to your patch, Betty, but I can't figure out why
> the ACPI spec defines the HEST structures for PCIe as it does.  I'm
> looking at ACPI 5.0, sec 18.3.2.3 - 18.3.2.5. 

> 1) The PCIe Root Port, PCIe Device, and PCIe/PCI-X Bridge structures
> all include Bus, Device, and Function fields.  But there's no Segment.
>  The current Linux code (hest_match_pci()) assumes HEST records can
> only apply to PCI domain 0.  Is Linux missing something, or is the
> HEST really this limited?
You are right that the HEST table does not have the Segment for the PCIe
sources. The Linux code uses the Generic Source type that points to a
UEFI CPER record. Those do have the Segment. The code in
acpi/apei/ghes.c that parses the HEST and invokes the
aer_recover_queue() is using the segment from the CPER record. 

The hest_match_pci() code is only looking at the PCIe error source types
because that is where the flags field is defined to indicate "firmware
first". Flags is not defined in the Generic error source.  Firmware
first platforms today must be using GLOBAL error sources because the
code is written in such a way that any specific PCIe match will cause
the entire platform to be firmware first. Linux only supports either AER
or firmware first right now, and it would be an enhancement to do both. 

> 2) Doesn't the fact that the HEST structures include a bus number mean
> that the OS cannot reassign bus numbers?  I guess maybe we could still
> reassign them if we kept track of the mapping back to the original
> values.
Yes, I think there needs to be a mapping between the HEST/CPER bus
number and the OS assigned bus numbers, if the OS is reassigning bus
numbers.  

> 3) It's interesting that the only choices seem to be "global" or "list
> every device."  There's no "apply to this device and the subtree under
> it," so I guess if you want HEST to apply to hot-added devices,
> "global" is the only thing that makes sense?
"Global" is the one one that makes sense to me too. The size of the HEST
table is fixed at boot and firmware strives to keep it as small as
possible because firmware memory is a scarce resource. 

Betty

> Bjorn



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse()
  2013-05-30 14:39 ` [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse() Betty Dall
  2013-06-02  0:38   ` Bjorn Helgaas
@ 2013-06-04  7:42   ` Chen Gong
  2013-06-04 13:13   ` Bjorn Helgaas
  2 siblings, 0 replies; 21+ messages in thread
From: Chen Gong @ 2013-06-04  7:42 UTC (permalink / raw)
  To: Betty Dall; +Cc: rjw, bhelgaas, ying.huang, linux-acpi, linux-kernel, linux-pci

[-- Attachment #1: Type: text/plain, Size: 2185 bytes --]

On Thu, May 30, 2013 at 08:39:27AM -0600, Betty Dall wrote:
> Date:	Thu, 30 May 2013 08:39:27 -0600
> From: Betty Dall <betty.dall@hp.com>
> To: rjw@sisk.pl, bhelgaas@google.com
> Cc: ying.huang@intel.com, linux-acpi@vger.kernel.org,
>  linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, Betty Dall
>  <betty.dall@hp.com>
> Subject: [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse()
> X-Mailer: git-send-email 1.7.7.6
> 
> The function aer_hest_parse() is called to determine if the given
> PCI device is firmware first or not. The code loops through each
> section of the HEST table to look for a match. The bug is that
> the function always returns whether the last HEST section is firmware
> first. The fix stops the iteration once the info.firmware_first
> variable is set.  This is similar to how the function aer_hest_parse_aff()
> stops the iteration.
> 
> Signed-off-by: Betty Dall <betty.dall@hp.com>

The patch is good. But I have further concern based on your patch.
1) aer_hest_parse never checks the 2nd input parameter (void *data),
which means once it is NULL, it will crash the kernel.

2) both aer_hest_parse and aer_hest_parse_aff utilize some flag
as shortcut, if so, why not adding similar logic in apei_hest_parse
to stop meaningless loops once FFM is confirmed as enabled.

> ---
> 
>  drivers/pci/pcie/aer/aerdrv_acpi.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> 
> diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
> index 5194a7d..39b8671 100644
> --- a/drivers/pci/pcie/aer/aerdrv_acpi.c
> +++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
> @@ -42,6 +42,9 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
>  	u8 bridge = 0;
>  	int ff = 0;
>  
> +	if (info->firmware_first)
> +		return 0;
> +
>  	switch (hest_hdr->type) {
>  	case ACPI_HEST_TYPE_AER_ROOT_PORT:
>  		pcie_type = PCI_EXP_TYPE_ROOT_PORT;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/3] ACPI/APEI: Force fatal AER severity when bus has been reset
  2013-05-30 14:39 ` [PATCH v2 2/3] ACPI/APEI: Force fatal AER severity when bus has been reset Betty Dall
@ 2013-06-04  7:53   ` Chen Gong
  2013-06-04 16:20     ` Betty Dall
  2013-06-04 17:54     ` Bjorn Helgaas
  0 siblings, 2 replies; 21+ messages in thread
From: Chen Gong @ 2013-06-04  7:53 UTC (permalink / raw)
  To: Betty Dall; +Cc: rjw, bhelgaas, ying.huang, linux-acpi, linux-kernel, linux-pci

[-- Attachment #1: Type: text/plain, Size: 2732 bytes --]

On Thu, May 30, 2013 at 08:39:28AM -0600, Betty Dall wrote:
> Date:	Thu, 30 May 2013 08:39:28 -0600
> From: Betty Dall <betty.dall@hp.com>
> To: rjw@sisk.pl, bhelgaas@google.com
> Cc: ying.huang@intel.com, linux-acpi@vger.kernel.org,
>  linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, Betty Dall
>  <betty.dall@hp.com>
> Subject: [PATCH v2 2/3] ACPI/APEI: Force fatal AER severity when bus has
>  been reset
> X-Mailer: git-send-email 1.7.7.6
> 
> The CPER error record has a reset bit that indicates that the platform
> has reset the bus. The reset bit can be set for any severity error
> including recoverable.  From the AER code path's perspective,
> any error is fatal if the bus has been reset. This patch upgrades the
> severity of the AER recovery to AER_FATAL whenever the CPER error record
> indicates that the bus has been reset.
> 
> Changes since v1:
> Fixed a typo in comment.
> 
> Signed-off-by: Betty Dall <betty.dall@hp.com>
> ---
> 
>  drivers/acpi/apei/ghes.c |   21 ++++++++++++++++++++-
>  1 files changed, 20 insertions(+), 1 deletions(-)
> 
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index d668a8a..1c67d5a 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -451,7 +451,26 @@ static void ghes_do_proc(struct ghes *ghes,
>  				int aer_severity;
>  				devfn = PCI_DEVFN(pcie_err->device_id.device,
>  						  pcie_err->device_id.function);
> -				aer_severity = cper_severity_to_aer(sev);
> +				/*
> +				 * Some Firmware First implementations
> +				 * put the device in SBR to contain
> +				 * the error. This is indicated by the
> +				 * CPER Section Descriptor Flags reset
> +				 * bit which means the component must
> +				 * be re-initialized or re-enabled
> +				 * prior to use. Promoting the AER
> +				 * serverity to FATAL will cause the
> +				 * AER code to link_reset and allow
> +				 * drivers to reprogram their cards.
> +				 */
> +				if (gdata->flags & CPER_SEC_RESET)
> +					aer_severity = cper_severity_to_aer(
> +							CPER_SEV_FATAL);
> +				else
> +					aer_severity =
> +						cper_severity_to_aer(sev);
> +
> +

How about this?
                                if (gdata->flags & CPER_SEC_RESET)
                                        sev = CPER_SEV_FATAL;
                                cper_severity_to_aer(sev);

>  				aer_recover_queue(pcie_err->device_id.segment,
>  						  pcie_err->device_id.bus,
>  						  devfn, aer_severity);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root port
  2013-05-30 14:39 ` [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root port Betty Dall
@ 2013-06-04  8:36   ` Chen Gong
  2013-06-04 18:31     ` Bjorn Helgaas
  2013-06-04 21:38     ` Betty Dall
  0 siblings, 2 replies; 21+ messages in thread
From: Chen Gong @ 2013-06-04  8:36 UTC (permalink / raw)
  To: Betty Dall; +Cc: rjw, bhelgaas, ying.huang, linux-acpi, linux-kernel, linux-pci

[-- Attachment #1: Type: text/plain, Size: 4154 bytes --]

On Thu, May 30, 2013 at 08:39:29AM -0600, Betty Dall wrote:
> Date:	Thu, 30 May 2013 08:39:29 -0600
> From: Betty Dall <betty.dall@hp.com>
> To: rjw@sisk.pl, bhelgaas@google.com
> Cc: ying.huang@intel.com, linux-acpi@vger.kernel.org,
>  linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, Betty Dall
>  <betty.dall@hp.com>
> Subject: [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root
>  port
> X-Mailer: git-send-email 1.7.7.6
> 
> The firmware first path does not install the aerdrv root port
> service driver, so the firmware first root port does not have
> a reset_link callback. When a firmware first root port does not have
> a reset_link callback, use a new default reset_link similar to what
> we already do for the default_downstream_reset_link(). The firmware
> first default reset_link brings the root port out of SBR if firmware
> left it in SBR.
> 
> Changes since v1:
> Removed incorrect setting of p2p_ctrl after the read.
> 
> Signed-off-by: Betty Dall <betty.dall@hp.com>
> ---
> 
>  drivers/pci/pcie/aer/aerdrv_core.c |   36 ++++++++++++++++++++++++++++++++++++
>  1 files changed, 36 insertions(+), 0 deletions(-)
> 
> 
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index 8ec8b4f..72f76cd 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -413,6 +413,39 @@ static pci_ers_result_t default_downstream_reset_link(struct pci_dev *dev)
>  	return PCI_ERS_RESULT_RECOVERED;
>  }
>  
> +/**
> + * default_ff_root_port_reset_link - default reset function for firmware
> + *		first Root Port
> + * @dev: pointer to root port's pci_dev data structure
> + *
> + * Invoked when performing link reset at Root Port w/ no aer driver.
> + * This happens through the firmware first path. Firmware may leave
> + * the Root Port in SBR and it is the OS responsiblity to bring it out
> + * of SBR.
> + */
> +static pci_ers_result_t default_ff_root_port_reset_link(struct pci_dev *dev)
> +{
> +	u16 p2p_ctrl;
> +
> +	/* Read Secondary Bus Reset */
> +	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &p2p_ctrl);
> +
> +	/* De-assert Secondary Bus Reset, if it is set */
> +	if (p2p_ctrl & PCI_BRIDGE_CTL_BUS_RESET) {
> +		p2p_ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> +		pci_write_config_word(dev, PCI_BRIDGE_CONTROL, p2p_ctrl);
> +
> +		/*
> +		 * System software must wait for at least 100ms from the end
> +		 * of a reset of one or more device before it is permitted
> +		 * to issue Configuration Requests to those devices.
> +		 */
> +		msleep(200);
> +		dev_dbg(&dev->dev, "Root Port link has been reset\n");
> +	}
> +	return PCI_ERS_RESULT_RECOVERED;
> +}

I don't think this function is OK.
1) You don't really reset the 2nd Bus but just checking its status.
I think you should have following steps to reset 2nd Bus:

a. Assert 2nd Bus Reset
b. wait for some time until this message has been broadcasted well
c. De-assert 2nd Bus Reset
d. wait for Trst time

IOW, since we have aer_do_secondary_bus_reset to perform secondary bus
reset, why you repeat it again?

2) msleep(200) is too long for kernel. You'd better yield the CPU when
sleep.

> +
>  static int find_aer_service_iter(struct device *device, void *data)
>  {
>  	struct pcie_port_service_driver *service_driver, **drv;
> @@ -460,6 +493,9 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
>  		status = driver->reset_link(udev);
>  	} else if (pci_pcie_type(udev) == PCI_EXP_TYPE_DOWNSTREAM) {
>  		status = default_downstream_reset_link(udev);
> +	} else if (pci_pcie_type(udev) == PCI_EXP_TYPE_ROOT_PORT &&
> +		pcie_aer_get_firmware_first(udev)) {
> +		status = default_ff_root_port_reset_link(udev);
>  	} else {
>  		dev_printk(KERN_DEBUG, &dev->dev,
>  			"no link-reset support at upstream device %s\n",
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse()
  2013-05-30 14:39 ` [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse() Betty Dall
  2013-06-02  0:38   ` Bjorn Helgaas
  2013-06-04  7:42   ` Chen Gong
@ 2013-06-04 13:13   ` Bjorn Helgaas
  2013-06-05  2:48     ` Chen Gong
  2 siblings, 1 reply; 21+ messages in thread
From: Bjorn Helgaas @ 2013-06-04 13:13 UTC (permalink / raw)
  To: Betty Dall
  Cc: rjw, ying.huang, linux-acpi, linux-kernel, linux-pci, gong.chen

On Thu, May 30, 2013 at 08:39:27AM -0600, Betty Dall wrote:
> The function aer_hest_parse() is called to determine if the given
> PCI device is firmware first or not. The code loops through each
> section of the HEST table to look for a match. The bug is that
> the function always returns whether the last HEST section is firmware
> first. The fix stops the iteration once the info.firmware_first
> variable is set.  This is similar to how the function aer_hest_parse_aff()
> stops the iteration.
> 
> Signed-off-by: Betty Dall <betty.dall@hp.com>
> ---
> 
>  drivers/pci/pcie/aer/aerdrv_acpi.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> 
> diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
> index 5194a7d..39b8671 100644
> --- a/drivers/pci/pcie/aer/aerdrv_acpi.c
> +++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
> @@ -42,6 +42,9 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
>  	u8 bridge = 0;
>  	int ff = 0;
>  
> +	if (info->firmware_first)
> +		return 0;
> +
>  	switch (hest_hdr->type) {
>  	case ACPI_HEST_TYPE_AER_ROOT_PORT:
>  		pcie_type = PCI_EXP_TYPE_ROOT_PORT;


1) I think dev->__aer_firmware_first should be initialized somewhere in the
pci_scan_single_device() path, e.g., maybe pci_init_capabilities().  It's
known at device add-time and never changes, so there's no point in doing
the lazy setup we do now.  That would let us get rid of
__aer_firmware_first_valid, too (along with the pointless "__" prefix).
This is just an observation, not a requirement for this patch set.

2) This is a band-aid that covers up the real problem, which is that we
update info->firmware_first even for non-matching devices.  I think we
should do something like the following instead:


commit c67612f272f1792a08f012f1b5ca37d5cfde5de4
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Mon Jun 3 16:49:12 2013 -0600

    PCI/AER: Don't parse HEST table for non-PCIe devices
    
    AER is a PCIe-only capability, so there's no point in trying to match
    a HEST PCIe structure with a non-PCIe device.
    
    Previously, a HEST global AER bridge entry (type 8) could incorrectly
    match *any* bridge, even a legacy PCI-PCI bridge, and a non-global
    HEST entry could match a legacy PCI device.
    
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
index 5194a7d..4f798ab 100644
--- a/drivers/pci/pcie/aer/aerdrv_acpi.c
+++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
@@ -59,8 +59,7 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
 
 	p = (struct acpi_hest_aer_common *)(hest_hdr + 1);
 	if (p->flags & ACPI_HEST_GLOBAL) {
-		if ((pci_is_pcie(info->pci_dev) &&
-		     pci_pcie_type(info->pci_dev) == pcie_type) || bridge)
+		if ((pci_pcie_type(info->pci_dev) == pcie_type) || bridge)
 			ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
 	} else
 		if (hest_match_pci(p, info->pci_dev))
@@ -89,6 +88,9 @@ static void aer_set_firmware_first(struct pci_dev *pci_dev)
 
 int pcie_aer_get_firmware_first(struct pci_dev *dev)
 {
+	if (!pci_is_pcie(dev))
+		return 0;
+
 	if (!dev->__aer_firmware_first_valid)
 		aer_set_firmware_first(dev);
 	return dev->__aer_firmware_first;

commit 947da50270686b0d70f4bc2f7323ef7229489ecb
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Mon Jun 3 15:42:00 2013 -0600

    PCI/AER: Factor out HEST device type matching
    
    This factors out the matching of HEST structure type and PCIe device type
    to improve readability.  No functional change.
    
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
index 4f798ab..56e2d94 100644
--- a/drivers/pci/pcie/aer/aerdrv_acpi.c
+++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
@@ -29,6 +29,22 @@ static inline int hest_match_pci(struct acpi_hest_aer_common *p,
 		 p->function == PCI_FUNC(pci->devfn));
 }
 
+static inline bool hest_match_type(struct acpi_hest_header *hest_hdr,
+				   struct pci_dev *dev)
+{
+	u16 hest_type = hest_hdr->type;
+	u8 pcie_type = pci_pcie_type(dev);
+
+	if ((hest_type == ACPI_HEST_TYPE_AER_ROOT_PORT &&
+	     pcie_type == PCI_EXP_TYPE_ROOT_PORT) ||
+	    (hest_type == ACPI_HEST_TYPE_AER_ENDPOINT &&
+	     pcie_type == PCI_EXP_TYPE_ENDPOINT) ||
+	    (hest_type == ACPI_HEST_TYPE_AER_BRIDGE &&
+	     (dev->class >> 16) == PCI_BASE_CLASS_BRIDGE))
+		return true;
+	return false;
+}
+
 struct aer_hest_parse_info {
 	struct pci_dev *pci_dev;
 	int firmware_first;
@@ -38,28 +54,11 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
 {
 	struct aer_hest_parse_info *info = data;
 	struct acpi_hest_aer_common *p;
-	u8 pcie_type = 0;
-	u8 bridge = 0;
 	int ff = 0;
 
-	switch (hest_hdr->type) {
-	case ACPI_HEST_TYPE_AER_ROOT_PORT:
-		pcie_type = PCI_EXP_TYPE_ROOT_PORT;
-		break;
-	case ACPI_HEST_TYPE_AER_ENDPOINT:
-		pcie_type = PCI_EXP_TYPE_ENDPOINT;
-		break;
-	case ACPI_HEST_TYPE_AER_BRIDGE:
-		if ((info->pci_dev->class >> 16) == PCI_BASE_CLASS_BRIDGE)
-			bridge = 1;
-		break;
-	default:
-		return 0;
-	}
-
 	p = (struct acpi_hest_aer_common *)(hest_hdr + 1);
 	if (p->flags & ACPI_HEST_GLOBAL) {
-		if ((pci_pcie_type(info->pci_dev) == pcie_type) || bridge)
+		if (hest_match_type(hest_hdr, info->pci_dev))
 			ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
 	} else
 		if (hest_match_pci(p, info->pci_dev))

commit e9f977a04d96a54c4f6aa0b831e725dab2154364
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Mon Jun 3 19:47:27 2013 -0600

    PCI/AER: Set dev->__aer_firmware_first only for matching devices
    
    Previously, we always updated info->firmware_first, even for HEST entries
    that didn't match the device.  Therefore, if the last HEST descriptor was
    a PCIe structure that didn't match the device, we always cleared
    dev->__aer_firmware_first.
    
    Based-on-patch-by: Betty Dall <betty.dall@hp.com>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
index 56e2d94..2bedad8 100644
--- a/drivers/pci/pcie/aer/aerdrv_acpi.c
+++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
@@ -54,16 +54,16 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
 {
 	struct aer_hest_parse_info *info = data;
 	struct acpi_hest_aer_common *p;
-	int ff = 0;
+	int ff;
 
 	p = (struct acpi_hest_aer_common *)(hest_hdr + 1);
-	if (p->flags & ACPI_HEST_GLOBAL) {
+	ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
+	if (p->flags & ACPI_HEST_GLOBAL)
 		if (hest_match_type(hest_hdr, info->pci_dev))
-			ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
-	} else
+			info->firmware_first = ff;
+	else
 		if (hest_match_pci(p, info->pci_dev))
-			ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
-	info->firmware_first = ff;
+			info->firmware_first = ff;
 
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/3] ACPI/APEI: Force fatal AER severity when bus has been reset
  2013-06-04  7:53   ` Chen Gong
@ 2013-06-04 16:20     ` Betty Dall
  2013-06-04 17:54     ` Bjorn Helgaas
  1 sibling, 0 replies; 21+ messages in thread
From: Betty Dall @ 2013-06-04 16:20 UTC (permalink / raw)
  To: Chen Gong; +Cc: rjw, bhelgaas, ying.huang, linux-acpi, linux-kernel, linux-pci

On Tue, 2013-06-04 at 03:53 -0400, Chen Gong wrote:
> On Thu, May 30, 2013 at 08:39:28AM -0600, Betty Dall wrote:
> > Date:	Thu, 30 May 2013 08:39:28 -0600
> > From: Betty Dall <betty.dall@hp.com>
> > To: rjw@sisk.pl, bhelgaas@google.com
> > Cc: ying.huang@intel.com, linux-acpi@vger.kernel.org,
> >  linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, Betty Dall
> >  <betty.dall@hp.com>
> > Subject: [PATCH v2 2/3] ACPI/APEI: Force fatal AER severity when bus has
> >  been reset
> > X-Mailer: git-send-email 1.7.7.6
> > 
> > The CPER error record has a reset bit that indicates that the platform
> > has reset the bus. The reset bit can be set for any severity error
> > including recoverable.  From the AER code path's perspective,
> > any error is fatal if the bus has been reset. This patch upgrades the
> > severity of the AER recovery to AER_FATAL whenever the CPER error record
> > indicates that the bus has been reset.
> > 
> > Changes since v1:
> > Fixed a typo in comment.
> > 
> > Signed-off-by: Betty Dall <betty.dall@hp.com>
> > ---
> > 
> >  drivers/acpi/apei/ghes.c |   21 ++++++++++++++++++++-
> >  1 files changed, 20 insertions(+), 1 deletions(-)
> > 
> > 
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > index d668a8a..1c67d5a 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -451,7 +451,26 @@ static void ghes_do_proc(struct ghes *ghes,
> >  				int aer_severity;
> >  				devfn = PCI_DEVFN(pcie_err->device_id.device,
> >  						  pcie_err->device_id.function);
> > -				aer_severity = cper_severity_to_aer(sev);
> > +				/*
> > +				 * Some Firmware First implementations
> > +				 * put the device in SBR to contain
> > +				 * the error. This is indicated by the
> > +				 * CPER Section Descriptor Flags reset
> > +				 * bit which means the component must
> > +				 * be re-initialized or re-enabled
> > +				 * prior to use. Promoting the AER
> > +				 * serverity to FATAL will cause the
> > +				 * AER code to link_reset and allow
> > +				 * drivers to reprogram their cards.
> > +				 */
> > +				if (gdata->flags & CPER_SEC_RESET)
> > +					aer_severity = cper_severity_to_aer(
> > +							CPER_SEV_FATAL);
> > +				else
> > +					aer_severity =
> > +						cper_severity_to_aer(sev);
> > +
> > +
> 
> How about this?
>                                 if (gdata->flags & CPER_SEC_RESET)
>                                         sev = CPER_SEV_FATAL;
>                                 cper_severity_to_aer(sev);

Thanks for the review. I will make that change.

-Betty
> 
> >  				aer_recover_queue(pcie_err->device_id.segment,
> >  						  pcie_err->device_id.bus,
> >  						  devfn, aer_severity);
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse()
  2013-06-03 21:18     ` Betty Dall
@ 2013-06-04 17:39       ` Bjorn Helgaas
  0 siblings, 0 replies; 21+ messages in thread
From: Bjorn Helgaas @ 2013-06-04 17:39 UTC (permalink / raw)
  To: Betty Dall
  Cc: Rafael J. Wysocki, Huang Ying, linux-acpi, linux-kernel,
	linux-pci, Moore, Robert

On Mon, Jun 3, 2013 at 3:18 PM, Betty Dall <betty.dall@hp.com> wrote:
> On Sat, 2013-06-01 at 18:38 -0600, Bjorn Helgaas wrote:
>> [+cc Bob for ACPI HEST spec questions]
>>
>> On Thu, May 30, 2013 at 8:39 AM, Betty Dall <betty.dall@hp.com> wrote:
>> > The function aer_hest_parse() is called to determine if the given
>> > PCI device is firmware first or not. The code loops through each
>> > section of the HEST table to look for a match. The bug is that
>> > the function always returns whether the last HEST section is firmware
>> > first. The fix stops the iteration once the info.firmware_first
>> > variable is set.  This is similar to how the function aer_hest_parse_aff()
>> > stops the iteration.
>> >
>> > Signed-off-by: Betty Dall <betty.dall@hp.com>
>> > ---
>> >
>> >  drivers/pci/pcie/aer/aerdrv_acpi.c |    3 +++
>> >  1 files changed, 3 insertions(+), 0 deletions(-)
>> >
>> >
>> > diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
>> > index 5194a7d..39b8671 100644
>> > --- a/drivers/pci/pcie/aer/aerdrv_acpi.c
>> > +++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
>> > @@ -42,6 +42,9 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
>> >         u8 bridge = 0;
>> >         int ff = 0;
>> >
>> > +       if (info->firmware_first)
>> > +               return 0;
>> > +
>> >         switch (hest_hdr->type) {
>> >         case ACPI_HEST_TYPE_AER_ROOT_PORT:
>> >                 pcie_type = PCI_EXP_TYPE_ROOT_PORT;
>>
>> Not related directly to your patch, Betty, but I can't figure out why
>> the ACPI spec defines the HEST structures for PCIe as it does.  I'm
>> looking at ACPI 5.0, sec 18.3.2.3 - 18.3.2.5.
>
>> 1) The PCIe Root Port, PCIe Device, and PCIe/PCI-X Bridge structures
>> all include Bus, Device, and Function fields.  But there's no Segment.
>>  The current Linux code (hest_match_pci()) assumes HEST records can
>> only apply to PCI domain 0.  Is Linux missing something, or is the
>> HEST really this limited?
> You are right that the HEST table does not have the Segment for the PCIe
> sources. The Linux code uses the Generic Source type that points to a
> UEFI CPER record. Those do have the Segment. The code in
> acpi/apei/ghes.c that parses the HEST and invokes the
> aer_recover_queue() is using the segment from the CPER record.

My question has nothing to do with CPER.  Drivers can enable error
reporting for their devices with pci_enable_pcie_error_reporting().
If the device is marked "firmware-first" in the HEST, this call fails
without enabling reporting.  The HEST can only mark devices in domain
0 as "firmware-first."  Therefore pci_enable_pcie_error_reporting()
will enable reporting for:

  - domain 0 devices not marked "firmware-first" and
  - all devices in other domains.

This inconsistency seems like a hole in the spec, but maybe I'm
missing something.

Bjorn

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/3] ACPI/APEI: Force fatal AER severity when bus has been reset
  2013-06-04  7:53   ` Chen Gong
  2013-06-04 16:20     ` Betty Dall
@ 2013-06-04 17:54     ` Bjorn Helgaas
  2013-06-05  1:56       ` Chen Gong
  1 sibling, 1 reply; 21+ messages in thread
From: Bjorn Helgaas @ 2013-06-04 17:54 UTC (permalink / raw)
  To: Betty Dall, rjw, ying.huang, linux-acpi, linux-kernel, linux-pci

On Tue, Jun 04, 2013 at 03:53:36AM -0400, Chen Gong wrote:
> On Thu, May 30, 2013 at 08:39:28AM -0600, Betty Dall wrote:
> > Date:	Thu, 30 May 2013 08:39:28 -0600
> > From: Betty Dall <betty.dall@hp.com>
> > To: rjw@sisk.pl, bhelgaas@google.com
> > Cc: ying.huang@intel.com, linux-acpi@vger.kernel.org,
> >  linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, Betty Dall
> >  <betty.dall@hp.com>
> > Subject: [PATCH v2 2/3] ACPI/APEI: Force fatal AER severity when bus has
> >  been reset
> > X-Mailer: git-send-email 1.7.7.6
> > 
> > The CPER error record has a reset bit that indicates that the platform
> > has reset the bus. The reset bit can be set for any severity error
> > including recoverable.  From the AER code path's perspective,
> > any error is fatal if the bus has been reset. This patch upgrades the
> > severity of the AER recovery to AER_FATAL whenever the CPER error record
> > indicates that the bus has been reset.
> > 
> > Changes since v1:
> > Fixed a typo in comment.
> > 
> > Signed-off-by: Betty Dall <betty.dall@hp.com>
> > ---
> > 
> >  drivers/acpi/apei/ghes.c |   21 ++++++++++++++++++++-
> >  1 files changed, 20 insertions(+), 1 deletions(-)
> > 
> > 
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > index d668a8a..1c67d5a 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -451,7 +451,26 @@ static void ghes_do_proc(struct ghes *ghes,
> >  				int aer_severity;
> >  				devfn = PCI_DEVFN(pcie_err->device_id.device,
> >  						  pcie_err->device_id.function);
> > -				aer_severity = cper_severity_to_aer(sev);
> > +				/*
> > +				 * Some Firmware First implementations
> > +				 * put the device in SBR to contain
> > +				 * the error. This is indicated by the
> > +				 * CPER Section Descriptor Flags reset
> > +				 * bit which means the component must
> > +				 * be re-initialized or re-enabled
> > +				 * prior to use. Promoting the AER
> > +				 * serverity to FATAL will cause the
> > +				 * AER code to link_reset and allow
> > +				 * drivers to reprogram their cards.
> > +				 */
> > +				if (gdata->flags & CPER_SEC_RESET)
> > +					aer_severity = cper_severity_to_aer(
> > +							CPER_SEV_FATAL);
> > +				else
> > +					aer_severity =
> > +						cper_severity_to_aer(sev);
> > +
> > +
> 
> How about this?
>                                 if (gdata->flags & CPER_SEC_RESET)
>                                         sev = CPER_SEV_FATAL;
>                                 cper_severity_to_aer(sev);

No.  If the object is to make the severity AER_FATAL, you should just
do that.  You shouldn't fiddle around with the CPER severity, because
then you depend on the mapping performed by cper_severity_to_aer().

> 
> >  				aer_recover_queue(pcie_err->device_id.segment,
> >  						  pcie_err->device_id.bus,
> >  						  devfn, aer_severity);

In other words, something like the patch below.  I don't really care
if you use the original "if" above that only sets aer_severity once,
or if you overwrite it as below.  I overwrote it because it doesn't
wrap as many lines.  ghes_do_proc() really should just call helpers
so the interesting code doesn't have to be indented three and four
tab stops.


ACPI/APEI: Force fatal AER severity when component has been reset
    
The CPER error record has a reset bit that indicates that the platform has
reset the component.  The reset bit can be set for any severity error
including recoverable.  From the AER code path's perspective, any error is
fatal if the component has been reset. This patch upgrades the severity of
the AER recovery to AER_FATAL in this case.
    
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index d668a8a..ab31551 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -449,9 +449,19 @@ static void ghes_do_proc(struct ghes *ghes,
 			    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
 				unsigned int devfn;
 				int aer_severity;
+
 				devfn = PCI_DEVFN(pcie_err->device_id.device,
 						  pcie_err->device_id.function);
 				aer_severity = cper_severity_to_aer(sev);
+
+				/*
+				 * If firmware reset the component to contain
+				 * the error, we must reinitialize it before
+				 * use, so treat it as a fatal AER error.
+				 */
+				if (gdata->flags & CPER_SEC_RESET)
+					aer_severity = AER_FATAL;
+
 				aer_recover_queue(pcie_err->device_id.segment,
 						  pcie_err->device_id.bus,
 						  devfn, aer_severity);

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root port
  2013-06-04  8:36   ` Chen Gong
@ 2013-06-04 18:31     ` Bjorn Helgaas
  2013-06-04 21:38     ` Betty Dall
  1 sibling, 0 replies; 21+ messages in thread
From: Bjorn Helgaas @ 2013-06-04 18:31 UTC (permalink / raw)
  To: Betty Dall, Rafael J. Wysocki, Bjorn Helgaas, Huang Ying,
	linux-acpi, linux-kernel, linux-pci

On Tue, Jun 4, 2013 at 2:36 AM, Chen Gong <gong.chen@linux.intel.com> wrote:
> On Thu, May 30, 2013 at 08:39:29AM -0600, Betty Dall wrote:
>> Date: Thu, 30 May 2013 08:39:29 -0600
>> From: Betty Dall <betty.dall@hp.com>
>> To: rjw@sisk.pl, bhelgaas@google.com
>> Cc: ying.huang@intel.com, linux-acpi@vger.kernel.org,
>>  linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, Betty Dall
>>  <betty.dall@hp.com>
>> Subject: [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root
>>  port
>> X-Mailer: git-send-email 1.7.7.6
>>
>> The firmware first path does not install the aerdrv root port
>> service driver, so the firmware first root port does not have
>> a reset_link callback. When a firmware first root port does not have
>> a reset_link callback, use a new default reset_link similar to what
>> we already do for the default_downstream_reset_link(). The firmware
>> first default reset_link brings the root port out of SBR if firmware
>> left it in SBR.
>>
>> Changes since v1:
>> Removed incorrect setting of p2p_ctrl after the read.
>>
>> Signed-off-by: Betty Dall <betty.dall@hp.com>
>> ---
>>
>>  drivers/pci/pcie/aer/aerdrv_core.c |   36 ++++++++++++++++++++++++++++++++++++
>>  1 files changed, 36 insertions(+), 0 deletions(-)
>>
>>
>> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
>> index 8ec8b4f..72f76cd 100644
>> --- a/drivers/pci/pcie/aer/aerdrv_core.c
>> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
>> @@ -413,6 +413,39 @@ static pci_ers_result_t default_downstream_reset_link(struct pci_dev *dev)
>>       return PCI_ERS_RESULT_RECOVERED;
>>  }
>>
>> +/**
>> + * default_ff_root_port_reset_link - default reset function for firmware
>> + *           first Root Port
>> + * @dev: pointer to root port's pci_dev data structure
>> + *
>> + * Invoked when performing link reset at Root Port w/ no aer driver.
>> + * This happens through the firmware first path. Firmware may leave
>> + * the Root Port in SBR and it is the OS responsiblity to bring it out
>> + * of SBR.
>> + */
>> +static pci_ers_result_t default_ff_root_port_reset_link(struct pci_dev *dev)
>> +{
>> +     u16 p2p_ctrl;
>> +
>> +     /* Read Secondary Bus Reset */
>> +     pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &p2p_ctrl);
>> +
>> +     /* De-assert Secondary Bus Reset, if it is set */
>> +     if (p2p_ctrl & PCI_BRIDGE_CTL_BUS_RESET) {
>> +             p2p_ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>> +             pci_write_config_word(dev, PCI_BRIDGE_CONTROL, p2p_ctrl);
>> +
>> +             /*
>> +              * System software must wait for at least 100ms from the end
>> +              * of a reset of one or more device before it is permitted
>> +              * to issue Configuration Requests to those devices.
>> +              */
>> +             msleep(200);
>> +             dev_dbg(&dev->dev, "Root Port link has been reset\n");
>> +     }
>> +     return PCI_ERS_RESULT_RECOVERED;
>> +}
>
> I don't think this function is OK.
> 1) You don't really reset the 2nd Bus but just checking its status.
> I think you should have following steps to reset 2nd Bus:
>
> a. Assert 2nd Bus Reset
> b. wait for some time until this message has been broadcasted well
> c. De-assert 2nd Bus Reset
> d. wait for Trst time
>
> IOW, since we have aer_do_secondary_bus_reset to perform secondary bus
> reset, why you repeat it again?
>
> 2) msleep(200) is too long for kernel. You'd better yield the CPU when
> sleep.
>
>> +
>>  static int find_aer_service_iter(struct device *device, void *data)
>>  {
>>       struct pcie_port_service_driver *service_driver, **drv;
>> @@ -460,6 +493,9 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
>>               status = driver->reset_link(udev);
>>       } else if (pci_pcie_type(udev) == PCI_EXP_TYPE_DOWNSTREAM) {
>>               status = default_downstream_reset_link(udev);
>> +     } else if (pci_pcie_type(udev) == PCI_EXP_TYPE_ROOT_PORT &&
>> +             pcie_aer_get_firmware_first(udev)) {
>> +             status = default_ff_root_port_reset_link(udev);

Can we just handle Root Ports the same as Downstream Ports, e.g.,

    if (pci_pcie_type(udev) == PCI_EXP_TYPE_DOWNSTREAM ||
        pci_pcie_type(udev) == PCI_EXP_TYPE_ROOT_PORT)
            status = default_downstream_reset_link(udev);

>>       } else {
>>               dev_printk(KERN_DEBUG, &dev->dev,
>>                       "no link-reset support at upstream device %s\n",
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root port
  2013-06-04  8:36   ` Chen Gong
  2013-06-04 18:31     ` Bjorn Helgaas
@ 2013-06-04 21:38     ` Betty Dall
  2013-06-04 22:15       ` Bjorn Helgaas
  1 sibling, 1 reply; 21+ messages in thread
From: Betty Dall @ 2013-06-04 21:38 UTC (permalink / raw)
  To: Chen Gong; +Cc: rjw, bhelgaas, ying.huang, linux-acpi, linux-kernel, linux-pci

On Tue, 2013-06-04 at 04:36 -0400, Chen Gong wrote:
> On Thu, May 30, 2013 at 08:39:29AM -0600, Betty Dall wrote:
> > Date:	Thu, 30 May 2013 08:39:29 -0600
> > From: Betty Dall <betty.dall@hp.com>
> > To: rjw@sisk.pl, bhelgaas@google.com
> > Cc: ying.huang@intel.com, linux-acpi@vger.kernel.org,
> >  linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, Betty Dall
> >  <betty.dall@hp.com>
> > Subject: [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root
> >  port
> > X-Mailer: git-send-email 1.7.7.6
> > 
> > The firmware first path does not install the aerdrv root port
> > service driver, so the firmware first root port does not have
> > a reset_link callback. When a firmware first root port does not have
> > a reset_link callback, use a new default reset_link similar to what
> > we already do for the default_downstream_reset_link(). The firmware
> > first default reset_link brings the root port out of SBR if firmware
> > left it in SBR.
> > 
> > Changes since v1:
> > Removed incorrect setting of p2p_ctrl after the read.
> > 
> > Signed-off-by: Betty Dall <betty.dall@hp.com>
> > ---
> > 
> >  drivers/pci/pcie/aer/aerdrv_core.c |   36 ++++++++++++++++++++++++++++++++++++
> >  1 files changed, 36 insertions(+), 0 deletions(-)
> > 
> > 
> > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> > index 8ec8b4f..72f76cd 100644
> > --- a/drivers/pci/pcie/aer/aerdrv_core.c
> > +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> > @@ -413,6 +413,39 @@ static pci_ers_result_t default_downstream_reset_link(struct pci_dev *dev)
> >  	return PCI_ERS_RESULT_RECOVERED;
> >  }
> >  
> > +/**
> > + * default_ff_root_port_reset_link - default reset function for firmware
> > + *		first Root Port
> > + * @dev: pointer to root port's pci_dev data structure
> > + *
> > + * Invoked when performing link reset at Root Port w/ no aer driver.
> > + * This happens through the firmware first path. Firmware may leave
> > + * the Root Port in SBR and it is the OS responsiblity to bring it out
> > + * of SBR.
> > + */
> > +static pci_ers_result_t default_ff_root_port_reset_link(struct pci_dev *dev)
> > +{
> > +	u16 p2p_ctrl;
> > +
> > +	/* Read Secondary Bus Reset */
> > +	pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &p2p_ctrl);
> > +
> > +	/* De-assert Secondary Bus Reset, if it is set */
> > +	if (p2p_ctrl & PCI_BRIDGE_CTL_BUS_RESET) {
> > +		p2p_ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> > +		pci_write_config_word(dev, PCI_BRIDGE_CONTROL, p2p_ctrl);
> > +
> > +		/*
> > +		 * System software must wait for at least 100ms from the end
> > +		 * of a reset of one or more device before it is permitted
> > +		 * to issue Configuration Requests to those devices.
> > +		 */
> > +		msleep(200);
> > +		dev_dbg(&dev->dev, "Root Port link has been reset\n");
> > +	}
> > +	return PCI_ERS_RESULT_RECOVERED;
> > +}
> 
> I don't think this function is OK.
> 1) You don't really reset the 2nd Bus but just checking its status.
> I think you should have following steps to reset 2nd Bus:
> 
> a. Assert 2nd Bus Reset
> b. wait for some time until this message has been broadcasted well
> c. De-assert 2nd Bus Reset
> d. wait for Trst time
> 
> IOW, since we have aer_do_secondary_bus_reset to perform secondary bus
> reset, why you repeat it again?
> 
> 2) msleep(200) is too long for kernel. You'd better yield the CPU when
> sleep.

The firmware first path currently has no reset_link. I want to make a
minimal change since resetting the link could be considered firmware's
job in the firmware first path. This change is to just check if SBR is
set, and bring the link out of reset only if it is in SBR.  This way, if
a another firmware first platform is already resetting the link, it
won't be done twice. I took the msleep and the code from
aer_do_sercondary_bus_reset() as you noticed. 


-Betty

> > +
> >  static int find_aer_service_iter(struct device *device, void *data)
> >  {
> >  	struct pcie_port_service_driver *service_driver, **drv;
> > @@ -460,6 +493,9 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
> >  		status = driver->reset_link(udev);
> >  	} else if (pci_pcie_type(udev) == PCI_EXP_TYPE_DOWNSTREAM) {
> >  		status = default_downstream_reset_link(udev);
> > +	} else if (pci_pcie_type(udev) == PCI_EXP_TYPE_ROOT_PORT &&
> > +		pcie_aer_get_firmware_first(udev)) {
> > +		status = default_ff_root_port_reset_link(udev);
> >  	} else {
> >  		dev_printk(KERN_DEBUG, &dev->dev,
> >  			"no link-reset support at upstream device %s\n",
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root port
  2013-06-04 21:38     ` Betty Dall
@ 2013-06-04 22:15       ` Bjorn Helgaas
  2013-06-05  1:56         ` Chen Gong
  2013-06-05 13:22         ` Betty Dall
  0 siblings, 2 replies; 21+ messages in thread
From: Bjorn Helgaas @ 2013-06-04 22:15 UTC (permalink / raw)
  To: Betty Dall
  Cc: Chen Gong, Rafael J. Wysocki, Huang Ying, linux-acpi,
	linux-kernel, linux-pci

On Tue, Jun 4, 2013 at 3:38 PM, Betty Dall <betty.dall@hp.com> wrote:
> On Tue, 2013-06-04 at 04:36 -0400, Chen Gong wrote:
>> On Thu, May 30, 2013 at 08:39:29AM -0600, Betty Dall wrote:
>> > Date:       Thu, 30 May 2013 08:39:29 -0600
>> > From: Betty Dall <betty.dall@hp.com>
>> > To: rjw@sisk.pl, bhelgaas@google.com
>> > Cc: ying.huang@intel.com, linux-acpi@vger.kernel.org,
>> >  linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, Betty Dall
>> >  <betty.dall@hp.com>
>> > Subject: [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root
>> >  port
>> > X-Mailer: git-send-email 1.7.7.6
>> >
>> > The firmware first path does not install the aerdrv root port
>> > service driver, so the firmware first root port does not have
>> > a reset_link callback. When a firmware first root port does not have
>> > a reset_link callback, use a new default reset_link similar to what
>> > we already do for the default_downstream_reset_link(). The firmware
>> > first default reset_link brings the root port out of SBR if firmware
>> > left it in SBR.
>> >
>> > Changes since v1:
>> > Removed incorrect setting of p2p_ctrl after the read.
>> >
>> > Signed-off-by: Betty Dall <betty.dall@hp.com>
>> > ---
>> >
>> >  drivers/pci/pcie/aer/aerdrv_core.c |   36 ++++++++++++++++++++++++++++++++++++
>> >  1 files changed, 36 insertions(+), 0 deletions(-)
>> >
>> >
>> > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
>> > index 8ec8b4f..72f76cd 100644
>> > --- a/drivers/pci/pcie/aer/aerdrv_core.c
>> > +++ b/drivers/pci/pcie/aer/aerdrv_core.c
>> > @@ -413,6 +413,39 @@ static pci_ers_result_t default_downstream_reset_link(struct pci_dev *dev)
>> >     return PCI_ERS_RESULT_RECOVERED;
>> >  }
>> >
>> > +/**
>> > + * default_ff_root_port_reset_link - default reset function for firmware
>> > + *         first Root Port
>> > + * @dev: pointer to root port's pci_dev data structure
>> > + *
>> > + * Invoked when performing link reset at Root Port w/ no aer driver.
>> > + * This happens through the firmware first path. Firmware may leave
>> > + * the Root Port in SBR and it is the OS responsiblity to bring it out
>> > + * of SBR.
>> > + */
>> > +static pci_ers_result_t default_ff_root_port_reset_link(struct pci_dev *dev)
>> > +{
>> > +   u16 p2p_ctrl;
>> > +
>> > +   /* Read Secondary Bus Reset */
>> > +   pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &p2p_ctrl);
>> > +
>> > +   /* De-assert Secondary Bus Reset, if it is set */
>> > +   if (p2p_ctrl & PCI_BRIDGE_CTL_BUS_RESET) {
>> > +           p2p_ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
>> > +           pci_write_config_word(dev, PCI_BRIDGE_CONTROL, p2p_ctrl);
>> > +
>> > +           /*
>> > +            * System software must wait for at least 100ms from the end
>> > +            * of a reset of one or more device before it is permitted
>> > +            * to issue Configuration Requests to those devices.
>> > +            */
>> > +           msleep(200);
>> > +           dev_dbg(&dev->dev, "Root Port link has been reset\n");
>> > +   }
>> > +   return PCI_ERS_RESULT_RECOVERED;
>> > +}
>>
>> I don't think this function is OK.
>> 1) You don't really reset the 2nd Bus but just checking its status.
>> I think you should have following steps to reset 2nd Bus:
>>
>> a. Assert 2nd Bus Reset
>> b. wait for some time until this message has been broadcasted well
>> c. De-assert 2nd Bus Reset
>> d. wait for Trst time
>>
>> IOW, since we have aer_do_secondary_bus_reset to perform secondary bus
>> reset, why you repeat it again?
>>
>> 2) msleep(200) is too long for kernel. You'd better yield the CPU when
>> sleep.
>
> The firmware first path currently has no reset_link. I want to make a
> minimal change since resetting the link could be considered firmware's
> job in the firmware first path. This change is to just check if SBR is
> set, and bring the link out of reset only if it is in SBR.  This way, if
> a another firmware first platform is already resetting the link, it
> won't be done twice. I took the msleep and the code from
> aer_do_sercondary_bus_reset() as you noticed.

I understand the desire to make a minimal change, and we certainly
don't want to make changes bigger than they need to be.

However, my goal is really to have clean, maintainable code at the
end.  If that means bigger patches where you can't test every affected
platform, that's a risk I'm willing to take.

In this particular case, we could try to limit the change to just the
platforms you can test, as you've done.  But it looks to me like we
actually want to reset the link in *all* cases, regardless of whether
firmware-first is involved.  We're handling a fatal error for a
device.  That device might be below a PCIe switch (where we currently
reset the link), or it might be directly below a root port (where we
currently don't).  Why should those cases be different?  I don't think
the device or the driver should be able to tell the difference.

My *guess* is that this is an oversight in the original code, and that
doing the link reset for both downstream ports and root ports will
make error handling for devices below root ports work better.

The *last* thing I want is code littered with special cases that are
only there because "they only fix the platforms I could actually
test."  It's just about impossible to clean those up later.

Bjorn

>> > +
>> >  static int find_aer_service_iter(struct device *device, void *data)
>> >  {
>> >     struct pcie_port_service_driver *service_driver, **drv;
>> > @@ -460,6 +493,9 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
>> >             status = driver->reset_link(udev);
>> >     } else if (pci_pcie_type(udev) == PCI_EXP_TYPE_DOWNSTREAM) {
>> >             status = default_downstream_reset_link(udev);
>> > +   } else if (pci_pcie_type(udev) == PCI_EXP_TYPE_ROOT_PORT &&
>> > +           pcie_aer_get_firmware_first(udev)) {
>> > +           status = default_ff_root_port_reset_link(udev);
>> >     } else {
>> >             dev_printk(KERN_DEBUG, &dev->dev,
>> >                     "no link-reset support at upstream device %s\n",
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> > Please read the FAQ at  http://www.tux.org/lkml/
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root port
  2013-06-04 22:15       ` Bjorn Helgaas
@ 2013-06-05  1:56         ` Chen Gong
  2013-06-05 13:22         ` Betty Dall
  1 sibling, 0 replies; 21+ messages in thread
From: Chen Gong @ 2013-06-05  1:56 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Betty Dall, Rafael J. Wysocki, Huang Ying, linux-acpi,
	linux-kernel, linux-pci

[-- Attachment #1: Type: text/plain, Size: 6429 bytes --]

On Tue, Jun 04, 2013 at 04:15:21PM -0600, Bjorn Helgaas wrote:
> Date: Tue, 4 Jun 2013 16:15:21 -0600
> From: Bjorn Helgaas <bhelgaas@google.com>
> To: Betty Dall <betty.dall@hp.com>
> Cc: Chen Gong <gong.chen@linux.intel.com>, "Rafael J. Wysocki"
>  <rjw@sisk.pl>, Huang Ying <ying.huang@intel.com>,
>  "linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
>  "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
>  "linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>
> Subject: Re: [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first
>  root port
> 
> On Tue, Jun 4, 2013 at 3:38 PM, Betty Dall <betty.dall@hp.com> wrote:
> > On Tue, 2013-06-04 at 04:36 -0400, Chen Gong wrote:
> >> On Thu, May 30, 2013 at 08:39:29AM -0600, Betty Dall wrote:
> >> > Date:       Thu, 30 May 2013 08:39:29 -0600
> >> > From: Betty Dall <betty.dall@hp.com>
> >> > To: rjw@sisk.pl, bhelgaas@google.com
> >> > Cc: ying.huang@intel.com, linux-acpi@vger.kernel.org,
> >> >  linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, Betty Dall
> >> >  <betty.dall@hp.com>
> >> > Subject: [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root
> >> >  port
> >> > X-Mailer: git-send-email 1.7.7.6
> >> >
> >> > The firmware first path does not install the aerdrv root port
> >> > service driver, so the firmware first root port does not have
> >> > a reset_link callback. When a firmware first root port does not have
> >> > a reset_link callback, use a new default reset_link similar to what
> >> > we already do for the default_downstream_reset_link(). The firmware
> >> > first default reset_link brings the root port out of SBR if firmware
> >> > left it in SBR.
> >> >
> >> > Changes since v1:
> >> > Removed incorrect setting of p2p_ctrl after the read.
> >> >
> >> > Signed-off-by: Betty Dall <betty.dall@hp.com>
> >> > ---
> >> >
> >> >  drivers/pci/pcie/aer/aerdrv_core.c |   36 ++++++++++++++++++++++++++++++++++++
> >> >  1 files changed, 36 insertions(+), 0 deletions(-)
> >> >
> >> >
> >> > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> >> > index 8ec8b4f..72f76cd 100644
> >> > --- a/drivers/pci/pcie/aer/aerdrv_core.c
> >> > +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> >> > @@ -413,6 +413,39 @@ static pci_ers_result_t default_downstream_reset_link(struct pci_dev *dev)
> >> >     return PCI_ERS_RESULT_RECOVERED;
> >> >  }
> >> >
> >> > +/**
> >> > + * default_ff_root_port_reset_link - default reset function for firmware
> >> > + *         first Root Port
> >> > + * @dev: pointer to root port's pci_dev data structure
> >> > + *
> >> > + * Invoked when performing link reset at Root Port w/ no aer driver.
> >> > + * This happens through the firmware first path. Firmware may leave
> >> > + * the Root Port in SBR and it is the OS responsiblity to bring it out
> >> > + * of SBR.
> >> > + */
> >> > +static pci_ers_result_t default_ff_root_port_reset_link(struct pci_dev *dev)
> >> > +{
> >> > +   u16 p2p_ctrl;
> >> > +
> >> > +   /* Read Secondary Bus Reset */
> >> > +   pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &p2p_ctrl);
> >> > +
> >> > +   /* De-assert Secondary Bus Reset, if it is set */
> >> > +   if (p2p_ctrl & PCI_BRIDGE_CTL_BUS_RESET) {
> >> > +           p2p_ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> >> > +           pci_write_config_word(dev, PCI_BRIDGE_CONTROL, p2p_ctrl);
> >> > +
> >> > +           /*
> >> > +            * System software must wait for at least 100ms from the end
> >> > +            * of a reset of one or more device before it is permitted
> >> > +            * to issue Configuration Requests to those devices.
> >> > +            */
> >> > +           msleep(200);
> >> > +           dev_dbg(&dev->dev, "Root Port link has been reset\n");
> >> > +   }
> >> > +   return PCI_ERS_RESULT_RECOVERED;
> >> > +}
> >>
> >> I don't think this function is OK.
> >> 1) You don't really reset the 2nd Bus but just checking its status.
> >> I think you should have following steps to reset 2nd Bus:
> >>
> >> a. Assert 2nd Bus Reset
> >> b. wait for some time until this message has been broadcasted well
> >> c. De-assert 2nd Bus Reset
> >> d. wait for Trst time
> >>
> >> IOW, since we have aer_do_secondary_bus_reset to perform secondary bus
> >> reset, why you repeat it again?
> >>
> >> 2) msleep(200) is too long for kernel. You'd better yield the CPU when
> >> sleep.
> >
> > The firmware first path currently has no reset_link. I want to make a
> > minimal change since resetting the link could be considered firmware's
> > job in the firmware first path. This change is to just check if SBR is
> > set, and bring the link out of reset only if it is in SBR.  This way, if
> > a another firmware first platform is already resetting the link, it
> > won't be done twice. I took the msleep and the code from
> > aer_do_sercondary_bus_reset() as you noticed.
> 
> I understand the desire to make a minimal change, and we certainly
> don't want to make changes bigger than they need to be.
> 
> However, my goal is really to have clean, maintainable code at the
> end.  If that means bigger patches where you can't test every affected
> platform, that's a risk I'm willing to take.
> 
> In this particular case, we could try to limit the change to just the
> platforms you can test, as you've done.  But it looks to me like we
> actually want to reset the link in *all* cases, regardless of whether
> firmware-first is involved.  We're handling a fatal error for a
> device.  That device might be below a PCIe switch (where we currently
> reset the link), or it might be directly below a root port (where we
> currently don't).  Why should those cases be different?  I don't think
> the device or the driver should be able to tell the difference.
> 
> My *guess* is that this is an oversight in the original code, and that
> doing the link reset for both downstream ports and root ports will
> make error handling for devices below root ports work better.
> 
> The *last* thing I want is code littered with special cases that are
> only there because "they only fix the platforms I could actually
> test."  It's just about impossible to clean those up later.
> 
> Bjorn

I agree with Bjorn's conclusion. Once we meet a bogus BIOS implementation,
it will be a disaster for that device.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 2/3] ACPI/APEI: Force fatal AER severity when bus has been reset
  2013-06-04 17:54     ` Bjorn Helgaas
@ 2013-06-05  1:56       ` Chen Gong
  0 siblings, 0 replies; 21+ messages in thread
From: Chen Gong @ 2013-06-05  1:56 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Betty Dall, rjw, ying.huang, linux-acpi, linux-kernel, linux-pci

[-- Attachment #1: Type: text/plain, Size: 5290 bytes --]

On Tue, Jun 04, 2013 at 11:54:34AM -0600, Bjorn Helgaas wrote:
> Date:	Tue, 4 Jun 2013 11:54:34 -0600
> From: Bjorn Helgaas <bhelgaas@google.com>
> To: Betty Dall <betty.dall@hp.com>, rjw@sisk.pl, ying.huang@intel.com,
>  linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
>  linux-pci@vger.kernel.org
> Subject: Re: [PATCH v2 2/3] ACPI/APEI: Force fatal AER severity when bus
>  has been reset
> User-Agent: Mutt/1.5.21 (2010-09-15)
> 
> On Tue, Jun 04, 2013 at 03:53:36AM -0400, Chen Gong wrote:
> > On Thu, May 30, 2013 at 08:39:28AM -0600, Betty Dall wrote:
> > > Date:	Thu, 30 May 2013 08:39:28 -0600
> > > From: Betty Dall <betty.dall@hp.com>
> > > To: rjw@sisk.pl, bhelgaas@google.com
> > > Cc: ying.huang@intel.com, linux-acpi@vger.kernel.org,
> > >  linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, Betty Dall
> > >  <betty.dall@hp.com>
> > > Subject: [PATCH v2 2/3] ACPI/APEI: Force fatal AER severity when bus has
> > >  been reset
> > > X-Mailer: git-send-email 1.7.7.6
> > > 
> > > The CPER error record has a reset bit that indicates that the platform
> > > has reset the bus. The reset bit can be set for any severity error
> > > including recoverable.  From the AER code path's perspective,
> > > any error is fatal if the bus has been reset. This patch upgrades the
> > > severity of the AER recovery to AER_FATAL whenever the CPER error record
> > > indicates that the bus has been reset.
> > > 
> > > Changes since v1:
> > > Fixed a typo in comment.
> > > 
> > > Signed-off-by: Betty Dall <betty.dall@hp.com>
> > > ---
> > > 
> > >  drivers/acpi/apei/ghes.c |   21 ++++++++++++++++++++-
> > >  1 files changed, 20 insertions(+), 1 deletions(-)
> > > 
> > > 
> > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > > index d668a8a..1c67d5a 100644
> > > --- a/drivers/acpi/apei/ghes.c
> > > +++ b/drivers/acpi/apei/ghes.c
> > > @@ -451,7 +451,26 @@ static void ghes_do_proc(struct ghes *ghes,
> > >  				int aer_severity;
> > >  				devfn = PCI_DEVFN(pcie_err->device_id.device,
> > >  						  pcie_err->device_id.function);
> > > -				aer_severity = cper_severity_to_aer(sev);
> > > +				/*
> > > +				 * Some Firmware First implementations
> > > +				 * put the device in SBR to contain
> > > +				 * the error. This is indicated by the
> > > +				 * CPER Section Descriptor Flags reset
> > > +				 * bit which means the component must
> > > +				 * be re-initialized or re-enabled
> > > +				 * prior to use. Promoting the AER
> > > +				 * serverity to FATAL will cause the
> > > +				 * AER code to link_reset and allow
> > > +				 * drivers to reprogram their cards.
> > > +				 */
> > > +				if (gdata->flags & CPER_SEC_RESET)
> > > +					aer_severity = cper_severity_to_aer(
> > > +							CPER_SEV_FATAL);
> > > +				else
> > > +					aer_severity =
> > > +						cper_severity_to_aer(sev);
> > > +
> > > +
> > 
> > How about this?
> >                                 if (gdata->flags & CPER_SEC_RESET)
> >                                         sev = CPER_SEV_FATAL;
> >                                 cper_severity_to_aer(sev);
> 
> No.  If the object is to make the severity AER_FATAL, you should just
> do that.  You shouldn't fiddle around with the CPER severity, because
> then you depend on the mapping performed by cper_severity_to_aer().
> 
> > 
> > >  				aer_recover_queue(pcie_err->device_id.segment,
> > >  						  pcie_err->device_id.bus,
> > >  						  devfn, aer_severity);
> 
> In other words, something like the patch below.  I don't really care
> if you use the original "if" above that only sets aer_severity once,
> or if you overwrite it as below.  I overwrote it because it doesn't
> wrap as many lines.  ghes_do_proc() really should just call helpers
> so the interesting code doesn't have to be indented three and four
> tab stops.
> 
> 
> ACPI/APEI: Force fatal AER severity when component has been reset
>     
> The CPER error record has a reset bit that indicates that the platform has
> reset the component.  The reset bit can be set for any severity error
> including recoverable.  From the AER code path's perspective, any error is
> fatal if the component has been reset. This patch upgrades the severity of
> the AER recovery to AER_FATAL in this case.
>     
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index d668a8a..ab31551 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -449,9 +449,19 @@ static void ghes_do_proc(struct ghes *ghes,
>  			    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
>  				unsigned int devfn;
>  				int aer_severity;
> +
>  				devfn = PCI_DEVFN(pcie_err->device_id.device,
>  						  pcie_err->device_id.function);
>  				aer_severity = cper_severity_to_aer(sev);
> +
> +				/*
> +				 * If firmware reset the component to contain
> +				 * the error, we must reinitialize it before
> +				 * use, so treat it as a fatal AER error.
> +				 */
> +				if (gdata->flags & CPER_SEC_RESET)
> +					aer_severity = AER_FATAL;
> +
>  				aer_recover_queue(pcie_err->device_id.segment,
>  						  pcie_err->device_id.bus,
>  						  devfn, aer_severity);
>

The patch looks good to me.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse()
  2013-06-04 13:13   ` Bjorn Helgaas
@ 2013-06-05  2:48     ` Chen Gong
  2013-06-05 13:38       ` Bjorn Helgaas
  0 siblings, 1 reply; 21+ messages in thread
From: Chen Gong @ 2013-06-05  2:48 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Betty Dall, rjw, ying.huang, linux-acpi, linux-kernel, linux-pci

[-- Attachment #1: Type: text/plain, Size: 8547 bytes --]

On Tue, Jun 04, 2013 at 07:13:24AM -0600, Bjorn Helgaas wrote:
> Date: Tue, 4 Jun 2013 07:13:24 -0600
> From: Bjorn Helgaas <bhelgaas@google.com>
> To: Betty Dall <betty.dall@hp.com>
> Cc: rjw@sisk.pl, ying.huang@intel.com, linux-acpi@vger.kernel.org,
>  linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
>  gong.chen@linux.intel.com
> Subject: Re: [PATCH v2 1/3] PCI/AER: Fix incorrect return from
>  aer_hest_parse()
> User-Agent: Mutt/1.5.21 (2010-09-15)
> 
> On Thu, May 30, 2013 at 08:39:27AM -0600, Betty Dall wrote:
> > The function aer_hest_parse() is called to determine if the given
> > PCI device is firmware first or not. The code loops through each
> > section of the HEST table to look for a match. The bug is that
> > the function always returns whether the last HEST section is firmware
> > first. The fix stops the iteration once the info.firmware_first
> > variable is set.  This is similar to how the function aer_hest_parse_aff()
> > stops the iteration.
> > 
> > Signed-off-by: Betty Dall <betty.dall@hp.com>
> > ---
> > 
> >  drivers/pci/pcie/aer/aerdrv_acpi.c |    3 +++
> >  1 files changed, 3 insertions(+), 0 deletions(-)
> > 
> > 
> > diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
> > index 5194a7d..39b8671 100644
> > --- a/drivers/pci/pcie/aer/aerdrv_acpi.c
> > +++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
> > @@ -42,6 +42,9 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
> >  	u8 bridge = 0;
> >  	int ff = 0;
> >  
> > +	if (info->firmware_first)
> > +		return 0;
> > +
> >  	switch (hest_hdr->type) {
> >  	case ACPI_HEST_TYPE_AER_ROOT_PORT:
> >  		pcie_type = PCI_EXP_TYPE_ROOT_PORT;
> 
> 
> 1) I think dev->__aer_firmware_first should be initialized somewhere in the
> pci_scan_single_device() path, e.g., maybe pci_init_capabilities().  It's
> known at device add-time and never changes, so there's no point in doing
> the lazy setup we do now.  That would let us get rid of
> __aer_firmware_first_valid, too (along with the pointless "__" prefix).
> This is just an observation, not a requirement for this patch set.
> 
> 2) This is a band-aid that covers up the real problem, which is that we
> update info->firmware_first even for non-matching devices.  I think we
> should do something like the following instead:
> 
> 
> commit c67612f272f1792a08f012f1b5ca37d5cfde5de4
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date:   Mon Jun 3 16:49:12 2013 -0600
> 
>     PCI/AER: Don't parse HEST table for non-PCIe devices
>     
>     AER is a PCIe-only capability, so there's no point in trying to match
>     a HEST PCIe structure with a non-PCIe device.
>     
>     Previously, a HEST global AER bridge entry (type 8) could incorrectly
>     match *any* bridge, even a legacy PCI-PCI bridge, and a non-global
>     HEST entry could match a legacy PCI device.
>     
If so, it should be a BIOS bug, right? (BIOS should not contain PCI device
for AER structure).
And I agree that your patch tries to avoid this scenario, but even one PCI
device is set FFM enabled, as you mentioned above, AER is only for PCIe,
which means a PCI device will not trigger such an event, so it should be
harmless, in principle. But in practice, I totally agree with you that
should be fixed.

Reviewed-by: Chen Gong <gong.chen@linux.intel.com>

>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> 
> diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
> index 5194a7d..4f798ab 100644
> --- a/drivers/pci/pcie/aer/aerdrv_acpi.c
> +++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
> @@ -59,8 +59,7 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
>  
>  	p = (struct acpi_hest_aer_common *)(hest_hdr + 1);
>  	if (p->flags & ACPI_HEST_GLOBAL) {
> -		if ((pci_is_pcie(info->pci_dev) &&
> -		     pci_pcie_type(info->pci_dev) == pcie_type) || bridge)
> +		if ((pci_pcie_type(info->pci_dev) == pcie_type) || bridge)
>  			ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
>  	} else
>  		if (hest_match_pci(p, info->pci_dev))
> @@ -89,6 +88,9 @@ static void aer_set_firmware_first(struct pci_dev *pci_dev)
>  
>  int pcie_aer_get_firmware_first(struct pci_dev *dev)
>  {
> +	if (!pci_is_pcie(dev))
> +		return 0;
> +
>  	if (!dev->__aer_firmware_first_valid)
>  		aer_set_firmware_first(dev);
>  	return dev->__aer_firmware_first;
> 
> commit 947da50270686b0d70f4bc2f7323ef7229489ecb
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date:   Mon Jun 3 15:42:00 2013 -0600
> 
>     PCI/AER: Factor out HEST device type matching
>     
>     This factors out the matching of HEST structure type and PCIe device type
>     to improve readability.  No functional change.
>     
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> 

Reviewed-by: Chen Gong <gong.chen@linux.intel.com>

> diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
> index 4f798ab..56e2d94 100644
> --- a/drivers/pci/pcie/aer/aerdrv_acpi.c
> +++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
> @@ -29,6 +29,22 @@ static inline int hest_match_pci(struct acpi_hest_aer_common *p,
>  		 p->function == PCI_FUNC(pci->devfn));
>  }
>  
> +static inline bool hest_match_type(struct acpi_hest_header *hest_hdr,
> +				   struct pci_dev *dev)
> +{
> +	u16 hest_type = hest_hdr->type;
> +	u8 pcie_type = pci_pcie_type(dev);
> +
> +	if ((hest_type == ACPI_HEST_TYPE_AER_ROOT_PORT &&
> +	     pcie_type == PCI_EXP_TYPE_ROOT_PORT) ||
> +	    (hest_type == ACPI_HEST_TYPE_AER_ENDPOINT &&
> +	     pcie_type == PCI_EXP_TYPE_ENDPOINT) ||
> +	    (hest_type == ACPI_HEST_TYPE_AER_BRIDGE &&
> +	     (dev->class >> 16) == PCI_BASE_CLASS_BRIDGE))
> +		return true;
> +	return false;
> +}
> +
>  struct aer_hest_parse_info {
>  	struct pci_dev *pci_dev;
>  	int firmware_first;
> @@ -38,28 +54,11 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
>  {
>  	struct aer_hest_parse_info *info = data;
>  	struct acpi_hest_aer_common *p;
> -	u8 pcie_type = 0;
> -	u8 bridge = 0;
>  	int ff = 0;
>  
> -	switch (hest_hdr->type) {
> -	case ACPI_HEST_TYPE_AER_ROOT_PORT:
> -		pcie_type = PCI_EXP_TYPE_ROOT_PORT;
> -		break;
> -	case ACPI_HEST_TYPE_AER_ENDPOINT:
> -		pcie_type = PCI_EXP_TYPE_ENDPOINT;
> -		break;
> -	case ACPI_HEST_TYPE_AER_BRIDGE:
> -		if ((info->pci_dev->class >> 16) == PCI_BASE_CLASS_BRIDGE)
> -			bridge = 1;
> -		break;
> -	default:
> -		return 0;
> -	}
> -
>  	p = (struct acpi_hest_aer_common *)(hest_hdr + 1);
>  	if (p->flags & ACPI_HEST_GLOBAL) {
> -		if ((pci_pcie_type(info->pci_dev) == pcie_type) || bridge)
> +		if (hest_match_type(hest_hdr, info->pci_dev))
>  			ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
>  	} else
>  		if (hest_match_pci(p, info->pci_dev))
> 
> commit e9f977a04d96a54c4f6aa0b831e725dab2154364
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date:   Mon Jun 3 19:47:27 2013 -0600
> 
>     PCI/AER: Set dev->__aer_firmware_first only for matching devices
>     
>     Previously, we always updated info->firmware_first, even for HEST entries
>     that didn't match the device.  Therefore, if the last HEST descriptor was
>     a PCIe structure that didn't match the device, we always cleared
>     dev->__aer_firmware_first.
>     
>     Based-on-patch-by: Betty Dall <betty.dall@hp.com>
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
 
Reviewed-by: Chen Gong <gong.chen@linux.intel.com>

> diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
> index 56e2d94..2bedad8 100644
> --- a/drivers/pci/pcie/aer/aerdrv_acpi.c
> +++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
> @@ -54,16 +54,16 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
>  {
>  	struct aer_hest_parse_info *info = data;
>  	struct acpi_hest_aer_common *p;
> -	int ff = 0;
> +	int ff;
>  
>  	p = (struct acpi_hest_aer_common *)(hest_hdr + 1);
> -	if (p->flags & ACPI_HEST_GLOBAL) {
> +	ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
> +	if (p->flags & ACPI_HEST_GLOBAL)
>  		if (hest_match_type(hest_hdr, info->pci_dev))
> -			ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
> -	} else
> +			info->firmware_first = ff;
> +	else
>  		if (hest_match_pci(p, info->pci_dev))
> -			ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
> -	info->firmware_first = ff;
> +			info->firmware_first = ff;
>  
>  	return 0;
>  }

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root port
  2013-06-04 22:15       ` Bjorn Helgaas
  2013-06-05  1:56         ` Chen Gong
@ 2013-06-05 13:22         ` Betty Dall
  1 sibling, 0 replies; 21+ messages in thread
From: Betty Dall @ 2013-06-05 13:22 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Chen Gong, Rafael J. Wysocki, Huang Ying, linux-acpi,
	linux-kernel, linux-pci

On Tue, 2013-06-04 at 16:15 -0600, Bjorn Helgaas wrote:
> On Tue, Jun 4, 2013 at 3:38 PM, Betty Dall <betty.dall@hp.com> wrote:
> > On Tue, 2013-06-04 at 04:36 -0400, Chen Gong wrote:
> >> On Thu, May 30, 2013 at 08:39:29AM -0600, Betty Dall wrote:
> >> > Date:       Thu, 30 May 2013 08:39:29 -0600
> >> > From: Betty Dall <betty.dall@hp.com>
> >> > To: rjw@sisk.pl, bhelgaas@google.com
> >> > Cc: ying.huang@intel.com, linux-acpi@vger.kernel.org,
> >> >  linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, Betty Dall
> >> >  <betty.dall@hp.com>
> >> > Subject: [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root
> >> >  port
> >> > X-Mailer: git-send-email 1.7.7.6
> >> >
> >> > The firmware first path does not install the aerdrv root port
> >> > service driver, so the firmware first root port does not have
> >> > a reset_link callback. When a firmware first root port does not have
> >> > a reset_link callback, use a new default reset_link similar to what
> >> > we already do for the default_downstream_reset_link(). The firmware
> >> > first default reset_link brings the root port out of SBR if firmware
> >> > left it in SBR.
> >> >
> >> > Changes since v1:
> >> > Removed incorrect setting of p2p_ctrl after the read.
> >> >
> >> > Signed-off-by: Betty Dall <betty.dall@hp.com>
> >> > ---
> >> >
> >> >  drivers/pci/pcie/aer/aerdrv_core.c |   36 ++++++++++++++++++++++++++++++++++++
> >> >  1 files changed, 36 insertions(+), 0 deletions(-)
> >> >
> >> >
> >> > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> >> > index 8ec8b4f..72f76cd 100644
> >> > --- a/drivers/pci/pcie/aer/aerdrv_core.c
> >> > +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> >> > @@ -413,6 +413,39 @@ static pci_ers_result_t default_downstream_reset_link(struct pci_dev *dev)
> >> >     return PCI_ERS_RESULT_RECOVERED;
> >> >  }
> >> >
> >> > +/**
> >> > + * default_ff_root_port_reset_link - default reset function for firmware
> >> > + *         first Root Port
> >> > + * @dev: pointer to root port's pci_dev data structure
> >> > + *
> >> > + * Invoked when performing link reset at Root Port w/ no aer driver.
> >> > + * This happens through the firmware first path. Firmware may leave
> >> > + * the Root Port in SBR and it is the OS responsiblity to bring it out
> >> > + * of SBR.
> >> > + */
> >> > +static pci_ers_result_t default_ff_root_port_reset_link(struct pci_dev *dev)
> >> > +{
> >> > +   u16 p2p_ctrl;
> >> > +
> >> > +   /* Read Secondary Bus Reset */
> >> > +   pci_read_config_word(dev, PCI_BRIDGE_CONTROL, &p2p_ctrl);
> >> > +
> >> > +   /* De-assert Secondary Bus Reset, if it is set */
> >> > +   if (p2p_ctrl & PCI_BRIDGE_CTL_BUS_RESET) {
> >> > +           p2p_ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> >> > +           pci_write_config_word(dev, PCI_BRIDGE_CONTROL, p2p_ctrl);
> >> > +
> >> > +           /*
> >> > +            * System software must wait for at least 100ms from the end
> >> > +            * of a reset of one or more device before it is permitted
> >> > +            * to issue Configuration Requests to those devices.
> >> > +            */
> >> > +           msleep(200);
> >> > +           dev_dbg(&dev->dev, "Root Port link has been reset\n");
> >> > +   }
> >> > +   return PCI_ERS_RESULT_RECOVERED;
> >> > +}
> >>
> >> I don't think this function is OK.
> >> 1) You don't really reset the 2nd Bus but just checking its status.
> >> I think you should have following steps to reset 2nd Bus:
> >>
> >> a. Assert 2nd Bus Reset
> >> b. wait for some time until this message has been broadcasted well
> >> c. De-assert 2nd Bus Reset
> >> d. wait for Trst time
> >>
> >> IOW, since we have aer_do_secondary_bus_reset to perform secondary bus
> >> reset, why you repeat it again?
> >>
> >> 2) msleep(200) is too long for kernel. You'd better yield the CPU when
> >> sleep.
> >
> > The firmware first path currently has no reset_link. I want to make a
> > minimal change since resetting the link could be considered firmware's
> > job in the firmware first path. This change is to just check if SBR is
> > set, and bring the link out of reset only if it is in SBR.  This way, if
> > a another firmware first platform is already resetting the link, it
> > won't be done twice. I took the msleep and the code from
> > aer_do_sercondary_bus_reset() as you noticed.
> 
> I understand the desire to make a minimal change, and we certainly
> don't want to make changes bigger than they need to be.
> 
> However, my goal is really to have clean, maintainable code at the
> end.  If that means bigger patches where you can't test every affected
> platform, that's a risk I'm willing to take.
> 
> In this particular case, we could try to limit the change to just the
> platforms you can test, as you've done.  But it looks to me like we
> actually want to reset the link in *all* cases, regardless of whether
> firmware-first is involved.  We're handling a fatal error for a
> device.  That device might be below a PCIe switch (where we currently
> reset the link), or it might be directly below a root port (where we
> currently don't).  Why should those cases be different?  I don't think
> the device or the driver should be able to tell the difference.
> 
> My *guess* is that this is an oversight in the original code, and that
> doing the link reset for both downstream ports and root ports will
> make error handling for devices below root ports work better.
> 
> The *last* thing I want is code littered with special cases that are
> only there because "they only fix the platforms I could actually
> test."  It's just about impossible to clean those up later.
> 
> Bjorn

That makes sense to me. I will work on a new version of the patch and
test it today. 

-Betty

> >> > +
> >> >  static int find_aer_service_iter(struct device *device, void *data)
> >> >  {
> >> >     struct pcie_port_service_driver *service_driver, **drv;
> >> > @@ -460,6 +493,9 @@ static pci_ers_result_t reset_link(struct pci_dev *dev)
> >> >             status = driver->reset_link(udev);
> >> >     } else if (pci_pcie_type(udev) == PCI_EXP_TYPE_DOWNSTREAM) {
> >> >             status = default_downstream_reset_link(udev);
> >> > +   } else if (pci_pcie_type(udev) == PCI_EXP_TYPE_ROOT_PORT &&
> >> > +           pcie_aer_get_firmware_first(udev)) {
> >> > +           status = default_ff_root_port_reset_link(udev);
> >> >     } else {
> >> >             dev_printk(KERN_DEBUG, &dev->dev,
> >> >                     "no link-reset support at upstream device %s\n",
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> > the body of a message to majordomo@vger.kernel.org
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> > Please read the FAQ at  http://www.tux.org/lkml/
> >
> >



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse()
  2013-06-05  2:48     ` Chen Gong
@ 2013-06-05 13:38       ` Bjorn Helgaas
  0 siblings, 0 replies; 21+ messages in thread
From: Bjorn Helgaas @ 2013-06-05 13:38 UTC (permalink / raw)
  To: Bjorn Helgaas, Betty Dall, Rafael J. Wysocki, Huang Ying,
	linux-acpi, linux-kernel, linux-pci

On Tue, Jun 4, 2013 at 8:48 PM, Chen Gong <gong.chen@linux.intel.com> wrote:
> On Tue, Jun 04, 2013 at 07:13:24AM -0600, Bjorn Helgaas wrote:
>> Date: Tue, 4 Jun 2013 07:13:24 -0600
>> From: Bjorn Helgaas <bhelgaas@google.com>
>> To: Betty Dall <betty.dall@hp.com>
>> Cc: rjw@sisk.pl, ying.huang@intel.com, linux-acpi@vger.kernel.org,
>>  linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
>>  gong.chen@linux.intel.com
>> Subject: Re: [PATCH v2 1/3] PCI/AER: Fix incorrect return from
>>  aer_hest_parse()
>> User-Agent: Mutt/1.5.21 (2010-09-15)
>>
>> On Thu, May 30, 2013 at 08:39:27AM -0600, Betty Dall wrote:
>> > The function aer_hest_parse() is called to determine if the given
>> > PCI device is firmware first or not. The code loops through each
>> > section of the HEST table to look for a match. The bug is that
>> > the function always returns whether the last HEST section is firmware
>> > first. The fix stops the iteration once the info.firmware_first
>> > variable is set.  This is similar to how the function aer_hest_parse_aff()
>> > stops the iteration.
>> >
>> > Signed-off-by: Betty Dall <betty.dall@hp.com>
>> > ---
>> >
>> >  drivers/pci/pcie/aer/aerdrv_acpi.c |    3 +++
>> >  1 files changed, 3 insertions(+), 0 deletions(-)
>> >
>> >
>> > diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
>> > index 5194a7d..39b8671 100644
>> > --- a/drivers/pci/pcie/aer/aerdrv_acpi.c
>> > +++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
>> > @@ -42,6 +42,9 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
>> >     u8 bridge = 0;
>> >     int ff = 0;
>> >
>> > +   if (info->firmware_first)
>> > +           return 0;
>> > +
>> >     switch (hest_hdr->type) {
>> >     case ACPI_HEST_TYPE_AER_ROOT_PORT:
>> >             pcie_type = PCI_EXP_TYPE_ROOT_PORT;
>>
>>
>> 1) I think dev->__aer_firmware_first should be initialized somewhere in the
>> pci_scan_single_device() path, e.g., maybe pci_init_capabilities().  It's
>> known at device add-time and never changes, so there's no point in doing
>> the lazy setup we do now.  That would let us get rid of
>> __aer_firmware_first_valid, too (along with the pointless "__" prefix).
>> This is just an observation, not a requirement for this patch set.
>>
>> 2) This is a band-aid that covers up the real problem, which is that we
>> update info->firmware_first even for non-matching devices.  I think we
>> should do something like the following instead:
>>
>>
>> commit c67612f272f1792a08f012f1b5ca37d5cfde5de4
>> Author: Bjorn Helgaas <bhelgaas@google.com>
>> Date:   Mon Jun 3 16:49:12 2013 -0600
>>
>>     PCI/AER: Don't parse HEST table for non-PCIe devices
>>
>>     AER is a PCIe-only capability, so there's no point in trying to match
>>     a HEST PCIe structure with a non-PCIe device.
>>
>>     Previously, a HEST global AER bridge entry (type 8) could incorrectly
>>     match *any* bridge, even a legacy PCI-PCI bridge, and a non-global
>>     HEST entry could match a legacy PCI device.
>>
> If so, it should be a BIOS bug, right? (BIOS should not contain PCI device
> for AER structure).

There are two cases here:

1) HEST contains AER_BRIDGE structure with "global" bit set.  The
current code will set firmware_first for any P2P bridge, even non-PCIe
ones.  This is a Linux bug, not a BIOS problem.

2) HEST contains a non-global entry with bus/dev/fn matching a
non-PCIe device.  This is indeed a BIOS bug.

> And I agree that your patch tries to avoid this scenario, but even one PCI
> device is set FFM enabled, as you mentioned above, AER is only for PCIe,
> which means a PCI device will not trigger such an event, so it should be
> harmless, in principle. But in practice, I totally agree with you that
> should be fixed.

I think I agree.  Setting firmware_first for a non-PCIe device should
be harmless because we only use firmware_first in
pci_enable_pcie_error_reporting().  There, we return an error if
firmware_first is set.  We also return an error if the device doesn't
have the AER capability.  So setting firmware_first on a PCI
(non-PCIe) device just means we'll return the error a little sooner.

Thanks for reviewing all this stuff!

Bjorn

> Reviewed-by: Chen Gong <gong.chen@linux.intel.com>
>
>>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>>
>> diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
>> index 5194a7d..4f798ab 100644
>> --- a/drivers/pci/pcie/aer/aerdrv_acpi.c
>> +++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
>> @@ -59,8 +59,7 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
>>
>>       p = (struct acpi_hest_aer_common *)(hest_hdr + 1);
>>       if (p->flags & ACPI_HEST_GLOBAL) {
>> -             if ((pci_is_pcie(info->pci_dev) &&
>> -                  pci_pcie_type(info->pci_dev) == pcie_type) || bridge)
>> +             if ((pci_pcie_type(info->pci_dev) == pcie_type) || bridge)
>>                       ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
>>       } else
>>               if (hest_match_pci(p, info->pci_dev))
>> @@ -89,6 +88,9 @@ static void aer_set_firmware_first(struct pci_dev *pci_dev)
>>
>>  int pcie_aer_get_firmware_first(struct pci_dev *dev)
>>  {
>> +     if (!pci_is_pcie(dev))
>> +             return 0;
>> +
>>       if (!dev->__aer_firmware_first_valid)
>>               aer_set_firmware_first(dev);
>>       return dev->__aer_firmware_first;
>>
>> commit 947da50270686b0d70f4bc2f7323ef7229489ecb
>> Author: Bjorn Helgaas <bhelgaas@google.com>
>> Date:   Mon Jun 3 15:42:00 2013 -0600
>>
>>     PCI/AER: Factor out HEST device type matching
>>
>>     This factors out the matching of HEST structure type and PCIe device type
>>     to improve readability.  No functional change.
>>
>>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>>
>
> Reviewed-by: Chen Gong <gong.chen@linux.intel.com>
>
>> diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
>> index 4f798ab..56e2d94 100644
>> --- a/drivers/pci/pcie/aer/aerdrv_acpi.c
>> +++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
>> @@ -29,6 +29,22 @@ static inline int hest_match_pci(struct acpi_hest_aer_common *p,
>>                p->function == PCI_FUNC(pci->devfn));
>>  }
>>
>> +static inline bool hest_match_type(struct acpi_hest_header *hest_hdr,
>> +                                struct pci_dev *dev)
>> +{
>> +     u16 hest_type = hest_hdr->type;
>> +     u8 pcie_type = pci_pcie_type(dev);
>> +
>> +     if ((hest_type == ACPI_HEST_TYPE_AER_ROOT_PORT &&
>> +          pcie_type == PCI_EXP_TYPE_ROOT_PORT) ||
>> +         (hest_type == ACPI_HEST_TYPE_AER_ENDPOINT &&
>> +          pcie_type == PCI_EXP_TYPE_ENDPOINT) ||
>> +         (hest_type == ACPI_HEST_TYPE_AER_BRIDGE &&
>> +          (dev->class >> 16) == PCI_BASE_CLASS_BRIDGE))
>> +             return true;
>> +     return false;
>> +}
>> +
>>  struct aer_hest_parse_info {
>>       struct pci_dev *pci_dev;
>>       int firmware_first;
>> @@ -38,28 +54,11 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
>>  {
>>       struct aer_hest_parse_info *info = data;
>>       struct acpi_hest_aer_common *p;
>> -     u8 pcie_type = 0;
>> -     u8 bridge = 0;
>>       int ff = 0;
>>
>> -     switch (hest_hdr->type) {
>> -     case ACPI_HEST_TYPE_AER_ROOT_PORT:
>> -             pcie_type = PCI_EXP_TYPE_ROOT_PORT;
>> -             break;
>> -     case ACPI_HEST_TYPE_AER_ENDPOINT:
>> -             pcie_type = PCI_EXP_TYPE_ENDPOINT;
>> -             break;
>> -     case ACPI_HEST_TYPE_AER_BRIDGE:
>> -             if ((info->pci_dev->class >> 16) == PCI_BASE_CLASS_BRIDGE)
>> -                     bridge = 1;
>> -             break;
>> -     default:
>> -             return 0;
>> -     }
>> -
>>       p = (struct acpi_hest_aer_common *)(hest_hdr + 1);
>>       if (p->flags & ACPI_HEST_GLOBAL) {
>> -             if ((pci_pcie_type(info->pci_dev) == pcie_type) || bridge)
>> +             if (hest_match_type(hest_hdr, info->pci_dev))
>>                       ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
>>       } else
>>               if (hest_match_pci(p, info->pci_dev))
>>
>> commit e9f977a04d96a54c4f6aa0b831e725dab2154364
>> Author: Bjorn Helgaas <bhelgaas@google.com>
>> Date:   Mon Jun 3 19:47:27 2013 -0600
>>
>>     PCI/AER: Set dev->__aer_firmware_first only for matching devices
>>
>>     Previously, we always updated info->firmware_first, even for HEST entries
>>     that didn't match the device.  Therefore, if the last HEST descriptor was
>>     a PCIe structure that didn't match the device, we always cleared
>>     dev->__aer_firmware_first.
>>
>>     Based-on-patch-by: Betty Dall <betty.dall@hp.com>
>>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>
> Reviewed-by: Chen Gong <gong.chen@linux.intel.com>
>
>> diff --git a/drivers/pci/pcie/aer/aerdrv_acpi.c b/drivers/pci/pcie/aer/aerdrv_acpi.c
>> index 56e2d94..2bedad8 100644
>> --- a/drivers/pci/pcie/aer/aerdrv_acpi.c
>> +++ b/drivers/pci/pcie/aer/aerdrv_acpi.c
>> @@ -54,16 +54,16 @@ static int aer_hest_parse(struct acpi_hest_header *hest_hdr, void *data)
>>  {
>>       struct aer_hest_parse_info *info = data;
>>       struct acpi_hest_aer_common *p;
>> -     int ff = 0;
>> +     int ff;
>>
>>       p = (struct acpi_hest_aer_common *)(hest_hdr + 1);
>> -     if (p->flags & ACPI_HEST_GLOBAL) {
>> +     ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
>> +     if (p->flags & ACPI_HEST_GLOBAL)
>>               if (hest_match_type(hest_hdr, info->pci_dev))
>> -                     ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
>> -     } else
>> +                     info->firmware_first = ff;
>> +     else
>>               if (hest_match_pci(p, info->pci_dev))
>> -                     ff = !!(p->flags & ACPI_HEST_FIRMWARE_FIRST);
>> -     info->firmware_first = ff;
>> +                     info->firmware_first = ff;
>>
>>       return 0;
>>  }

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2013-06-05 13:38 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-30 14:39 [PATCH v2 0/3] PCI/ACPI: Fix firmware first error recovery with root port in reset Betty Dall
2013-05-30 14:39 ` [PATCH v2 1/3] PCI/AER: Fix incorrect return from aer_hest_parse() Betty Dall
2013-06-02  0:38   ` Bjorn Helgaas
2013-06-03 21:18     ` Betty Dall
2013-06-04 17:39       ` Bjorn Helgaas
2013-06-04  7:42   ` Chen Gong
2013-06-04 13:13   ` Bjorn Helgaas
2013-06-05  2:48     ` Chen Gong
2013-06-05 13:38       ` Bjorn Helgaas
2013-05-30 14:39 ` [PATCH v2 2/3] ACPI/APEI: Force fatal AER severity when bus has been reset Betty Dall
2013-06-04  7:53   ` Chen Gong
2013-06-04 16:20     ` Betty Dall
2013-06-04 17:54     ` Bjorn Helgaas
2013-06-05  1:56       ` Chen Gong
2013-05-30 14:39 ` [PATCH v2 3/3] PCI/AER: Provide reset_link for firmware first root port Betty Dall
2013-06-04  8:36   ` Chen Gong
2013-06-04 18:31     ` Bjorn Helgaas
2013-06-04 21:38     ` Betty Dall
2013-06-04 22:15       ` Bjorn Helgaas
2013-06-05  1:56         ` Chen Gong
2013-06-05 13:22         ` Betty Dall

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).