linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
@ 2018-07-17 15:31 Alexandru Gagniuc
  2018-07-17 15:41 ` Sinan Kaya
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Alexandru Gagniuc @ 2018-07-17 15:31 UTC (permalink / raw)
  To: bhelgaas, keith.busch
  Cc: alex_gagniuc, austin_bolen, shyam_iyer, Alexandru Gagniuc,
	Frederick Lawler, Oza Pawandeep, linux-pci, linux-kernel

When we don't own AER, we shouldn't touch the AER error bits. This
happens unconditionally on device probe(). Clearing AER bits
willy-nilly might cause firmware to miss errors. Instead
these bits should get cleared by FFS, or via ACPI _HPX method.

This race is mostly of theoretical significance, as it is not easy to
reasonably demonstrate it in testing.

Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
 drivers/pci/pcie/aer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..18037a2a8231 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -383,6 +383,9 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
 	if (!pci_is_pcie(dev))
 		return -ENODEV;
 
+	if (pcie_aer_get_firmware_first(dev))
+		return -EIO;
+
 	pos = dev->aer_cap;
 	if (!pos)
 		return -EIO;
-- 
2.14.3


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
  2018-07-17 15:31 [PATCH] PCI/AER: Do not clear AER bits if we don't own AER Alexandru Gagniuc
@ 2018-07-17 15:41 ` Sinan Kaya
  2018-07-19 15:55   ` Alex G.
  2018-07-23 16:52 ` [PATCH v2] " Alexandru Gagniuc
  2018-08-09 14:15 ` [PATCH] " Bjorn Helgaas
  2 siblings, 1 reply; 18+ messages in thread
From: Sinan Kaya @ 2018-07-17 15:41 UTC (permalink / raw)
  To: Alexandru Gagniuc, bhelgaas, keith.busch
  Cc: alex_gagniuc, austin_bolen, shyam_iyer, Frederick Lawler,
	Oza Pawandeep, linux-pci, linux-kernel


On 7/17/2018 8:31 AM, Alexandru Gagniuc wrote:
> +	if (pcie_aer_get_firmware_first(dev))
> +		return -EIO;

Can you move this to closer to the caller pci_aer_init()?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
  2018-07-17 15:41 ` Sinan Kaya
@ 2018-07-19 15:55   ` Alex G.
  2018-07-19 16:58     ` Sinan Kaya
  0 siblings, 1 reply; 18+ messages in thread
From: Alex G. @ 2018-07-19 15:55 UTC (permalink / raw)
  To: Sinan Kaya, bhelgaas, keith.busch
  Cc: alex_gagniuc, austin_bolen, shyam_iyer, Frederick Lawler,
	Oza Pawandeep, linux-pci, linux-kernel



On 07/17/2018 10:41 AM, Sinan Kaya wrote:
> 
> On 7/17/2018 8:31 AM, Alexandru Gagniuc wrote:
>> +    if (pcie_aer_get_firmware_first(dev))
>> +        return -EIO;
> 
> Can you move this to closer to the caller pci_aer_init()?

I could move it there. although pci_cleanup_aer_error_status_regs() is 
called directly from pci_restore_state.  Of course, aer_cap should be 
zero in this case, and we'd still bail out.
I find the intent clearer if we check it here rather than having to do 
the mental parsing of the state of aer_cap.

Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
  2018-07-19 15:55   ` Alex G.
@ 2018-07-19 16:58     ` Sinan Kaya
  2018-07-19 19:56       ` Alex G.
  0 siblings, 1 reply; 18+ messages in thread
From: Sinan Kaya @ 2018-07-19 16:58 UTC (permalink / raw)
  To: Alex G., bhelgaas, keith.busch
  Cc: alex_gagniuc, austin_bolen, shyam_iyer, Frederick Lawler,
	Oza Pawandeep, linux-pci, linux-kernel


On 7/19/2018 8:55 AM, Alex G. wrote:
> I find the intent clearer if we check it here rather than having to do 
> the mental parsing of the state of aer_cap.

I don't feel too strong about my comment to be honest. This was a
style/maintenance comment.

It feels like we are putting pcie_aer_get_firmware_first() into core
functions unnecessarily after your change. I understand the need for
your change. I'm asking if it is the right place or not.

pcie_aer_get_firmware_first() should be called from either the init or
probe function so that the rest of the AER functions do not get called
from any other context.

If someone adds another AER function, we might need to add another
pcie_aer_get_firmware_first() check there. So, we have unnecessary code
duplication.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
  2018-07-19 16:58     ` Sinan Kaya
@ 2018-07-19 19:56       ` Alex G.
  0 siblings, 0 replies; 18+ messages in thread
From: Alex G. @ 2018-07-19 19:56 UTC (permalink / raw)
  To: Sinan Kaya, bhelgaas, keith.busch
  Cc: alex_gagniuc, austin_bolen, shyam_iyer, Frederick Lawler,
	Oza Pawandeep, linux-pci, linux-kernel



On 07/19/2018 11:58 AM, Sinan Kaya wrote:
> 
> On 7/19/2018 8:55 AM, Alex G. wrote:
>> I find the intent clearer if we check it here rather than having to do 
>> the mental parsing of the state of aer_cap.
> 
> I don't feel too strong about my comment to be honest. This was a
> style/maintenance comment.
> 
> It feels like we are putting pcie_aer_get_firmware_first() into core
> functions unnecessarily after your change. I understand the need for
> your change. I'm asking if it is the right place or not.
> 
> pcie_aer_get_firmware_first() should be called from either the init or
> probe function so that the rest of the AER functions do not get called
> from any other context.
> 
> If someone adds another AER function, we might need to add another
> pcie_aer_get_firmware_first() check there. So, we have unnecessary code
> duplication.

We could move the aer_cap and get_ffs() check into one function that we 
end up calling all over the place. I understand your concern about code 
duplication, and I agree with it. I don't think that at this point it's 
that big of a deal, although we might need to guard every aer_() call.

So moving all the checks in a pcie_aer_is_kernel_first() makes sense.

Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2] PCI/AER: Do not clear AER bits if we don't own AER
  2018-07-17 15:31 [PATCH] PCI/AER: Do not clear AER bits if we don't own AER Alexandru Gagniuc
  2018-07-17 15:41 ` Sinan Kaya
@ 2018-07-23 16:52 ` Alexandru Gagniuc
  2018-07-24 15:59   ` Alex G.
                     ` (2 more replies)
  2018-08-09 14:15 ` [PATCH] " Bjorn Helgaas
  2 siblings, 3 replies; 18+ messages in thread
From: Alexandru Gagniuc @ 2018-07-23 16:52 UTC (permalink / raw)
  To: linux-pci, bhelgaas
  Cc: keith.busch, alex_gagniuc, austin_bolen, shyam_iyer,
	Alexandru Gagniuc, Frederick Lawler, Greg Kroah-Hartman,
	Oza Pawandeep, linux-kernel

When we don't own AER, we shouldn't touch the AER error bits. Clearing
error bits willy-nilly might cause firmware to miss some errors. In
theory, these bits get cleared by FFS, or via ACPI _HPX method. These
mechanisms are not subject to the problem.

This race is mostly of theoretical significance, since I can't
reasonably demonstrate this race in the lab.

On a side-note, pcie_aer_is_kernel_first() is created to alleviate the
need for two checks: aer_cap and get_firmware_first().

Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
 drivers/pci/pcie/aer.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..85c3e173c025 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -307,6 +307,12 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev)
 		aer_set_firmware_first(dev);
 	return dev->__aer_firmware_first;
 }
+
+static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
+{
+	return !!dev->aer_cap && !pcie_aer_get_firmware_first(dev);
+}
+
 #define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
 				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
 
@@ -337,10 +343,7 @@ bool aer_acpi_firmware_first(void)
 
 int pci_enable_pcie_error_reporting(struct pci_dev *dev)
 {
-	if (pcie_aer_get_firmware_first(dev))
-		return -EIO;
-
-	if (!dev->aer_cap)
+	if (!pcie_aer_is_kernel_first(dev))
 		return -EIO;
 
 	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
@@ -349,7 +352,7 @@ EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
 
 int pci_disable_pcie_error_reporting(struct pci_dev *dev)
 {
-	if (pcie_aer_get_firmware_first(dev))
+	if (!pcie_aer_is_kernel_first(dev))
 		return -EIO;
 
 	return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
@@ -383,10 +386,10 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
 	if (!pci_is_pcie(dev))
 		return -ENODEV;
 
-	pos = dev->aer_cap;
-	if (!pos)
+	if (pcie_aer_is_kernel_first(dev))
 		return -EIO;
 
+	pos = dev->aer_cap;
 	port_type = pci_pcie_type(dev);
 	if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
 		pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] PCI/AER: Do not clear AER bits if we don't own AER
  2018-07-23 16:52 ` [PATCH v2] " Alexandru Gagniuc
@ 2018-07-24 15:59   ` Alex G.
  2018-07-30 23:35     ` [PATCH v3] " Alexandru Gagniuc
  2018-07-24 17:08   ` [PATCH v2] " kbuild test robot
  2018-07-25  1:03   ` kbuild test robot
  2 siblings, 1 reply; 18+ messages in thread
From: Alex G. @ 2018-07-24 15:59 UTC (permalink / raw)
  To: linux-pci, bhelgaas
  Cc: keith.busch, alex_gagniuc, austin_bolen, shyam_iyer,
	Frederick Lawler, Greg Kroah-Hartman, Oza Pawandeep,
	linux-kernel



On 07/23/2018 11:52 AM, Alexandru Gagniuc wrote:
> When we don't own AER, we shouldn't touch the AER error bits. Clearing
> error bits willy-nilly might cause firmware to miss some errors. In
> theory, these bits get cleared by FFS, or via ACPI _HPX method. These
> mechanisms are not subject to the problem.
> 
> This race is mostly of theoretical significance, since I can't
> reasonably demonstrate this race in the lab.
> 
> On a side-note, pcie_aer_is_kernel_first() is created to alleviate the
> need for two checks: aer_cap and get_firmware_first().
> 
> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> ---
>   drivers/pci/pcie/aer.c | 17 ++++++++++-------
>   1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index a2e88386af28..85c3e173c025 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -307,6 +307,12 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev)
>   		aer_set_firmware_first(dev);
>   	return dev->__aer_firmware_first;
>   }
> +
> +static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
> +{
> +	return !!dev->aer_cap && !pcie_aer_get_firmware_first(dev);
> +}
> +
>   #define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
>   				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
>   
> @@ -337,10 +343,7 @@ bool aer_acpi_firmware_first(void)
>   
>   int pci_enable_pcie_error_reporting(struct pci_dev *dev)
>   {
> -	if (pcie_aer_get_firmware_first(dev))
> -		return -EIO;
> -
> -	if (!dev->aer_cap)
> +	if (!pcie_aer_is_kernel_first(dev))
>   		return -EIO;
>   
>   	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
> @@ -349,7 +352,7 @@ EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
>   
>   int pci_disable_pcie_error_reporting(struct pci_dev *dev)
>   {
> -	if (pcie_aer_get_firmware_first(dev))
> +	if (!pcie_aer_is_kernel_first(dev))
>   		return -EIO;
>   
>   	return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
> @@ -383,10 +386,10 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
>   	if (!pci_is_pcie(dev))
>   		return -ENODEV;
>   
> -	pos = dev->aer_cap;
> -	if (!pos)
> +	if (pcie_aer_is_kernel_first(dev))

This here is missing a '!'. It's in my local branch, so I must have 
exported the patch before I fixed that. I'll get that fixed next rev.

>   		return -EIO;
>   
> +	pos = dev->aer_cap;
>   	port_type = pci_pcie_type(dev);
>   	if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
>   		pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] PCI/AER: Do not clear AER bits if we don't own AER
  2018-07-23 16:52 ` [PATCH v2] " Alexandru Gagniuc
  2018-07-24 15:59   ` Alex G.
@ 2018-07-24 17:08   ` kbuild test robot
  2018-07-25  1:03   ` kbuild test robot
  2 siblings, 0 replies; 18+ messages in thread
From: kbuild test robot @ 2018-07-24 17:08 UTC (permalink / raw)
  To: Alexandru Gagniuc
  Cc: kbuild-all, linux-pci, bhelgaas, keith.busch, alex_gagniuc,
	austin_bolen, shyam_iyer, Alexandru Gagniuc, Frederick Lawler,
	Greg Kroah-Hartman, Oza Pawandeep, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1767 bytes --]

Hi Alexandru,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on pci/next]
[also build test ERROR on v4.18-rc6 next-20180724]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Alexandru-Gagniuc/PCI-AER-Do-not-clear-AER-bits-if-we-don-t-own-AER/20180724-235320
base:   https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
config: x86_64-randconfig-x008-201829 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/pci/pcie/aer.c: In function 'pci_enable_pcie_error_reporting':
>> drivers/pci/pcie/aer.c:371:7: error: implicit declaration of function 'pcie_aer_is_kernel_first'; did you mean 'pcie_aer_get_firmware_first'? [-Werror=implicit-function-declaration]
     if (!pcie_aer_is_kernel_first(dev))
          ^~~~~~~~~~~~~~~~~~~~~~~~
          pcie_aer_get_firmware_first
   cc1: some warnings being treated as errors

vim +371 drivers/pci/pcie/aer.c

   365	
   366	#define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
   367					 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
   368	
   369	int pci_enable_pcie_error_reporting(struct pci_dev *dev)
   370	{
 > 371		if (!pcie_aer_is_kernel_first(dev))
   372			return -EIO;
   373	
   374		return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
   375	}
   376	EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
   377	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27463 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2] PCI/AER: Do not clear AER bits if we don't own AER
  2018-07-23 16:52 ` [PATCH v2] " Alexandru Gagniuc
  2018-07-24 15:59   ` Alex G.
  2018-07-24 17:08   ` [PATCH v2] " kbuild test robot
@ 2018-07-25  1:03   ` kbuild test robot
  2 siblings, 0 replies; 18+ messages in thread
From: kbuild test robot @ 2018-07-25  1:03 UTC (permalink / raw)
  To: Alexandru Gagniuc
  Cc: kbuild-all, linux-pci, bhelgaas, keith.busch, alex_gagniuc,
	austin_bolen, shyam_iyer, Alexandru Gagniuc, Frederick Lawler,
	Greg Kroah-Hartman, Oza Pawandeep, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1719 bytes --]

Hi Alexandru,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on pci/next]
[also build test ERROR on v4.18-rc6 next-20180724]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Alexandru-Gagniuc/PCI-AER-Do-not-clear-AER-bits-if-we-don-t-own-AER/20180724-235320
base:   https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
config: x86_64-randconfig-s1-07250001 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/pci/pcie/aer.c: In function 'pci_enable_pcie_error_reporting':
>> drivers/pci/pcie/aer.c:371:7: error: implicit declaration of function 'pcie_aer_is_kernel_first' [-Werror=implicit-function-declaration]
     if (!pcie_aer_is_kernel_first(dev))
          ^~~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/pcie_aer_is_kernel_first +371 drivers/pci/pcie/aer.c

   365	
   366	#define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
   367					 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
   368	
   369	int pci_enable_pcie_error_reporting(struct pci_dev *dev)
   370	{
 > 371		if (!pcie_aer_is_kernel_first(dev))
   372			return -EIO;
   373	
   374		return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
   375	}
   376	EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
   377	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31241 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3] PCI/AER: Do not clear AER bits if we don't own AER
  2018-07-24 15:59   ` Alex G.
@ 2018-07-30 23:35     ` Alexandru Gagniuc
  2018-08-08  1:14       ` Bjorn Helgaas
  0 siblings, 1 reply; 18+ messages in thread
From: Alexandru Gagniuc @ 2018-07-30 23:35 UTC (permalink / raw)
  To: linux-pci, bhelgaas
  Cc: keith.busch, alex_gagniuc, austin_bolen, shyam_iyer,
	Alexandru Gagniuc, Frederick Lawler, Oza Pawandeep, linux-kernel

When we don't own AER, we shouldn't touch the AER error bits. Clearing
error bits willy-nilly might cause firmware to miss some errors. In
theory, these bits get cleared by FFS, or via ACPI _HPX method. These
mechanisms are not subject to the problem.

This race is mostly of theoretical significance, since I can't
reasonably demonstrate this race in the lab.

On a side-note, pcie_aer_is_kernel_first() is created to alleviate the
need for two checks: aer_cap and get_firmware_first().

Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---

Changes since v2:
  - Added missing negation in pci_cleanup_aer_error_status_regs()

 drivers/pci/pcie/aer.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2e88386af28..40e5c86271d1 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -307,6 +307,12 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev)
 		aer_set_firmware_first(dev);
 	return dev->__aer_firmware_first;
 }
+
+static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
+{
+	return !!dev->aer_cap && !pcie_aer_get_firmware_first(dev);
+}
+
 #define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
 				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
 
@@ -337,10 +343,7 @@ bool aer_acpi_firmware_first(void)
 
 int pci_enable_pcie_error_reporting(struct pci_dev *dev)
 {
-	if (pcie_aer_get_firmware_first(dev))
-		return -EIO;
-
-	if (!dev->aer_cap)
+	if (!pcie_aer_is_kernel_first(dev))
 		return -EIO;
 
 	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
@@ -349,7 +352,7 @@ EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
 
 int pci_disable_pcie_error_reporting(struct pci_dev *dev)
 {
-	if (pcie_aer_get_firmware_first(dev))
+	if (!pcie_aer_is_kernel_first(dev))
 		return -EIO;
 
 	return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
@@ -383,10 +386,10 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
 	if (!pci_is_pcie(dev))
 		return -ENODEV;
 
-	pos = dev->aer_cap;
-	if (!pos)
+	if (!pcie_aer_is_kernel_first(dev))
 		return -EIO;
 
+	pos = dev->aer_cap;
 	port_type = pci_pcie_type(dev);
 	if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
 		pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] PCI/AER: Do not clear AER bits if we don't own AER
  2018-07-30 23:35     ` [PATCH v3] " Alexandru Gagniuc
@ 2018-08-08  1:14       ` Bjorn Helgaas
  2018-08-08  3:46         ` Alex G.
  0 siblings, 1 reply; 18+ messages in thread
From: Bjorn Helgaas @ 2018-08-08  1:14 UTC (permalink / raw)
  To: Alexandru Gagniuc
  Cc: linux-pci, bhelgaas, keith.busch, alex_gagniuc, austin_bolen,
	shyam_iyer, Frederick Lawler, Oza Pawandeep, linux-kernel

On Mon, Jul 30, 2018 at 06:35:31PM -0500, Alexandru Gagniuc wrote:
> When we don't own AER, we shouldn't touch the AER error bits. Clearing
> error bits willy-nilly might cause firmware to miss some errors. In
> theory, these bits get cleared by FFS, or via ACPI _HPX method. These
> mechanisms are not subject to the problem.

What's FFS?

I guess you mean FFS and _HPX are not subject to the problem because
they're supplied by firmware, so firmware would be responsible for
looking at the bits before clearing them?

> This race is mostly of theoretical significance, since I can't
> reasonably demonstrate this race in the lab.
> 
> On a side-note, pcie_aer_is_kernel_first() is created to alleviate the
> need for two checks: aer_cap and get_firmware_first().
> 
> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> ---
> 
> Changes since v2:
>   - Added missing negation in pci_cleanup_aer_error_status_regs()
> 
>  drivers/pci/pcie/aer.c | 17 ++++++++++-------
>  1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index a2e88386af28..40e5c86271d1 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -307,6 +307,12 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev)
>  		aer_set_firmware_first(dev);
>  	return dev->__aer_firmware_first;
>  }
> +
> +static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
> +{
> +	return !!dev->aer_cap && !pcie_aer_get_firmware_first(dev);
> +}

I think it complicates things to have both "firmware_first" and
"kernel_first" interfaces, so I would prefer to stick with the
existing "firmware_first" style.

>  #define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
>  				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
>  
> @@ -337,10 +343,7 @@ bool aer_acpi_firmware_first(void)
>  
>  int pci_enable_pcie_error_reporting(struct pci_dev *dev)
>  {
> -	if (pcie_aer_get_firmware_first(dev))
> -		return -EIO;
> -
> -	if (!dev->aer_cap)
> +	if (!pcie_aer_is_kernel_first(dev))
>  		return -EIO;
>  
>  	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);

This change doesn't actually fix anything, does it?  It looks like a
cleanup that doesn't change the behavior.

> @@ -349,7 +352,7 @@ EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
>  
>  int pci_disable_pcie_error_reporting(struct pci_dev *dev)
>  {
> -	if (pcie_aer_get_firmware_first(dev))
> +	if (!pcie_aer_is_kernel_first(dev))
>  		return -EIO;

This change does effectively add a test for dev->aer_cap.  That makes
sense in terms of symmetry with pci_enable_pcie_error_reporting(),
but I think it should be a separate patch because it's conceptually
separate from the change below.

We should keep the existing behavior (but add the symmetry) here for
now, but it's not clear to me that these paths should care about AER
or firmware-first at all.  PCI_EXP_DEVCTL is not an AER register and
we have the _HPX mechanism for firmware to influence it (which these
paths currently ignore).  I suspect we should program these reporting
enable bits in the core enumeration path instead of having drivers
call these interfaces.

If/when we make changes along these lines, the history will be easier
to follow if *this* change is not connected with the change below to
pci_cleanup_aer_error_status_regs().

>  	return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
> @@ -383,10 +386,10 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
>  	if (!pci_is_pcie(dev))
>  		return -ENODEV;
>  
> -	pos = dev->aer_cap;
> -	if (!pos)
> +	if (!pcie_aer_is_kernel_first(dev))
>  		return -EIO;

This part makes sense to me, but I think I would rather have it match
the existing style in pci_enable_pcie_error_reporting(), i.e., keep
the test for dev->aer_cap and add a test for
pcie_aer_get_firmware_first().

> +	pos = dev->aer_cap;
>  	port_type = pci_pcie_type(dev);
>  	if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
>  		pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] PCI/AER: Do not clear AER bits if we don't own AER
  2018-08-08  1:14       ` Bjorn Helgaas
@ 2018-08-08  3:46         ` Alex G.
  0 siblings, 0 replies; 18+ messages in thread
From: Alex G. @ 2018-08-08  3:46 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, bhelgaas, keith.busch, alex_gagniuc, austin_bolen,
	shyam_iyer, Frederick Lawler, Oza Pawandeep, linux-kernel



On 08/07/2018 08:14 PM, Bjorn Helgaas wrote:
> On Mon, Jul 30, 2018 at 06:35:31PM -0500, Alexandru Gagniuc wrote:
>> When we don't own AER, we shouldn't touch the AER error bits. Clearing
>> error bits willy-nilly might cause firmware to miss some errors. In
>> theory, these bits get cleared by FFS, or via ACPI _HPX method. These
>> mechanisms are not subject to the problem.
> 
> What's FFS?

Firmware-first. Nobody likes spelling it out, and all other proposed 
acronyms are insanely tong-twisting. So, FFS.

> I guess you mean FFS and _HPX are not subject to the problem because
> they're supplied by firmware, so firmware would be responsible for
> looking at the bits before clearing them?

Exactly.

>> This race is mostly of theoretical significance, since I can't
>> reasonably demonstrate this race in the lab.
>>
>> On a side-note, pcie_aer_is_kernel_first() is created to alleviate the
>> need for two checks: aer_cap and get_firmware_first().
>>
>> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
>> ---
>>
>> Changes since v2:
>>    - Added missing negation in pci_cleanup_aer_error_status_regs()
>>
>>   drivers/pci/pcie/aer.c | 17 ++++++++++-------
>>   1 file changed, 10 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index a2e88386af28..40e5c86271d1 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -307,6 +307,12 @@ int pcie_aer_get_firmware_first(struct pci_dev *dev)
>>   		aer_set_firmware_first(dev);
>>   	return dev->__aer_firmware_first;
>>   }
>> +
>> +static bool pcie_aer_is_kernel_first(struct pci_dev *dev)
>> +{
>> +	return !!dev->aer_cap && !pcie_aer_get_firmware_first(dev);
>> +}
> 
> I think it complicates things to have both "firmware_first" and
> "kernel_first" interfaces, so I would prefer to stick with the
> existing "firmware_first" style.
> 
>>   #define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
>>   				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)
>>   
>> @@ -337,10 +343,7 @@ bool aer_acpi_firmware_first(void)
>>   
>>   int pci_enable_pcie_error_reporting(struct pci_dev *dev)
>>   {
>> -	if (pcie_aer_get_firmware_first(dev))
>> -		return -EIO;
>> -
>> -	if (!dev->aer_cap)
>> +	if (!pcie_aer_is_kernel_first(dev))
>>   		return -EIO;
>>   
>>   	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
> 
> This change doesn't actually fix anything, does it?  It looks like a
> cleanup that doesn't change the behavior.

Initially (v1), this was a one-liner, but someone had a complaint about 
having pcie_aer_get_firmware_first() boilerplate all over the place. 
That's why I added the "kernel_first" function (previous comment), and 
then updated this here for completeness. I'm also fine with v1.

>> @@ -349,7 +352,7 @@ EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
>>   
>>   int pci_disable_pcie_error_reporting(struct pci_dev *dev)
>>   {
>> -	if (pcie_aer_get_firmware_first(dev))
>> +	if (!pcie_aer_is_kernel_first(dev))
>>   		return -EIO;
> 
> This change does effectively add a test for dev->aer_cap.  That makes
> sense in terms of symmetry with pci_enable_pcie_error_reporting(),
> but I think it should be a separate patch because it's conceptually
> separate from the change below.
> 
> We should keep the existing behavior (but add the symmetry) here for
> now, but it's not clear to me that these paths should care about AER
> or firmware-first at all.  PCI_EXP_DEVCTL is not an AER register and
> we have the _HPX mechanism for firmware to influence it (which these
> paths currently ignore).  I suspect we should program these reporting
> enable bits in the core enumeration path instead of having drivers
> call these interfaces.

The headache is that FFS needs the reporting bit to stay enabled in 
order to get AER notifications. Disabling things here could really break 
firmware. Of course, that's a cyclical argument, since FW is broken by 
definition.

> If/when we make changes along these lines, the history will be easier
> to follow if *this* change is not connected with the change below to
> pci_cleanup_aer_error_status_regs().

I agree. I think it might be preferred then to go with v1, and leave the 
refactoring to a later time, since the extra changes are cosmetical and 
social.

>>   	return pcie_capability_clear_word(dev, PCI_EXP_DEVCTL,
>> @@ -383,10 +386,10 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
>>   	if (!pci_is_pcie(dev))
>>   		return -ENODEV;
>>   
>> -	pos = dev->aer_cap;
>> -	if (!pos)
>> +	if (!pcie_aer_is_kernel_first(dev))
>>   		return -EIO;
> 
> This part makes sense to me, but I think I would rather have it match
> the existing style in pci_enable_pcie_error_reporting(), i.e., keep
> the test for dev->aer_cap and add a test for
> pcie_aer_get_firmware_first().

Had it that way in v1.

Alex

>> +	pos = dev->aer_cap;
>>   	port_type = pci_pcie_type(dev);
>>   	if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
>>   		pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);
>> -- 
>> 2.17.1
>>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
  2018-07-17 15:31 [PATCH] PCI/AER: Do not clear AER bits if we don't own AER Alexandru Gagniuc
  2018-07-17 15:41 ` Sinan Kaya
  2018-07-23 16:52 ` [PATCH v2] " Alexandru Gagniuc
@ 2018-08-09 14:15 ` Bjorn Helgaas
  2018-08-09 16:46   ` Alex_Gagniuc
  2 siblings, 1 reply; 18+ messages in thread
From: Bjorn Helgaas @ 2018-08-09 14:15 UTC (permalink / raw)
  To: Alexandru Gagniuc
  Cc: bhelgaas, keith.busch, alex_gagniuc, austin_bolen, shyam_iyer,
	Frederick Lawler, Oza Pawandeep, linux-pci, linux-kernel

On Tue, Jul 17, 2018 at 10:31:23AM -0500, Alexandru Gagniuc wrote:
> When we don't own AER, we shouldn't touch the AER error bits. This
> happens unconditionally on device probe(). Clearing AER bits
> willy-nilly might cause firmware to miss errors. Instead
> these bits should get cleared by FFS, or via ACPI _HPX method.
> 
> This race is mostly of theoretical significance, as it is not easy to
> reasonably demonstrate it in testing.
> 
> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> ---
>  drivers/pci/pcie/aer.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index a2e88386af28..18037a2a8231 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -383,6 +383,9 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
>  	if (!pci_is_pcie(dev))
>  		return -ENODEV;
>  
> +	if (pcie_aer_get_firmware_first(dev))
> +		return -EIO;

I like this patch.

Do we need the same thing in the following places that also clear AER
status bits or write AER control bits?

  enable_ecrc_checking()
  disable_ecrc_checking()
  pci_cleanup_aer_uncorrect_error_status()
  pci_aer_clear_fatal_status()

>  	pos = dev->aer_cap;
>  	if (!pos)
>  		return -EIO;
> -- 
> 2.14.3
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
  2018-08-09 14:15 ` [PATCH] " Bjorn Helgaas
@ 2018-08-09 16:46   ` Alex_Gagniuc
  2018-08-09 18:29     ` Bjorn Helgaas
  0 siblings, 1 reply; 18+ messages in thread
From: Alex_Gagniuc @ 2018-08-09 16:46 UTC (permalink / raw)
  To: helgaas, mr.nuke.me
  Cc: bhelgaas, keith.busch, Austin.Bolen, Shyam.Iyer, fred, poza,
	linux-pci, linux-kernel

On 08/09/2018 09:16 AM, Bjorn Helgaas wrote:
> On Tue, Jul 17, 2018 at 10:31:23AM -0500, Alexandru Gagniuc wrote:
>> When we don't own AER, we shouldn't touch the AER error bits. This
>> happens unconditionally on device probe(). Clearing AER bits
>> willy-nilly might cause firmware to miss errors. Instead
>> these bits should get cleared by FFS, or via ACPI _HPX method.
>>
>> This race is mostly of theoretical significance, as it is not easy to
>> reasonably demonstrate it in testing.
>>
>> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
>> ---
>>   drivers/pci/pcie/aer.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index a2e88386af28..18037a2a8231 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -383,6 +383,9 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
>>   	if (!pci_is_pcie(dev))
>>   		return -ENODEV;
>>   
>> +	if (pcie_aer_get_firmware_first(dev))
>> +		return -EIO;
> 
> I like this patch.
> 
> Do we need the same thing in the following places that also clear AER
> status bits or write AER control bits?

In theory, every exported function would guard for this. I think the 
idea a long long time ago was that the check happens during 
initialization, and the others are not hit.

>    enable_ecrc_checking()
>    disable_ecrc_checking()

I don't immediately see how this would affect FFS, but the bits are part 
of the AER capability structure. According to the FFS model, those would 
be owned by FW, and we'd have to avoid touching them.

>    pci_cleanup_aer_uncorrect_error_status()

This probably should be guarded. It's only called from a few specific 
drivers, so the impact is not as high as being called from the core.

>    pci_aer_clear_fatal_status()

This is only called when doing fatal_recovery, right?
For practical considerations this is not an issue today. The ACPI error 
handling code currently crashes when it encounters any fatal error, so 
we wouldn't hit this in the FFS case.

If the ACPI code pulls its thinking appendage out of the other end of 
the digestive tract, then we could be hitting this in the future. For 
correctness, guarding makes sense.

The PCIe standards contact I usually talk to about these PCIe subtleties 
is currently on vacation. The number one issue was a FFS corner case 
with OS clearing bits on probe. The other functions you mention are a 
corner case of a corner case. The big fish is 
pci_cleanup_aer_error_status_regs() on probe(), and it would be nice to 
have that resolved.

I'll sync up with Austin when he gets back to see about the other 
functions though I suspect we'll end up fixing them as well.

Alex

>>   	pos = dev->aer_cap;
>>   	if (!pos)
>>   		return -EIO;
>> -- 
>> 2.14.3
>>
> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
  2018-08-09 16:46   ` Alex_Gagniuc
@ 2018-08-09 18:29     ` Bjorn Helgaas
  2018-08-09 19:00       ` Alex G.
  0 siblings, 1 reply; 18+ messages in thread
From: Bjorn Helgaas @ 2018-08-09 18:29 UTC (permalink / raw)
  To: Alex_Gagniuc
  Cc: mr.nuke.me, bhelgaas, keith.busch, Austin.Bolen, Shyam.Iyer,
	fred, poza, linux-pci, linux-kernel

On Thu, Aug 09, 2018 at 04:46:32PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> On 08/09/2018 09:16 AM, Bjorn Helgaas wrote:
> > On Tue, Jul 17, 2018 at 10:31:23AM -0500, Alexandru Gagniuc wrote:
> >> When we don't own AER, we shouldn't touch the AER error bits. This
> >> happens unconditionally on device probe(). Clearing AER bits
> >> willy-nilly might cause firmware to miss errors. Instead
> >> these bits should get cleared by FFS, or via ACPI _HPX method.
> >>
> >> This race is mostly of theoretical significance, as it is not easy to
> >> reasonably demonstrate it in testing.
> >>
> >> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> >> ---
> >>   drivers/pci/pcie/aer.c | 3 +++
> >>   1 file changed, 3 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index a2e88386af28..18037a2a8231 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -383,6 +383,9 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
> >>   	if (!pci_is_pcie(dev))
> >>   		return -ENODEV;
> >>   
> >> +	if (pcie_aer_get_firmware_first(dev))
> >> +		return -EIO;
> > 
> > I like this patch.
> > 
> > Do we need the same thing in the following places that also clear AER
> > status bits or write AER control bits?
> 
> In theory, every exported function would guard for this. I think the 
> idea a long long time ago was that the check happens during 
> initialization, and the others are not hit.
> 
> >    enable_ecrc_checking()
> >    disable_ecrc_checking()
> 
> I don't immediately see how this would affect FFS, but the bits are part 
> of the AER capability structure. According to the FFS model, those would 
> be owned by FW, and we'd have to avoid touching them.

Per ACPI v6.2, sec 18.3.2.4, the HEST may contain entries for Root
Ports that contain the FIRMWARE_FIRST flag as well as values the OS is
supposed to write to several AER capability registers.  It looks like
we currently ignore everything except the FIRMWARE_FIRST and GLOBAL
flags (ACPI_HEST_FIRMWARE_FIRST and ACPI_HEST_GLOBAL in Linux).

That seems like a pretty major screwup and more than I want to fix
right now.

> >    pci_cleanup_aer_uncorrect_error_status()
> 
> This probably should be guarded. It's only called from a few specific 
> drivers, so the impact is not as high as being called from the core.
> 
> >    pci_aer_clear_fatal_status()
> 
> This is only called when doing fatal_recovery, right?

True.  It takes a lot of analysis to convince oneself that this is not
used in the firmware-first path, so I think we should add a guard
there.

> For practical considerations this is not an issue today. The ACPI error 
> handling code currently crashes when it encounters any fatal error, so 
> we wouldn't hit this in the FFS case.

I wasn't aware the firmware-first path was *that* broken.  Are there
problem reports for this?  Is this a regression?

> The PCIe standards contact I usually talk to about these PCIe subtleties 
> is currently on vacation. The number one issue was a FFS corner case 
> with OS clearing bits on probe. The other functions you mention are a 
> corner case of a corner case. The big fish is 
> pci_cleanup_aer_error_status_regs() on probe(), and it would be nice to 
> have that resolved.
> 
> I'll sync up with Austin when he gets back to see about the other 
> functions though I suspect we'll end up fixing them as well.

I'd like to fix all the obvious cases at once (excluding the ECRC
stuff).  What do you think about the following patch?


commit 15ed68dcc26864c849a12a36db4d4771bad7991f
Author: Alexandru Gagniuc <mr.nuke.me@gmail.com>
Date:   Tue Jul 17 10:31:23 2018 -0500

    PCI/AER: Don't clear AER bits if error handling is Firmware-First
    
    If the platform requests Firmware-First error handling, firmware is
    responsible for reading and clearing AER status bits.  If OSPM also clears
    them, we may miss errors.  See ACPI v6.2, sec 18.3.2.5 and 18.4.
    
    This race is mostly of theoretical significance, as it is not easy to
    reasonably demonstrate it in testing.
    
    Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
    [bhelgaas: add similar guards to pci_cleanup_aer_uncorrect_error_status()
    and pci_aer_clear_fatal_status()]
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index c6cc855bfa22..4e823ae051a7 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -397,6 +397,9 @@ int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
 	if (!pos)
 		return -EIO;
 
+	if (pcie_aer_get_firmware_first(dev))
+		return -EIO;
+
 	/* Clear status bits for ERR_NONFATAL errors only */
 	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
 	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &sev);
@@ -417,6 +420,9 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
 	if (!pos)
 		return;
 
+	if (pcie_aer_get_firmware_first(dev))
+		return;
+
 	/* Clear status bits for ERR_FATAL errors only */
 	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
 	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &sev);
@@ -438,6 +444,9 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
 	if (!pos)
 		return -EIO;
 
+	if (pcie_aer_get_firmware_first(dev))
+		return -EIO;
+
 	port_type = pci_pcie_type(dev);
 	if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
 		pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
  2018-08-09 18:29     ` Bjorn Helgaas
@ 2018-08-09 19:00       ` Alex G.
  2018-08-09 19:18         ` Bjorn Helgaas
  0 siblings, 1 reply; 18+ messages in thread
From: Alex G. @ 2018-08-09 19:00 UTC (permalink / raw)
  To: Bjorn Helgaas, Alex_Gagniuc
  Cc: bhelgaas, keith.busch, Austin.Bolen, Shyam.Iyer, fred, poza,
	linux-pci, linux-kernel



On 08/09/2018 01:29 PM, Bjorn Helgaas wrote:
> On Thu, Aug 09, 2018 at 04:46:32PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>> On 08/09/2018 09:16 AM, Bjorn Helgaas wrote:
(snip_
>>>     enable_ecrc_checking()
>>>     disable_ecrc_checking()
>>
>> I don't immediately see how this would affect FFS, but the bits are part
>> of the AER capability structure. According to the FFS model, those would
>> be owned by FW, and we'd have to avoid touching them.
> 
> Per ACPI v6.2, sec 18.3.2.4, the HEST may contain entries for Root
> Ports that contain the FIRMWARE_FIRST flag as well as values the OS is
> supposed to write to several AER capability registers.  It looks like
> we currently ignore everything except the FIRMWARE_FIRST and GLOBAL
> flags (ACPI_HEST_FIRMWARE_FIRST and ACPI_HEST_GLOBAL in Linux).
> 
> That seems like a pretty major screwup and more than I want to fix
> right now.

The logic is not very clear, but I think it goes like this:
For GLOBAL and FFS, disable native AER everywhere.
When !GLOBAL and FFS, then only disable native AER for the root port 
described by the HEST entry.

aer_acpi_firmware_first() doesn't care about context. Though check 
aer_set_firmware_first(), where we pass a pci_device as a context.

The FFS implementations I've seen have one HEST entry, with GLOBAL and 
FFS. I have yet to see more fine-grained control of root ports. I 
suspect that if we had this finer control from HEST, we'd honor it.

I do eventually want to get a test BIOS with different HEST entries, and 
make sure things work as intended. Though BIOS team is overloaded with 
other requests.

>>>     pci_cleanup_aer_uncorrect_error_status()
>>
>> This probably should be guarded. It's only called from a few specific
>> drivers, so the impact is not as high as being called from the core.
>>
>>>     pci_aer_clear_fatal_status()
>>
>> This is only called when doing fatal_recovery, right?
> 
> True.  It takes a lot of analysis to convince oneself that this is not
> used in the firmware-first path, so I think we should add a guard
> there.

I agree. GHES has a header severity and a severity field for each 
section. All BIOSes I've seen do fatal/fatal, though a BIOS that would 
report correctable/fatal, would bypass the apei code and take us here.

>> For practical considerations this is not an issue today. The ACPI error
>> handling code currently crashes when it encounters any fatal error, so
>> we wouldn't hit this in the FFS case.
> 
> I wasn't aware the firmware-first path was *that* broken.  Are there
> problem reports for this?  Is this a regression?

It's been like this since, I believe, 3.10, and probably much earlier. 
All reports that I have seen of linux crashing on surprise hot-plug have 
been caused by the panic() call in the apei code. Dell BIOSes do an 
extreme amount of work to determine when it's safe to _not_ report 
errors to the OS, since all known OSes crash on this path.

Fun fact: there's even dedicated hardware to accomplish the above.

>> The PCIe standards contact I usually talk to about these PCIe subtleties
>> is currently on vacation. The number one issue was a FFS corner case
>> with OS clearing bits on probe. The other functions you mention are a
>> corner case of a corner case. The big fish is
>> pci_cleanup_aer_error_status_regs() on probe(), and it would be nice to
>> have that resolved.
>>
>> I'll sync up with Austin when he gets back to see about the other
>> functions though I suspect we'll end up fixing them as well.
> 
> I'd like to fix all the obvious cases at once (excluding the ECRC
> stuff).  What do you think about the following patch?

That looks very sensible to me. Thank you for updating it :).
I'm pulling in the changes right now to run some quick tests, and I 
don't expect any trouble.

Alex

> commit 15ed68dcc26864c849a12a36db4d4771bad7991f
> Author: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> Date:   Tue Jul 17 10:31:23 2018 -0500
> 
>      PCI/AER: Don't clear AER bits if error handling is Firmware-First
>      
>      If the platform requests Firmware-First error handling, firmware is
>      responsible for reading and clearing AER status bits.  If OSPM also clears
>      them, we may miss errors.  See ACPI v6.2, sec 18.3.2.5 and 18.4.
>      
>      This race is mostly of theoretical significance, as it is not easy to
>      reasonably demonstrate it in testing.
>      
>      Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
>      [bhelgaas: add similar guards to pci_cleanup_aer_uncorrect_error_status()
>      and pci_aer_clear_fatal_status()]
>      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index c6cc855bfa22..4e823ae051a7 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -397,6 +397,9 @@ int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
>   	if (!pos)
>   		return -EIO;
>   
> +	if (pcie_aer_get_firmware_first(dev))
> +		return -EIO;
> +
>   	/* Clear status bits for ERR_NONFATAL errors only */
>   	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
>   	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &sev);
> @@ -417,6 +420,9 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
>   	if (!pos)
>   		return;
>   
> +	if (pcie_aer_get_firmware_first(dev))
> +		return;
> +
>   	/* Clear status bits for ERR_FATAL errors only */
>   	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
>   	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_SEVER, &sev);
> @@ -438,6 +444,9 @@ int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
>   	if (!pos)
>   		return -EIO;
>   
> +	if (pcie_aer_get_firmware_first(dev))
> +		return -EIO;
> +
>   	port_type = pci_pcie_type(dev);
>   	if (port_type == PCI_EXP_TYPE_ROOT_PORT) {
>   		pci_read_config_dword(dev, pos + PCI_ERR_ROOT_STATUS, &status);
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
  2018-08-09 19:00       ` Alex G.
@ 2018-08-09 19:18         ` Bjorn Helgaas
  2018-08-09 19:42           ` Alex G.
  0 siblings, 1 reply; 18+ messages in thread
From: Bjorn Helgaas @ 2018-08-09 19:18 UTC (permalink / raw)
  To: Alex G.
  Cc: Alex_Gagniuc, bhelgaas, keith.busch, Austin.Bolen, Shyam.Iyer,
	fred, poza, linux-pci, linux-kernel

On Thu, Aug 09, 2018 at 02:00:23PM -0500, Alex G. wrote:
> On 08/09/2018 01:29 PM, Bjorn Helgaas wrote:
> > On Thu, Aug 09, 2018 at 04:46:32PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> > > On 08/09/2018 09:16 AM, Bjorn Helgaas wrote:
> (snip_
> > > >     enable_ecrc_checking()
> > > >     disable_ecrc_checking()
> > > 
> > > I don't immediately see how this would affect FFS, but the bits are part
> > > of the AER capability structure. According to the FFS model, those would
> > > be owned by FW, and we'd have to avoid touching them.
> > 
> > Per ACPI v6.2, sec 18.3.2.4, the HEST may contain entries for Root
> > Ports that contain the FIRMWARE_FIRST flag as well as values the OS is
> > supposed to write to several AER capability registers.  It looks like
> > we currently ignore everything except the FIRMWARE_FIRST and GLOBAL
> > flags (ACPI_HEST_FIRMWARE_FIRST and ACPI_HEST_GLOBAL in Linux).
> > 
> > That seems like a pretty major screwup and more than I want to fix
> > right now.
> 
> The logic is not very clear, but I think it goes like this:
> For GLOBAL and FFS, disable native AER everywhere.
> When !GLOBAL and FFS, then only disable native AER for the root port
> described by the HEST entry.

I agree the code is convoluted, but that sounds right to me.

What I meant is that we ignore the values the HEST entry tells us
we're supposed to write to Device Control and the AER Uncorrectable
Error Mask, Uncorrectable Error Severity, Correctable Error Mask, and
AER Capabilities and Control.

> > > For practical considerations this is not an issue today. The ACPI error
> > > handling code currently crashes when it encounters any fatal error, so
> > > we wouldn't hit this in the FFS case.
> > 
> > I wasn't aware the firmware-first path was *that* broken.  Are there
> > problem reports for this?  Is this a regression?
> 
> It's been like this since, I believe, 3.10, and probably much earlier. All
> reports that I have seen of linux crashing on surprise hot-plug have been
> caused by the panic() call in the apei code. Dell BIOSes do an extreme
> amount of work to determine when it's safe to _not_ report errors to the OS,
> since all known OSes crash on this path.

Oh, is this the __ghes_panic() path?  If so, I'm going to turn away
and plead ignorance unless the PCI core is doing something wrong that
eventually results in that panic.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] PCI/AER: Do not clear AER bits if we don't own AER
  2018-08-09 19:18         ` Bjorn Helgaas
@ 2018-08-09 19:42           ` Alex G.
  0 siblings, 0 replies; 18+ messages in thread
From: Alex G. @ 2018-08-09 19:42 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Alex_Gagniuc, bhelgaas, keith.busch, Austin.Bolen, Shyam.Iyer,
	fred, poza, linux-pci, linux-kernel



On 08/09/2018 02:18 PM, Bjorn Helgaas wrote:
> On Thu, Aug 09, 2018 at 02:00:23PM -0500, Alex G. wrote:
>> On 08/09/2018 01:29 PM, Bjorn Helgaas wrote:
>>> On Thu, Aug 09, 2018 at 04:46:32PM +0000, Alex_Gagniuc@Dellteam.com wrote:
>>>> On 08/09/2018 09:16 AM, Bjorn Helgaas wrote:
>> (snip_
>>>>>      enable_ecrc_checking()
>>>>>      disable_ecrc_checking()
>>>>
>>>> I don't immediately see how this would affect FFS, but the bits are part
>>>> of the AER capability structure. According to the FFS model, those would
>>>> be owned by FW, and we'd have to avoid touching them.
>>>
>>> Per ACPI v6.2, sec 18.3.2.4, the HEST may contain entries for Root
>>> Ports that contain the FIRMWARE_FIRST flag as well as values the OS is
>>> supposed to write to several AER capability registers.  It looks like
>>> we currently ignore everything except the FIRMWARE_FIRST and GLOBAL
>>> flags (ACPI_HEST_FIRMWARE_FIRST and ACPI_HEST_GLOBAL in Linux).
>>>
>>> That seems like a pretty major screwup and more than I want to fix
>>> right now.
>>
>> The logic is not very clear, but I think it goes like this:
>> For GLOBAL and FFS, disable native AER everywhere.
>> When !GLOBAL and FFS, then only disable native AER for the root port
>> described by the HEST entry.
> 
> I agree the code is convoluted, but that sounds right to me.
> 
> What I meant is that we ignore the values the HEST entry tells us
> we're supposed to write to Device Control and the AER Uncorrectable
> Error Mask, Uncorrectable Error Severity, Correctable Error Mask, and
> AER Capabilities and Control.

Wait, what? _HPX has the same information. This is madness!
Since root ports are not hot-swappable, the BIOS normally programs those 
registers. Even if linux doesn't apply said masks, the programming BIOS 
did should be sufficient to have *cough* correct *cough* behavior.

>>>> For practical considerations this is not an issue today. The ACPI error
>>>> handling code currently crashes when it encounters any fatal error, so
>>>> we wouldn't hit this in the FFS case.
>>>
>>> I wasn't aware the firmware-first path was *that* broken.  Are there
>>> problem reports for this?  Is this a regression?
>>
>> It's been like this since, I believe, 3.10, and probably much earlier. All
>> reports that I have seen of linux crashing on surprise hot-plug have been
>> caused by the panic() call in the apei code. Dell BIOSes do an extreme
>> amount of work to determine when it's safe to _not_ report errors to the OS,
>> since all known OSes crash on this path.
> 
> Oh, is this the __ghes_panic() path?  If so, I'm going to turn away
> and plead ignorance unless the PCI core is doing something wrong that
> eventually results in that panic.

I agree, and I'll quote you on that!

Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-08-09 19:42 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-17 15:31 [PATCH] PCI/AER: Do not clear AER bits if we don't own AER Alexandru Gagniuc
2018-07-17 15:41 ` Sinan Kaya
2018-07-19 15:55   ` Alex G.
2018-07-19 16:58     ` Sinan Kaya
2018-07-19 19:56       ` Alex G.
2018-07-23 16:52 ` [PATCH v2] " Alexandru Gagniuc
2018-07-24 15:59   ` Alex G.
2018-07-30 23:35     ` [PATCH v3] " Alexandru Gagniuc
2018-08-08  1:14       ` Bjorn Helgaas
2018-08-08  3:46         ` Alex G.
2018-07-24 17:08   ` [PATCH v2] " kbuild test robot
2018-07-25  1:03   ` kbuild test robot
2018-08-09 14:15 ` [PATCH] " Bjorn Helgaas
2018-08-09 16:46   ` Alex_Gagniuc
2018-08-09 18:29     ` Bjorn Helgaas
2018-08-09 19:00       ` Alex G.
2018-08-09 19:18         ` Bjorn Helgaas
2018-08-09 19:42           ` Alex G.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).