linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH for-next 0/2] Allow drivers to configure AER registers
@ 2019-04-10 12:34 Dennis Dalessandro
  2019-04-10 12:34 ` [PATCH for-next 1/2] PCI/AER: Helper function for configuring " Dennis Dalessandro
  2019-04-10 12:35 ` [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal Dennis Dalessandro
  0 siblings, 2 replies; 11+ messages in thread
From: Dennis Dalessandro @ 2019-04-10 12:34 UTC (permalink / raw)
  To: bhelgaas
  Cc: Mike Marciniszyn, Andriy Shevchenko, jgg, linux-rdma, linux-pci,
	Michael J. Ruhl, dledford, Kamenee Arumugam

Hi Folks,

Here is a set of patches from Kamenee that adds an API for
configuring AER registers. At least one driver needs to be able to change the
default settings and a patch (2/2) is included here. Sending both to linux-pci
since there really isn't a reason for the first to exist without a user.
However normally changes to the hfi1 driver would flow through linux-rdma.

This series is based on the PCI branch 'next' but I expect would apply the same
to rdma-next as well if we wanted to go that route.

---

Kamenee Arumugam (2):
      PCI/AER: Helper function for configuring AER registers
      IB/hfi1: Make Unsupported Request error non-fatal


 drivers/infiniband/hw/hfi1/pcie.c |    1 +
 drivers/pci/pcie/aer.c            |   33 +++++++++++++++++++++++++++++++++
 include/linux/aer.h               |   17 +++++++++++++++++
 3 files changed, 51 insertions(+), 0 deletions(-)

--
-Denny

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH for-next 1/2] PCI/AER: Helper function for configuring AER registers
  2019-04-10 12:34 [PATCH for-next 0/2] Allow drivers to configure AER registers Dennis Dalessandro
@ 2019-04-10 12:34 ` Dennis Dalessandro
  2019-04-10 13:46   ` Andriy Shevchenko
  2019-04-10 12:35 ` [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal Dennis Dalessandro
  1 sibling, 1 reply; 11+ messages in thread
From: Dennis Dalessandro @ 2019-04-10 12:34 UTC (permalink / raw)
  To: bhelgaas
  Cc: Mike Marciniszyn, Andriy Shevchenko, jgg, linux-rdma, linux-pci,
	Michael J. Ruhl, dledford, Kamenee Arumugam

From: Kamenee Arumugam <kamenee.arumugam@intel.com>

In some use cases, drivers may need to change the default
AER settings. Introduce a helper function for setting
and clearing the AER configuration registers.

Access to the AER registers is not serialized. If multiple
access is required, correct locking must be done.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Cc: Andriy Shevchenko <andriy.shevchenko@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/pci/pcie/aer.c |   33 +++++++++++++++++++++++++++++++++
 include/linux/aer.h    |   17 +++++++++++++++++
 2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index f8fc211..b0435f9 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -353,6 +353,39 @@ int pci_enable_pcie_error_reporting(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
 
+/**
+ * pcie_aer_clear_and_set_dword - Set or clear AER registers
+ * @dev: pci dev data
+ * @pos: The offset of AER registers
+ * @clear: The bits to clear
+ * @set: The bits to set
+ *
+ * This function must only be used by the driver owning the device.
+ * Return:
+ * * 0 - on success
+ * * Negative error code - on generic failures
+ * * Positive error code - on PCI access errors
+ */
+int pcie_aer_clear_and_set_dword(struct pci_dev *dev, int pos,
+				 u32 clear, u32 set)
+{
+	u32 data;
+	int ret;
+
+	if (!dev->aer_cap)
+		return -EIO;
+
+	ret = pci_read_config_dword(dev, dev->aer_cap + pos, &data);
+	if (!ret) {
+		data &= ~clear;
+		data |= set;
+		return pci_write_config_dword(dev, dev->aer_cap + pos, data);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pcie_aer_clear_and_set_dword);
+
 int pci_disable_pcie_error_reporting(struct pci_dev *dev)
 {
 	if (pcie_aer_get_firmware_first(dev))
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 514bffa..e21d65c 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -46,6 +46,8 @@ struct aer_capability_regs {
 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
 int pci_cleanup_aer_error_status_regs(struct pci_dev *dev);
+int pcie_aer_clear_and_set_dword(struct pci_dev *dev, int pos,
+				 u32 clear, u32 set);
 #else
 static inline int pci_enable_pcie_error_reporting(struct pci_dev *dev)
 {
@@ -63,6 +65,12 @@ static inline int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
 {
 	return -EINVAL;
 }
+
+static inline int pcie_aer_clear_and_set_dword(struct pci_dev *dev, int pos,
+					       u32 clear, u32 set)
+{
+	return -EINVAL;
+}
 #endif
 
 void cper_print_aer(struct pci_dev *dev, int aer_severity,
@@ -70,5 +78,14 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
 int cper_severity_to_aer(int cper_severity);
 void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
 		       int severity, struct aer_capability_regs *aer_regs);
+static inline int pcie_aer_set_dword(struct pci_dev *dev, int pos, u32 set)
+{
+	return pcie_aer_clear_and_set_dword(dev, pos, 0, set);
+}
+
+static inline int pcie_aer_clear_dword(struct pci_dev *dev, int pos, u32 clear)
+{
+	return pcie_aer_clear_and_set_dword(dev, pos, clear, 0);
+}
 #endif //_AER_H_
 


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal
  2019-04-10 12:34 [PATCH for-next 0/2] Allow drivers to configure AER registers Dennis Dalessandro
  2019-04-10 12:34 ` [PATCH for-next 1/2] PCI/AER: Helper function for configuring " Dennis Dalessandro
@ 2019-04-10 12:35 ` Dennis Dalessandro
  2019-04-10 19:29   ` Bjorn Helgaas
  1 sibling, 1 reply; 11+ messages in thread
From: Dennis Dalessandro @ 2019-04-10 12:35 UTC (permalink / raw)
  To: bhelgaas
  Cc: jgg, linux-rdma, linux-pci, Michael J. Ruhl, dledford, Kamenee Arumugam

From: Kamenee Arumugam <kamenee.arumugam@intel.com>

For hfi1, the unsupported request error is not considered a fatal
error. When the PCIe advanced error reporting capability (AER) is
configured to report unsupported requests as fatal, the system will
hang on this error.

Set Unsupported Request Error bit in Uncorrectable Error Mask
register to disable error reporting to the PCIe root complex.

Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/pcie.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
index c96d193..a033e28 100644
--- a/drivers/infiniband/hw/hfi1/pcie.c
+++ b/drivers/infiniband/hw/hfi1/pcie.c
@@ -114,6 +114,7 @@ int hfi1_pcie_init(struct hfi1_devdata *dd)
 	}
 
 	pci_set_master(pdev);
+	pcie_aer_set_dword(pdev, PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_UNSUP);
 	(void)pci_enable_pcie_error_reporting(pdev);
 	return 0;
 


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next 1/2] PCI/AER: Helper function for configuring AER registers
  2019-04-10 12:34 ` [PATCH for-next 1/2] PCI/AER: Helper function for configuring " Dennis Dalessandro
@ 2019-04-10 13:46   ` Andriy Shevchenko
  0 siblings, 0 replies; 11+ messages in thread
From: Andriy Shevchenko @ 2019-04-10 13:46 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: bhelgaas, Mike Marciniszyn, jgg, linux-rdma, linux-pci,
	Michael J. Ruhl, dledford, Kamenee Arumugam

On Wed, Apr 10, 2019 at 05:34:50AM -0700, Dennis Dalessandro wrote:
> From: Kamenee Arumugam <kamenee.arumugam@intel.com>
> 
> In some use cases, drivers may need to change the default
> AER settings. Introduce a helper function for setting
> and clearing the AER configuration registers.
> 
> Access to the AER registers is not serialized. If multiple
> access is required, correct locking must be done.
> 

FWIW,
Reviewed-by: Andriy Shevchenko <andriy.shevchenko@intel.com>

Thanks.

> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> Cc: Andriy Shevchenko <andriy.shevchenko@intel.com>
> Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
> ---
>  drivers/pci/pcie/aer.c |   33 +++++++++++++++++++++++++++++++++
>  include/linux/aer.h    |   17 +++++++++++++++++
>  2 files changed, 50 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index f8fc211..b0435f9 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -353,6 +353,39 @@ int pci_enable_pcie_error_reporting(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(pci_enable_pcie_error_reporting);
>  
> +/**
> + * pcie_aer_clear_and_set_dword - Set or clear AER registers
> + * @dev: pci dev data
> + * @pos: The offset of AER registers
> + * @clear: The bits to clear
> + * @set: The bits to set
> + *
> + * This function must only be used by the driver owning the device.
> + * Return:
> + * * 0 - on success
> + * * Negative error code - on generic failures
> + * * Positive error code - on PCI access errors
> + */
> +int pcie_aer_clear_and_set_dword(struct pci_dev *dev, int pos,
> +				 u32 clear, u32 set)
> +{
> +	u32 data;
> +	int ret;
> +
> +	if (!dev->aer_cap)
> +		return -EIO;
> +
> +	ret = pci_read_config_dword(dev, dev->aer_cap + pos, &data);
> +	if (!ret) {
> +		data &= ~clear;
> +		data |= set;
> +		return pci_write_config_dword(dev, dev->aer_cap + pos, data);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(pcie_aer_clear_and_set_dword);
> +
>  int pci_disable_pcie_error_reporting(struct pci_dev *dev)
>  {
>  	if (pcie_aer_get_firmware_first(dev))
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 514bffa..e21d65c 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -46,6 +46,8 @@ struct aer_capability_regs {
>  int pci_disable_pcie_error_reporting(struct pci_dev *dev);
>  int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
>  int pci_cleanup_aer_error_status_regs(struct pci_dev *dev);
> +int pcie_aer_clear_and_set_dword(struct pci_dev *dev, int pos,
> +				 u32 clear, u32 set);
>  #else
>  static inline int pci_enable_pcie_error_reporting(struct pci_dev *dev)
>  {
> @@ -63,6 +65,12 @@ static inline int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
>  {
>  	return -EINVAL;
>  }
> +
> +static inline int pcie_aer_clear_and_set_dword(struct pci_dev *dev, int pos,
> +					       u32 clear, u32 set)
> +{
> +	return -EINVAL;
> +}
>  #endif
>  
>  void cper_print_aer(struct pci_dev *dev, int aer_severity,
> @@ -70,5 +78,14 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
>  int cper_severity_to_aer(int cper_severity);
>  void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>  		       int severity, struct aer_capability_regs *aer_regs);
> +static inline int pcie_aer_set_dword(struct pci_dev *dev, int pos, u32 set)
> +{
> +	return pcie_aer_clear_and_set_dword(dev, pos, 0, set);
> +}
> +
> +static inline int pcie_aer_clear_dword(struct pci_dev *dev, int pos, u32 clear)
> +{
> +	return pcie_aer_clear_and_set_dword(dev, pos, clear, 0);
> +}
>  #endif //_AER_H_
>  
> 

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal
  2019-04-10 12:35 ` [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal Dennis Dalessandro
@ 2019-04-10 19:29   ` Bjorn Helgaas
       [not found]     ` <14063C7AD467DE4B82DEDB5C278E8663BE6A1B14@FMSMSX108.amr.corp.intel.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Bjorn Helgaas @ 2019-04-10 19:29 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: jgg, linux-rdma, linux-pci, Michael J. Ruhl, dledford, Kamenee Arumugam

Hi Dennis,

On Wed, Apr 10, 2019 at 05:35:01AM -0700, Dennis Dalessandro wrote:
> From: Kamenee Arumugam <kamenee.arumugam@intel.com>
> 
> For hfi1, the unsupported request error is not considered a fatal
> error. When the PCIe advanced error reporting capability (AER) is
> configured to report unsupported requests as fatal, the system will
> hang on this error.

I know there are a few drivers that fiddle with AER bits, but that
makes me a little bit nervous because error handling is more than just
a driver issue.  It involves the PCI core and the platform firmware as
well.

Anyway, let's figure out more about this particular case.  Unsupported
Request is a PCIe protocol-level issue.  You're masking it in the HFI
adapter, which I guess means you want to prevent it from reporting UR.
So the HFI is receiving a TLP that it doesn't support?

What exactly is causing the UR?  Is it something the driver could
potentially avoid, e.g., an AtomicOp that HFI doesn't support?  I have
a vague notion that InfiniBand allows some sort of direct user-space
access to hardware; is there something there that can cause a UR?

The system hang sounds like a separate problem that should also be
fixed.  Even if HFI signals a UR error, I would not expect a system
hang.

Bjorn

> Set Unsupported Request Error bit in Uncorrectable Error Mask
> register to disable error reporting to the PCIe root complex.
> 
> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
> Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
> ---
>  drivers/infiniband/hw/hfi1/pcie.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/hfi1/pcie.c b/drivers/infiniband/hw/hfi1/pcie.c
> index c96d193..a033e28 100644
> --- a/drivers/infiniband/hw/hfi1/pcie.c
> +++ b/drivers/infiniband/hw/hfi1/pcie.c
> @@ -114,6 +114,7 @@ int hfi1_pcie_init(struct hfi1_devdata *dd)
>  	}
>  
>  	pci_set_master(pdev);
> +	pcie_aer_set_dword(pdev, PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_UNSUP);
>  	(void)pci_enable_pcie_error_reporting(pdev);
>  	return 0;
>  
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal
       [not found]     ` <14063C7AD467DE4B82DEDB5C278E8663BE6A1B14@FMSMSX108.amr.corp.intel.com>
@ 2019-04-11 18:22       ` Arumugam, Kamenee
  2019-04-11 18:29         ` Jason Gunthorpe
  0 siblings, 1 reply; 11+ messages in thread
From: Arumugam, Kamenee @ 2019-04-11 18:22 UTC (permalink / raw)
  To: Dalessandro, Dennis, bhelgaas
  Cc: jgg, linux-rdma, linux-pci, Ruhl, Michael J, dledford, Arumugam, Kamenee


-----Original Message-----
From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Bjorn Helgaas
Sent: Wednesday, April 10, 2019 3:30 PM
To: Dalessandro, Dennis <dennis.dalessandro@intel.com>
Cc: jgg@ziepe.ca; linux-rdma@vger.kernel.org; linux-pci@vger.kernel.org; Ruhl, Michael J <michael.j.ruhl@intel.com>; dledford@redhat.com; Arumugam, Kamenee <kamenee.arumugam@intel.com>
Subject: Re: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal

Hi Bjorn,

> I know there are a few drivers that fiddle with AER bits, but that makes me a little bit nervous because error handling is more than just a > driver issue.  It involves the PCI core and the platform firmware as well.

> Anyway, let's figure out more about this particular case.  Unsupported 
> Request is a PCIe protocol-level issue.  You're masking it in the HFI adapter, which I guess means you want to prevent it from reporting UR.
> So the HFI is receiving a TLP that it doesn't support?

Yes, HFI is receiving a TLP with unsupported request error.

> What exactly is causing the UR?  Is it something the driver could potentially avoid, e.g., an AtomicOp that HFI doesn't support?  I have a > vague notion that InfiniBand allows some sort of direct user-space access to hardware; is there something there that can cause a UR?

HFI PCIe BAR are mapped to user space to implement kernel bypass for MPI/PSM jobs. In this case, user-level application is making spurious read accesses (invalid width access) to this memory mapping causing the device to report an unsupported request error through AER. The spurious read accesses may be due to errant application behavior (e.g. reading beyond the end of an array).

> The system hang sounds like a separate problem that should also be fixed.  Even if HFI signals a UR error, I would not expect a system > > hang.
 
We haven't root cause the system hang but it doesn't appear to be related to our driver.


>> Set Unsupported Request Error bit in Uncorrectable Error Mask register 
>> to disable error reporting to the PCIe root complex.
>> 
>> Reviewed-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
>> Signed-off-by: Kamenee Arumugam <kamenee.arumugam@intel.com>
>> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
>> ---
>>  drivers/infiniband/hw/hfi1/pcie.c |    1 +
>>  1 files changed, 1 insertions(+), 0 deletions(-)
>> 
>> diff --git a/drivers/infiniband/hw/hfi1/pcie.c
>> b/drivers/infiniband/hw/hfi1/pcie.c
>> index c96d193..a033e28 100644
>> --- a/drivers/infiniband/hw/hfi1/pcie.c
>> +++ b/drivers/infiniband/hw/hfi1/pcie.c
>> @@ -114,6 +114,7 @@ int hfi1_pcie_init(struct hfi1_devdata *dd)
>>  	}
>>  
>>  	pci_set_master(pdev);
>> +	pcie_aer_set_dword(pdev, PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_UNSUP);
>>  	(void)pci_enable_pcie_error_reporting(pdev);
>>  	return 0;
>>  
>> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal
  2019-04-11 18:22       ` Arumugam, Kamenee
@ 2019-04-11 18:29         ` Jason Gunthorpe
  2019-04-11 20:37           ` Arumugam, Kamenee
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2019-04-11 18:29 UTC (permalink / raw)
  To: Arumugam, Kamenee
  Cc: Dalessandro, Dennis, bhelgaas, linux-rdma, linux-pci, Ruhl,
	Michael J, dledford

On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:

> > What exactly is causing the UR?  Is it something the driver could
> > potentially avoid, e.g., an AtomicOp that HFI doesn't support?  I
> > have a > vague notion that InfiniBand allows some sort of direct
> > user-space access to hardware; is there something there that can
> > cause a UR?
> 
> HFI PCIe BAR are mapped to user space to implement kernel bypass for
> MPI/PSM jobs. In this case, user-level application is making
> spurious read accesses (invalid width access) to this memory mapping
> causing the device to report an unsupported request error through
> AER. The spurious read accesses may be due to errant application
> behavior (e.g. reading beyond the end of an array).

This is a device bug then. 

A RDMA device must accept and respond to all TLPs that the CPU could
create for the user accessible BAR pages.

A user process must not be able to crash the CPU or make the device
malfunction by accessing the exposed BAR page. This includes a broad
range of topics, like mis-aligned acceses, SSE instructions, atomics,
etc.

Is blocking AER even enough here? If the device isn't generating a
reasonable reply I have a bad feeling worse will happen.

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* RE: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal
  2019-04-11 18:29         ` Jason Gunthorpe
@ 2019-04-11 20:37           ` Arumugam, Kamenee
  2019-04-12 13:55             ` Jason Gunthorpe
  0 siblings, 1 reply; 11+ messages in thread
From: Arumugam, Kamenee @ 2019-04-11 20:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Dalessandro, Dennis, bhelgaas, linux-rdma, linux-pci, Ruhl,
	Michael J, dledford, Arumugam, Kamenee


-----Original Message-----
From: Jason Gunthorpe [mailto:jgg@ziepe.ca] 
Sent: Thursday, April 11, 2019 2:30 PM
To: Arumugam, Kamenee <kamenee.arumugam@intel.com>
Cc: Dalessandro, Dennis <dennis.dalessandro@intel.com>; bhelgaas@google.com; linux-rdma@vger.kernel.org; linux-pci@vger.kernel.org; Ruhl, Michael J <michael.j.ruhl@intel.com>; dledford@redhat.com
Subject: Re: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal

On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:

> This is a device bug then. 

> A RDMA device must accept and respond to all TLPs that the CPU could create for the user accessible BAR pages.

> A user process must not be able to crash the CPU or make the device malfunction by accessing the exposed BAR page. This includes a broad range of topics, like mis-aligned acceses, SSE instructions, atomics, >etc.

> Is blocking AER even enough here? If the device isn't generating a reasonable reply I have a bad feeling worse will happen.

After blocking unsupported request error, we don't see any other issue including no system hang. 



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal
  2019-04-11 20:37           ` Arumugam, Kamenee
@ 2019-04-12 13:55             ` Jason Gunthorpe
  2019-04-15 18:47               ` Dennis Dalessandro
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2019-04-12 13:55 UTC (permalink / raw)
  To: Arumugam, Kamenee
  Cc: Dalessandro, Dennis, bhelgaas, linux-rdma, linux-pci, Ruhl,
	Michael J, dledford

On Thu, Apr 11, 2019 at 08:37:53PM +0000, Arumugam, Kamenee wrote:

> On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:
> 
> > This is a device bug then. 
> 
> > A RDMA device must accept and respond to all TLPs that the CPU could create for the user accessible BAR pages.
> 
> > A user process must not be able to crash the CPU or make the device malfunction by accessing the exposed BAR page. This includes a broad range of topics, like mis-aligned acceses, SSE instructions, atomics, >etc.
> 
> > Is blocking AER even enough here? If the device isn't generating a reasonable reply I have a bad feeling worse will happen.
> 
> After blocking unsupported request error, we don't see any other issue including no system hang. 

Are you specifically testing all the special TLPs the CPU can produce?

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal
  2019-04-12 13:55             ` Jason Gunthorpe
@ 2019-04-15 18:47               ` Dennis Dalessandro
  2019-04-15 21:46                 ` Bjorn Helgaas
  0 siblings, 1 reply; 11+ messages in thread
From: Dennis Dalessandro @ 2019-04-15 18:47 UTC (permalink / raw)
  To: Jason Gunthorpe, Arumugam, Kamenee
  Cc: bhelgaas, linux-rdma, linux-pci, Ruhl, Michael J, dledford

On 4/12/2019 9:55 AM, Jason Gunthorpe wrote:
> On Thu, Apr 11, 2019 at 08:37:53PM +0000, Arumugam, Kamenee wrote:
> 
>> On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:
>>
>>> This is a device bug then.
>>
>>> A RDMA device must accept and respond to all TLPs that the CPU could create for the user accessible BAR pages.
>>
>>> A user process must not be able to crash the CPU or make the device malfunction by accessing the exposed BAR page. This includes a broad range of topics, like mis-aligned acceses, SSE instructions, atomics, >etc.
>>
>>> Is blocking AER even enough here? If the device isn't generating a reasonable reply I have a bad feeling worse will happen.
>>
>> After blocking unsupported request error, we don't see any other issue including no system hang.
> 
> Are you specifically testing all the special TLPs the CPU can produce?

All the special TLPs should have been tested. This however seems to be a 
missed test case. Not that surprising though given differences in BIOS 
and things of that nature that something falls through the cracks and is 
extra hard to find.

-Denny


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal
  2019-04-15 18:47               ` Dennis Dalessandro
@ 2019-04-15 21:46                 ` Bjorn Helgaas
  0 siblings, 0 replies; 11+ messages in thread
From: Bjorn Helgaas @ 2019-04-15 21:46 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: Jason Gunthorpe, Arumugam, Kamenee, linux-rdma, linux-pci, Ruhl,
	Michael J, dledford

On Mon, Apr 15, 2019 at 02:47:01PM -0400, Dennis Dalessandro wrote:
> On 4/12/2019 9:55 AM, Jason Gunthorpe wrote:
> > On Thu, Apr 11, 2019 at 08:37:53PM +0000, Arumugam, Kamenee wrote:
> > > On Thu, Apr 11, 2019 at 06:22:45PM +0000, Arumugam, Kamenee wrote:
> > > 
> > > > This is a device bug then.
> > > 
> > > > A RDMA device must accept and respond to all TLPs that the CPU
> > > > could create for the user accessible BAR pages.
> > > 
> > > > A user process must not be able to crash the CPU or make the
> > > > device malfunction by accessing the exposed BAR page. This
> > > > includes a broad range of topics, like mis-aligned acceses,
> > > > SSE instructions, atomics, >etc.
> > > 
> > > > Is blocking AER even enough here? If the device isn't
> > > > generating a reasonable reply I have a bad feeling worse will
> > > > happen.
> > > 
> > > After blocking unsupported request error, we don't see any other
> > > issue including no system hang.
> > 
> > Are you specifically testing all the special TLPs the CPU can
> > produce?
> 
> All the special TLPs should have been tested. This however seems to
> be a missed test case. Not that surprising though given differences
> in BIOS and things of that nature that something falls through the
> cracks and is extra hard to find.

Is there a published erratum for this?  I don't have warm fuzzies yet
that we actually know the root cause here.

Kamenee said the problem case was:

  user-level application is making spurious read accesses (invalid
  width access) to this memory mapping causing the device to report an
  unsupported request error through AER.

So I guess that means the application performed a read and got invalid
data back?  I think the Root Complex had to supply *some* data to
complete the CPU's read, and since the HFI responded with UR instead
of data, the RC probably fabricated something.  Many RCs fabricate ~0,
but I don't think that's actually required by the spec, so I'm
doubtful that the application can reliably detect this.

I'd be really surprised that something as obvious as an invalid width
wasn't tested, especially if this is intended for direct mapping into
user applications.

Bjorn

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-04-15 21:46 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-10 12:34 [PATCH for-next 0/2] Allow drivers to configure AER registers Dennis Dalessandro
2019-04-10 12:34 ` [PATCH for-next 1/2] PCI/AER: Helper function for configuring " Dennis Dalessandro
2019-04-10 13:46   ` Andriy Shevchenko
2019-04-10 12:35 ` [PATCH for-next 2/2] IB/hfi1: Make Unsupported Request error non-fatal Dennis Dalessandro
2019-04-10 19:29   ` Bjorn Helgaas
     [not found]     ` <14063C7AD467DE4B82DEDB5C278E8663BE6A1B14@FMSMSX108.amr.corp.intel.com>
2019-04-11 18:22       ` Arumugam, Kamenee
2019-04-11 18:29         ` Jason Gunthorpe
2019-04-11 20:37           ` Arumugam, Kamenee
2019-04-12 13:55             ` Jason Gunthorpe
2019-04-15 18:47               ` Dennis Dalessandro
2019-04-15 21:46                 ` Bjorn Helgaas

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).