All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices
@ 2019-01-24 13:50 Dongdong Liu
  2019-01-24 18:18 ` Sinan Kaya
  0 siblings, 1 reply; 13+ messages in thread
From: Dongdong Liu @ 2019-01-24 13:50 UTC (permalink / raw)
  To: helgaas, keith.busch; +Cc: linux-pci, linuxarm, Dongdong Liu, Bjorn Helgaas

The patch [1] PCI/ERR: Run error recovery callbacks for all affected
devices have broken the non-fatal error handling logic in patch [2].
For non-fatal error, link is reliable, so no need to reset link,
handle non-fatal error for all subordinates seems incorrect.
Restore the non-fatal errors process logic.

[1] PCI/ERR: Run error recovery callbacks for all affected devices   #4.20
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bfcb79fca19d267712e425af1dd48812c40dec0c

[2] PCI/AER: Report non-fatal errors only to the affected endpoint  #4.15
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.0-rc2&id=86acc790717fb60fb51ea3095084e331d8711c74

Fixes: bfcb79fca19d ("PCI/ERR: Run error recovery callbacks for all affected devices")
Reported-by: Xiaofei Tan <tanxiaofei@huawei.com>
Signed-off-by: Dongdong Liu <liudongdong3@huawei.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pcie/err.c | 37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 773197a..9de3880 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -187,7 +187,8 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 		      u32 service)
 {
 	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
-	struct pci_bus *bus;
+	struct pci_bus *bus = dev->bus;
+	struct pci_dev *bridge = dev;
 
 	/*
 	 * Error recovery runs on all subordinates of the first downstream port.
@@ -195,23 +196,33 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 	 */
 	if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
 	      pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM))
-		dev = dev->bus->self;
-	bus = dev->subordinate;
+		bridge = bus->self;
+
+	if (bridge)
+		bus = bridge->subordinate;
 
 	pci_dbg(dev, "broadcast error_detected message\n");
 	if (state == pci_channel_io_frozen)
 		pci_walk_bus(bus, report_frozen_detected, &status);
-	else
-		pci_walk_bus(bus, report_normal_detected, &status);
+	else {
+		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
+			report_normal_detected(dev, &status);
+		else
+			pci_walk_bus(bus, report_normal_detected, &status);
+	}
 
 	if (state == pci_channel_io_frozen &&
-	    reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
+	    reset_link(bridge, service) != PCI_ERS_RESULT_RECOVERED)
 		goto failed;
 
 	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
 		status = PCI_ERS_RESULT_RECOVERED;
 		pci_dbg(dev, "broadcast mmio_enabled message\n");
-		pci_walk_bus(bus, report_mmio_enabled, &status);
+		if (state == pci_channel_io_normal &&
+		    dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
+			report_mmio_enabled(dev, &status);
+		else
+			pci_walk_bus(bus, report_mmio_enabled, &status);
 	}
 
 	if (status == PCI_ERS_RESULT_NEED_RESET) {
@@ -222,14 +233,22 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 		 */
 		status = PCI_ERS_RESULT_RECOVERED;
 		pci_dbg(dev, "broadcast slot_reset message\n");
-		pci_walk_bus(bus, report_slot_reset, &status);
+		if (state == pci_channel_io_normal &&
+		    dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
+			report_slot_reset(dev, &status);
+		else
+			pci_walk_bus(bus, report_slot_reset, &status);
 	}
 
 	if (status != PCI_ERS_RESULT_RECOVERED)
 		goto failed;
 
 	pci_dbg(dev, "broadcast resume message\n");
-	pci_walk_bus(bus, report_resume, &status);
+	if (state == pci_channel_io_normal &&
+	    dev->hdr_type != PCI_HEADER_TYPE_BRIDGE)
+		report_resume(dev, &status);
+	else
+		pci_walk_bus(bus, report_resume, &status);
 
 	pci_aer_clear_device_status(dev);
 	pci_cleanup_aer_uncorrect_error_status(dev);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices
  2019-01-24 13:50 [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices Dongdong Liu
@ 2019-01-24 18:18 ` Sinan Kaya
  2019-01-24 21:37   ` Keith Busch
  0 siblings, 1 reply; 13+ messages in thread
From: Sinan Kaya @ 2019-01-24 18:18 UTC (permalink / raw)
  To: Dongdong Liu, helgaas, keith.busch; +Cc: linux-pci, linuxarm, Bjorn Helgaas

On 1/24/2019 8:50 AM, Dongdong Liu wrote:
> The patch [1] PCI/ERR: Run error recovery callbacks for all affected
> devices have broken the non-fatal error handling logic in patch [2].
> For non-fatal error, link is reliable, so no need to reset link,
> handle non-fatal error for all subordinates seems incorrect.
> Restore the non-fatal errors process logic.
> 
> [1] PCI/ERR: Run error recovery callbacks for all affected devices   #4.20
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bfcb79fca19d267712e425af1dd48812c40dec0c
> 
> [2] PCI/AER: Report non-fatal errors only to the affected endpoint  #4.15
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.0-rc2&id=86acc790717fb60fb51ea3095084e331d8711c74
> 
> Fixes: bfcb79fca19d ("PCI/ERR: Run error recovery callbacks for all affected devices")
> Reported-by: Xiaofei Tan<tanxiaofei@huawei.com>
> Signed-off-by: Dongdong Liu<liudongdong3@huawei.com>
> Cc: Keith Busch<keith.busch@intel.com>
> Cc: Bjorn Helgaas<bhelgaas@google.com>


According to what I see in the code, link will be reset only if the AER
severity is AER_FATAL.

	} else if (info->severity == AER_NONFATAL)
		pcie_do_recovery(dev, pci_channel_io_normal,
				 PCIE_PORT_SERVICE_AER);
	else if (info->severity == AER_FATAL)
		pcie_do_recovery(dev, pci_channel_io_frozen,
				 PCIE_PORT_SERVICE_AER);

Can you show the path where it leads to link reset and severity is AER_NONFATAL?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices
  2019-01-24 18:18 ` Sinan Kaya
@ 2019-01-24 21:37   ` Keith Busch
  2019-01-25 14:28     ` Dongdong Liu
  0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2019-01-24 21:37 UTC (permalink / raw)
  To: Sinan Kaya; +Cc: Dongdong Liu, helgaas, linux-pci, linuxarm, Bjorn Helgaas

On Thu, Jan 24, 2019 at 10:18:26AM -0800, Sinan Kaya wrote:
> On 1/24/2019 8:50 AM, Dongdong Liu wrote:
> > The patch [1] PCI/ERR: Run error recovery callbacks for all affected
> > devices have broken the non-fatal error handling logic in patch [2].
> > For non-fatal error, link is reliable, so no need to reset link,
> > handle non-fatal error for all subordinates seems incorrect.
> > Restore the non-fatal errors process logic.
> > 
> > [1] PCI/ERR: Run error recovery callbacks for all affected devices   #4.20
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bfcb79fca19d267712e425af1dd48812c40dec0c
> > 
> > [2] PCI/AER: Report non-fatal errors only to the affected endpoint  #4.15
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.0-rc2&id=86acc790717fb60fb51ea3095084e331d8711c74
> > 
> > Fixes: bfcb79fca19d ("PCI/ERR: Run error recovery callbacks for all affected devices")
> > Reported-by: Xiaofei Tan<tanxiaofei@huawei.com>
> > Signed-off-by: Dongdong Liu<liudongdong3@huawei.com>
> > Cc: Keith Busch<keith.busch@intel.com>
> > Cc: Bjorn Helgaas<bhelgaas@google.com>
> 
> 
> According to what I see in the code, link will be reset only if the AER
> severity is AER_FATAL.
> 
> 	} else if (info->severity == AER_NONFATAL)
> 		pcie_do_recovery(dev, pci_channel_io_normal,
> 				 PCIE_PORT_SERVICE_AER);
> 	else if (info->severity == AER_FATAL)
> 		pcie_do_recovery(dev, pci_channel_io_frozen,
> 				 PCIE_PORT_SERVICE_AER);
> 
> Can you show the path where it leads to link reset and severity is AER_NONFATAL?

Yes, I don't follow what this patch is saying either, and it actually
looks quite broken: it assigns 'bridge' to 'dev', which may not even be a
bridge, and then dereferences 'bridge->subordinate' which be NULL.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices
  2019-01-24 21:37   ` Keith Busch
@ 2019-01-25 14:28     ` Dongdong Liu
  2019-01-25 17:09       ` Sinan Kaya
  2019-01-25 17:17       ` Keith Busch
  0 siblings, 2 replies; 13+ messages in thread
From: Dongdong Liu @ 2019-01-25 14:28 UTC (permalink / raw)
  To: Keith Busch, Sinan Kaya
  Cc: helgaas, linux-pci, linuxarm, Bjorn Helgaas, tanxiaofei

Hi Keith, Sinan

Many thanks for your review.
在 2019/1/25 5:37, Keith Busch 写道:
> On Thu, Jan 24, 2019 at 10:18:26AM -0800, Sinan Kaya wrote:
>> On 1/24/2019 8:50 AM, Dongdong Liu wrote:
>>> The patch [1] PCI/ERR: Run error recovery callbacks for all affected
>>> devices have broken the non-fatal error handling logic in patch [2].
>>> For non-fatal error, link is reliable, so no need to reset link,
>>> handle non-fatal error for all subordinates seems incorrect.
>>> Restore the non-fatal errors process logic.
>>>
>>> [1] PCI/ERR: Run error recovery callbacks for all affected devices   #4.20
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bfcb79fca19d267712e425af1dd48812c40dec0c
>>>
>>> [2] PCI/AER: Report non-fatal errors only to the affected endpoint  #4.15
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v5.0-rc2&id=86acc790717fb60fb51ea3095084e331d8711c74
>>>
>>> Fixes: bfcb79fca19d ("PCI/ERR: Run error recovery callbacks for all affected devices")
>>> Reported-by: Xiaofei Tan<tanxiaofei@huawei.com>
>>> Signed-off-by: Dongdong Liu<liudongdong3@huawei.com>
>>> Cc: Keith Busch<keith.busch@intel.com>
>>> Cc: Bjorn Helgaas<bhelgaas@google.com>
>>
>>
>> According to what I see in the code, link will be reset only if the AER
>> severity is AER_FATAL.
>>
>> 	} else if (info->severity == AER_NONFATAL)
>> 		pcie_do_recovery(dev, pci_channel_io_normal,
>> 				 PCIE_PORT_SERVICE_AER);
>> 	else if (info->severity == AER_FATAL)
>> 		pcie_do_recovery(dev, pci_channel_io_frozen,
>> 				 PCIE_PORT_SERVICE_AER);
>>
>> Can you show the path where it leads to link reset and severity is AER_NONFATAL?
>
> Yes, I don't follow what this patch is saying either, and it actually
> looks quite broken: it assigns 'bridge' to 'dev', which may not even be a
> bridge, and then dereferences 'bridge->subordinate' which be NULL.
>

I want to fix 2 points by the patch.

1. For EP devices (such as multi-function EP device) under the same bus,
when one of the EP devices met non-fatal error, should report non-fatal
error only to the error endpoint device, no need to broadcast all of them.
That is the patch (PCI/AER: Report non-fatal errors only to the affected endpoint  #4.15)
have done, but current code PATCH [1] broken this.

2. We found a NULL pointer dereference issue for 74:02.0 device
when the device met a non-fatal error (firmware-first) after 4.19.
The topology is as below.

   +-[0000:74]-+-02.0  Huawei Technologies Co., Ltd. HiSilicon SAS 3.0 HBA
   |           \-03.0  Huawei Technologies Co., Ltd. HiSilicon AHCI HBA

The callstack is as below
[ 1564.489537] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000018
[ 1564.588821] random: crng init done
[ 1564.643295] Mem abort info:
[ 1564.789341] ESR = 0x96000004
[ 1564.798182] Exception class = DABT (current EL), IL = 32 bits
[ 1564.812802] SET = 0, FnV = 0
[ 1564.821677] EA = 0, S1PTW = 0
[ 1564.830729] Data abort info:
[ 1564.839257] ISV = 0, ISS = 0x00000004
[ 1564.848744] CM = 0, WnR = 0
[ 1564.854670] [0000000000000018] user address but active_mm is swapper
[ 1564.867392] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 1564.878540] Modules linked in:
[ 1564.886637] CPU: 0 PID: 13 Comm: kworker/0:1 Tainted: G W 5.0.0-rc1-28680-g7558ecf-dirty #753
[ 1564.906319] Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI RC0 - V1.10.01 12/29/2018
[ 1564.923217] Workqueue: events aer_recover_work_func
[ 1564.932973] pstate: 80c00009 (Nzcv daif +PAN +UAO)
[ 1564.942555] pc : pcie_do_recovery+0x58/0x244
[ 1564.951091] lr : aer_recover_work_func+0xd4/0x108
[ 1564.960497] sp : ffff000011e23d00
[ 1564.967117] x29: ffff000011e23d00 x28: ffff000011d9bcd8
[ 1564.977742] x27: ffff802fbb9df438 x26: ffff802fb74030cc
[ 1564.988368] x25: 0000000000000000 x24: 0000000000000074
[ 1564.998994] x23: 0000000000000002 x22: ffff000011e23d54
[ 1565.009620] x21: ffff0000114a6090 x20: ffff0000117e5000
[ 1565.020245] x19: 0000000000000000 x18: 000000000000000e
[ 1565.030870] x17: 0000000000000001 x16: 0000000000000000
[ 1565.041496] x15: 0000000000000000 x14: 3d746e6567615f72
[ 1565.052122] x13: 6561202c72657961 x12: 674238a07d44a100
[ 1565.062747] x11: ffffffffffffffff x10: 0000000000000006
[ 1565.073373] x9 : 0000000000000810 x8 : ffff0000117e5b48
[ 1565.083999] x7 : 636e755f72656120 x6 : ffff00001067a680
[ 1565.094624] x5 : 0000000000000000 x4 : 0000000000000000
[ 1565.105249] x3 : 674238a07d44a100 x2 : 0000000000000002
[ 1565.115875] x1 : 0000000000000001 x0 : ffff8a2fbb860000
[ 1565.126501] Process kworker/0:1 (pid: 13, stack limit = 0x00000000d6c8d0e9)
[ 1565.140435] Call trace:
[ 1565.145315] pcie_do_recovery+0x58/0x244
[ 1565.153154] aer_recover_work_func+0xd4/0x108
[ 1565.161866] process_one_work+0x14c/0x3ec
[ 1565.169880] worker_thread+0x130/0x3e4
[ 1565.177371] kthread+0x100/0x12c
[ 1565.183817] ret_from_fork+0x10/0x18
[ 1565.190960] Code: f9400a60 f9401c13 7100083f 910153b6 (f9400e75)
[ 1565.203154] ---[ end trace 980b44e0229f8807 ]---

Thanks,
Dongdong
> .
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices
  2019-01-25 14:28     ` Dongdong Liu
@ 2019-01-25 17:09       ` Sinan Kaya
  2019-01-28 14:05         ` Dongdong Liu
  2019-01-25 17:17       ` Keith Busch
  1 sibling, 1 reply; 13+ messages in thread
From: Sinan Kaya @ 2019-01-25 17:09 UTC (permalink / raw)
  To: Dongdong Liu, Keith Busch
  Cc: helgaas, linux-pci, linuxarm, Bjorn Helgaas, tanxiaofei

On 1/25/2019 9:28 AM, Dongdong Liu wrote:
> I want to fix 2 points by the patch.
> 
> 1. For EP devices (such as multi-function EP device) under the same bus,
> when one of the EP devices met non-fatal error, should report non-fatal
> error only to the error endpoint device, no need to broadcast all of them.
> That is the patch (PCI/AER: Report non-fatal errors only to the affected endpoint  #4.15) 
> 
> have done, but current code PATCH [1] broken this.
> 
> 2. We found a NULL pointer dereference issue for 74:02.0 device
> when the device met a non-fatal error (firmware-first) after 4.19.
> The topology is as below.

Is it possible to split the patches into two to address these two different
issues?

I can understand the first one but second one seems some mystery that needs
to be explored in detail.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices
  2019-01-25 14:28     ` Dongdong Liu
  2019-01-25 17:09       ` Sinan Kaya
@ 2019-01-25 17:17       ` Keith Busch
  2019-01-25 17:37         ` Sinan Kaya
  1 sibling, 1 reply; 13+ messages in thread
From: Keith Busch @ 2019-01-25 17:17 UTC (permalink / raw)
  To: Dongdong Liu
  Cc: Sinan Kaya, helgaas, linux-pci, linuxarm, Bjorn Helgaas, tanxiaofei

On Fri, Jan 25, 2019 at 06:28:03AM -0800, Dongdong Liu wrote:
> I want to fix 2 points by the patch.
> 
> 1. For EP devices (such as multi-function EP device) under the same bus,
> when one of the EP devices met non-fatal error, should report non-fatal
> error only to the error endpoint device, no need to broadcast all of them.
> That is the patch (PCI/AER: Report non-fatal errors only to the affected endpoint  #4.15)
> have done, but current code PATCH [1] broken this.

How do you know a non-fatal affects only the reporting end point? These can
certainly be bus errors, and it's not the first to detect may be affected.

In any case, what harm does the broadcast cause?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices
  2019-01-25 17:17       ` Keith Busch
@ 2019-01-25 17:37         ` Sinan Kaya
  2019-01-25 17:46           ` Keith Busch
  2019-01-25 17:46           ` Sinan Kaya
  0 siblings, 2 replies; 13+ messages in thread
From: Sinan Kaya @ 2019-01-25 17:37 UTC (permalink / raw)
  To: Keith Busch, Dongdong Liu
  Cc: helgaas, linux-pci, linuxarm, Bjorn Helgaas, tanxiaofei

On 1/25/2019 12:17 PM, Keith Busch wrote:
> On Fri, Jan 25, 2019 at 06:28:03AM -0800, Dongdong Liu wrote:
>> I want to fix 2 points by the patch.
>>
>> 1. For EP devices (such as multi-function EP device) under the same bus,
>> when one of the EP devices met non-fatal error, should report non-fatal
>> error only to the error endpoint device, no need to broadcast all of them.
>> That is the patch (PCI/AER: Report non-fatal errors only to the affected endpoint  #4.15)
>> have done, but current code PATCH [1] broken this.
> 
> How do you know a non-fatal affects only the reporting end point? These can
> certainly be bus errors, and it's not the first to detect may be affected.
> 
> In any case, what harm does the broadcast cause?
> 

What is the PCIe spec rule about AER errors for multi-function devices?

Does it say it needs to be propagated to all functions or each function has
its own unique AER error handler?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices
  2019-01-25 17:37         ` Sinan Kaya
@ 2019-01-25 17:46           ` Keith Busch
  2019-01-25 17:46           ` Sinan Kaya
  1 sibling, 0 replies; 13+ messages in thread
From: Keith Busch @ 2019-01-25 17:46 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Dongdong Liu, helgaas, linux-pci, linuxarm, Bjorn Helgaas, tanxiaofei

On Fri, Jan 25, 2019 at 12:37:14PM -0500, Sinan Kaya wrote:
> On 1/25/2019 12:17 PM, Keith Busch wrote:
> > On Fri, Jan 25, 2019 at 06:28:03AM -0800, Dongdong Liu wrote:
> > > I want to fix 2 points by the patch.
> > > 
> > > 1. For EP devices (such as multi-function EP device) under the same bus,
> > > when one of the EP devices met non-fatal error, should report non-fatal
> > > error only to the error endpoint device, no need to broadcast all of them.
> > > That is the patch (PCI/AER: Report non-fatal errors only to the affected endpoint  #4.15)
> > > have done, but current code PATCH [1] broken this.
> > 
> > How do you know a non-fatal affects only the reporting end point? These can
> > certainly be bus errors, and it's not the first to detect may be affected.
> > 
> > In any case, what harm does the broadcast cause?
> > 
> 
> What is the PCIe spec rule about AER errors for multi-function devices?

6.2.4 lists the errors that are not function specific (it's nearly 
all them).
 
> Does it say it needs to be propagated to all functions or each function has
> its own unique AER error handler?

The spec goes on to say only one function should send the error
message, but "Software is responsible for scanning all Functions in a
Multi-Function Device when it detects one of those errors."

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices
  2019-01-25 17:37         ` Sinan Kaya
  2019-01-25 17:46           ` Keith Busch
@ 2019-01-25 17:46           ` Sinan Kaya
  2019-01-28 14:54             ` Dongdong Liu
  1 sibling, 1 reply; 13+ messages in thread
From: Sinan Kaya @ 2019-01-25 17:46 UTC (permalink / raw)
  To: Keith Busch, Dongdong Liu
  Cc: helgaas, linux-pci, linuxarm, Bjorn Helgaas, tanxiaofei

On 1/25/2019 12:37 PM, Sinan Kaya wrote:
> On 1/25/2019 12:17 PM, Keith Busch wrote:
>> On Fri, Jan 25, 2019 at 06:28:03AM -0800, Dongdong Liu wrote:
>>> I want to fix 2 points by the patch.
>>>
>>> 1. For EP devices (such as multi-function EP device) under the same bus,
>>> when one of the EP devices met non-fatal error, should report non-fatal
>>> error only to the error endpoint device, no need to broadcast all of them.
>>> That is the patch (PCI/AER: Report non-fatal errors only to the affected 
>>> endpoint  #4.15)
>>> have done, but current code PATCH [1] broken this.
>>
>> How do you know a non-fatal affects only the reporting end point? These can
>> certainly be bus errors, and it's not the first to detect may be affected.
>>
>> In any case, what harm does the broadcast cause?
>>
> 
> What is the PCIe spec rule about AER errors for multi-function devices?
> 
> Does it say it needs to be propagated to all functions or each function has
> its own unique AER error handler?
> 

Thinking more...

I think there is value in probing all devices for errors like today because
multiple errors bit can be set. Since root port's AER register only captures the
first error, the rest of the errors requires OS to poll each device to see what
is going on.

In this case the AER error status of other functions should not report any
outstanding event. Please verify this. Otherwise, you are looking at a device quirk.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices
  2019-01-25 17:09       ` Sinan Kaya
@ 2019-01-28 14:05         ` Dongdong Liu
  0 siblings, 0 replies; 13+ messages in thread
From: Dongdong Liu @ 2019-01-28 14:05 UTC (permalink / raw)
  To: Sinan Kaya, Keith Busch
  Cc: helgaas, linux-pci, linuxarm, Bjorn Helgaas, tanxiaofei

Hi Sinan
在 2019/1/26 1:09, Sinan Kaya 写道:
> On 1/25/2019 9:28 AM, Dongdong Liu wrote:
>> I want to fix 2 points by the patch.
>>
>> 1. For EP devices (such as multi-function EP device) under the same bus,
>> when one of the EP devices met non-fatal error, should report non-fatal
>> error only to the error endpoint device, no need to broadcast all of them.
>> That is the patch (PCI/AER: Report non-fatal errors only to the affected endpoint  #4.15)
>> have done, but current code PATCH [1] broken this.
>>
>> 2. We found a NULL pointer dereference issue for 74:02.0 device
>> when the device met a non-fatal error (firmware-first) after 4.19.
>> The topology is as below.
>
> Is it possible to split the patches into two to address these two different
> issues?

Good suggestion, will do.

>
> I can understand the first one but second one seems some mystery that needs
> to be explored in detail.
>

  +-[0000:74]-+-02.0  Huawei Technologies Co., Ltd. HiSilicon SAS 3.0 HBA
  |           \-03.0  Huawei Technologies Co., Ltd. HiSilicon AHCI HBA

74:02.0 is a RCiEP, but do not under a root port.
Current code have the issue as the below code, see [DD].
aer_recover_work_func()
     pcie_do_recovery() {

	/*
	 * Error recovery runs on all subordinates of the first downstream port.
	 * If the downstream port detected the error, it is cleared at the end.
	 */
	if (!(pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT ||
	      pci_pcie_type(dev) == PCI_EXP_TYPE_DOWNSTREAM))
		dev = dev->bus->self;   //[DD]: Here dev will be NULL pointer for 74:02.0(RCiEP)as it does not have a root port.
	bus = dev->subordinate;
  }

We do not have the  NULL pointer dereference issue
before the patch [1] PCI/ERR: Run error recovery callbacks for all affected devices   #4.20 .

Thanks,
Dongdong
> .
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices
  2019-01-25 17:46           ` Sinan Kaya
@ 2019-01-28 14:54             ` Dongdong Liu
  2019-01-28 15:47               ` Sinan Kaya
  0 siblings, 1 reply; 13+ messages in thread
From: Dongdong Liu @ 2019-01-28 14:54 UTC (permalink / raw)
  To: Sinan Kaya, Keith Busch
  Cc: helgaas, linux-pci, linuxarm, Bjorn Helgaas, tanxiaofei



在 2019/1/26 1:46, Sinan Kaya 写道:
> On 1/25/2019 12:37 PM, Sinan Kaya wrote:
>> On 1/25/2019 12:17 PM, Keith Busch wrote:
>>> On Fri, Jan 25, 2019 at 06:28:03AM -0800, Dongdong Liu wrote:
>>>> I want to fix 2 points by the patch.
>>>>
>>>> 1. For EP devices (such as multi-function EP device) under the same bus,
>>>> when one of the EP devices met non-fatal error, should report non-fatal
>>>> error only to the error endpoint device, no need to broadcast all of them.
>>>> That is the patch (PCI/AER: Report non-fatal errors only to the affected endpoint  #4.15)
>>>> have done, but current code PATCH [1] broken this.
>>>
>>> How do you know a non-fatal affects only the reporting end point? These can
>>> certainly be bus errors, and it's not the first to detect may be affected.
>>>
>>> In any case, what harm does the broadcast cause?
>>>
>>
>> What is the PCIe spec rule about AER errors for multi-function devices?
>>
>> Does it say it needs to be propagated to all functions or each function has
>> its own unique AER error handler?
>>
>
> Thinking more...
>
> I think there is value in probing all devices for errors like today because
> multiple errors bit can be set. Since root port's AER register only captures the
> first error, the rest of the errors requires OS to poll each device to see what
> is going on.
>
> In this case the AER error status of other functions should not report any
> outstanding event. Please verify this. Otherwise, you are looking at a device quirk.

Agree, multiple errors bit can be set, AER driver or firmware will collect the error
devices and call pcie_do_recovery() for every error devices.

also have different PFs (device numbers are different) under the same bus.
This case do not need to brodcast all the devices under the same bus.

Thanks
Dongdong
>
>
> .
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices
  2019-01-28 14:54             ` Dongdong Liu
@ 2019-01-28 15:47               ` Sinan Kaya
  2019-01-28 16:15                 ` Sinan Kaya
  0 siblings, 1 reply; 13+ messages in thread
From: Sinan Kaya @ 2019-01-28 15:47 UTC (permalink / raw)
  To: Dongdong Liu, Keith Busch
  Cc: helgaas, linux-pci, linuxarm, Bjorn Helgaas, tanxiaofei

On 1/28/2019 9:54 AM, Dongdong Liu wrote:
>> In this case the AER error status of other functions should not report any
>> outstanding event. Please verify this. Otherwise, you are looking at a device quirk. 
>>
> 
> Agree, multiple errors bit can be set, AER driver or firmware will collect the error 
> 
> devices and call pcie_do_recovery() for every error devices.

pcie_do_recovery() is also being called for non-firmware first scenario where OS
has no prior knowledge of which devices are in fault.

> 
> also have different PFs (device numbers are different) under the same bus.
> This case do not need to brodcast all the devices under the same bus.

Even if OS was to call pcie_do_recovery() for other devices, nothing should
happen because the expectation for other devices' AER status register to report
no errors. This is the case we want you to validate. If this is not true, you
are looking at a firmware/HW bug.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices
  2019-01-28 15:47               ` Sinan Kaya
@ 2019-01-28 16:15                 ` Sinan Kaya
  0 siblings, 0 replies; 13+ messages in thread
From: Sinan Kaya @ 2019-01-28 16:15 UTC (permalink / raw)
  To: Dongdong Liu, Keith Busch
  Cc: helgaas, linux-pci, linuxarm, Bjorn Helgaas, tanxiaofei

On 1/28/2019 10:47 AM, Sinan Kaya wrote:
>> also have different PFs (device numbers are different) under the same bus.
>> This case do not need to brodcast all the devices under the same bus.
> 
> Even if OS was to call pcie_do_recovery() for other devices, nothing should
> happen because the expectation for other devices' AER status register to report
> no errors. This is the case we want you to validate. If this is not true, you
> are looking at a firmware/HW bug.

Slight clarification about the statement above. Based on quick code scan,
pcie_do_recovery() should only be called if there is an outstanding AER error
for the first AER device in non-FF (non-firmware-first) scenario.

Whereas it gets called per AER device on FF scenario and this is where the
conflict is.

I think your argument is that "if there is a non-fatal error on the parent,
should it be broadcasted to all children devices in pcie_do_recovery()?"

IMO, I think the answer is yes only if the children's AER status reports a
problem and I don't see this check in the code.

Code needs to be refactored to follow a similar pattern to FF scenario and
pci_walk_bus() calls should be removed from the NON-FATAL path.


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-01-28 17:45 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-24 13:50 [PATCH] PCI/ERR: Fix run error recovery callbacks for all affected devices Dongdong Liu
2019-01-24 18:18 ` Sinan Kaya
2019-01-24 21:37   ` Keith Busch
2019-01-25 14:28     ` Dongdong Liu
2019-01-25 17:09       ` Sinan Kaya
2019-01-28 14:05         ` Dongdong Liu
2019-01-25 17:17       ` Keith Busch
2019-01-25 17:37         ` Sinan Kaya
2019-01-25 17:46           ` Keith Busch
2019-01-25 17:46           ` Sinan Kaya
2019-01-28 14:54             ` Dongdong Liu
2019-01-28 15:47               ` Sinan Kaya
2019-01-28 16:15                 ` Sinan Kaya

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.