[PATCH v6 0/1] PCI: Mask and unmask hotplug interrupts during reset

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v6 0/1] PCI: Mask and unmask hotplug interrupts during reset
@ 2018-07-28  9:13 Sinan Kaya
  2018-07-28  9:13 ` [PATCH v6 1/1] PCI: pciehp: Ignore link events when there is a fatal error pending Sinan Kaya
  0 siblings, 1 reply; 8+ messages in thread
From: Sinan Kaya @ 2018-07-28  9:13 UTC (permalink / raw)
  To: linux-pci; +Cc: Sinan Kaya

If a bridge supports hotplug and observes a PCIe fatal error, the following
events happen:

1. AER driver removes the devices from PCI tree on fatal error
2. AER driver brings down the link by issuing a secondary bus reset waits
for the link to come up.
3. Hotplug driver observes a link down interrupt
4. Hotplug driver tries to remove the devices waiting for the rescan lock
but devices are already removed by the AER driver and AER driver is waiting
for the link to come back up.
5. AER driver tries to re-enumerate devices after polling for the link
state to go up.
6. Hotplug driver obtains the lock and tries to remove the devices again.

Ignore link events caused by fatal error in hotplug driver. Note that
surprise link down fatal error is a hotplug event.

Sinan Kaya (1):
  PCI: pciehp: Ignore link events when there is a fatal error pending

 drivers/pci/hotplug/pciehp_hpc.c | 13 ++++++++----
 drivers/pci/pci.h                |  1 +
 drivers/pci/pcie/err.c           | 35 ++++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+), 4 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v6 1/1] PCI: pciehp: Ignore link events when there is a fatal error pending
  2018-07-28  9:13 [PATCH v6 0/1] PCI: Mask and unmask hotplug interrupts during reset Sinan Kaya
@ 2018-07-28  9:13 ` Sinan Kaya
  2018-07-29 11:57   ` Lukas Wunner
  0 siblings, 1 reply; 8+ messages in thread
From: Sinan Kaya @ 2018-07-28  9:13 UTC (permalink / raw)
  To: linux-pci
  Cc: Sinan Kaya, Bjorn Helgaas, Mika Westerberg, Keith Busch,
	Oza Pawandeep, Markus Elfring, Lukas Wunner, Kees Cook

We need to figure out how to gracefully return inside hotplug driver
if link down happened and there is an error pending. Fatal error needs
to be serviced by AER/DPC drivers. Hotplug driver is observing link
events while AER/DPC drivers are performing link recovery today and
are causing confusion for the hotplug statemachine.

1. check if there is a fatal error pending in the device_status register
of the PCI Express capability on the root port.
2. bail out from hotplug routine if this is the case.
3. otherwise, existing behavior.

If fatal error is pending and a fatal error service such as DPC or AER
is running, it is the responsibility of the fatal error service to
recover the link.

Signed-off-by: Sinan Kaya <okaya@kernel.org>
---
 drivers/pci/hotplug/pciehp_hpc.c | 13 ++++++++----
 drivers/pci/pci.h                |  1 +
 drivers/pci/pcie/err.c           | 35 ++++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 718b6073afad..776566ab7583 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -612,10 +612,15 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
 	 * and cause the wrong event to queue.
 	 */
 	if (events & PCI_EXP_SLTSTA_DLLSC) {
-		ctrl_info(ctrl, "Slot(%s): Link %s\n", slot_name(slot),
-			  link ? "Up" : "Down");
-		pciehp_queue_interrupt_event(slot, link ? INT_LINK_UP :
-					     INT_LINK_DOWN);
+		if (pci_fatal_error_pending(pdev, PCI_ERR_UNC_SURPDN))
+			ctrl_info(ctrl, "Slot(%s): Ignoring Link %s event due to detected Fatal Error\n",
+				  slot_name(slot), link ? "Up" : "Down");
+		else {
+			ctrl_info(ctrl, "Slot(%s): Link %s\n", slot_name(slot),
+				  link ? "Up" : "Down");
+			pciehp_queue_interrupt_event(slot, link ? INT_LINK_UP :
+						     INT_LINK_DOWN);
+		}
 	} else if (events & PCI_EXP_SLTSTA_PDC) {
 		present = !!(status & PCI_EXP_SLTSTA_PDS);
 		ctrl_info(ctrl, "Slot(%s): Card %spresent\n", slot_name(slot),
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 882f1f9596df..7494b2c0c5ff 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -360,6 +360,7 @@ void pci_enable_acs(struct pci_dev *dev);
 /* PCI error reporting and recovery */
 void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service);
 void pcie_do_nonfatal_recovery(struct pci_dev *dev);
+bool pci_fatal_error_pending(struct pci_dev *pdev, u32 usr_mask);
 
 bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
 #ifdef CONFIG_PCIEASPM
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index f7ce0cb0b0b7..316b2d2750b9 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -386,3 +386,38 @@ void pcie_do_nonfatal_recovery(struct pci_dev *dev)
 	/* TODO: Should kernel panic here? */
 	pci_info(dev, "AER: Device recovery failed\n");
 }
+
+bool pci_fatal_error_pending(struct pci_dev *pdev, u32 usr_mask)
+{
+	u16 err_status = 0;
+	u32 status, mask;
+	int rc;
+
+	if (!pci_is_pcie(pdev))
+		return false;
+
+	rc = pcie_capability_read_word(pdev, PCI_EXP_DEVSTA, &err_status);
+	if (rc)
+		return false;
+
+	if (!(err_status & PCI_EXP_DEVSTA_FED))
+		return false;
+
+	if (!pdev->aer_cap)
+		return false;
+
+	rc = pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_UNCOR_STATUS,
+				   &status);
+	if (rc)
+		return false;
+
+	rc = pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_UNCOR_MASK,
+				   &mask);
+	if (rc)
+		return false;
+
+	status &= mask;
+	status &= ~usr_mask;
+
+	return !!status;
+}
-- 
2.17.1

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH v6 1/1] PCI: pciehp: Ignore link events when there is a fatal error pending
  2018-07-28  9:13 ` [PATCH v6 1/1] PCI: pciehp: Ignore link events when there is a fatal error pending Sinan Kaya
@ 2018-07-29 11:57   ` Lukas Wunner
  2018-07-29 16:44     ` Sinan Kaya
  0 siblings, 1 reply; 8+ messages in thread
From: Lukas Wunner @ 2018-07-29 11:57 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Keith Busch, Oza Pawandeep

On Sat, Jul 28, 2018 at 02:13:24AM -0700, Sinan Kaya wrote:
> We need to figure out how to gracefully return inside hotplug driver
> if link down happened and there is an error pending. Fatal error needs
> to be serviced by AER/DPC drivers. Hotplug driver is observing link
> events while AER/DPC drivers are performing link recovery today and
> are causing confusion for the hotplug statemachine.
> 
> 1. check if there is a fatal error pending in the device_status register
> of the PCI Express capability on the root port.
> 2. bail out from hotplug routine if this is the case.
> 3. otherwise, existing behavior.
> 
> If fatal error is pending and a fatal error service such as DPC or AER
> is running, it is the responsibility of the fatal error service to
> recover the link.
[snip]
> --- a/drivers/pci/hotplug/pciehp_hpc.c
> +++ b/drivers/pci/hotplug/pciehp_hpc.c
> @@ -612,10 +612,15 @@ static irqreturn_t pciehp_isr(int irq, void *dev_id)
>  	 * and cause the wrong event to queue.
>  	 */
>  	if (events & PCI_EXP_SLTSTA_DLLSC) {
> -		ctrl_info(ctrl, "Slot(%s): Link %s\n", slot_name(slot),
> -			  link ? "Up" : "Down");
> -		pciehp_queue_interrupt_event(slot, link ? INT_LINK_UP :
> -					     INT_LINK_DOWN);
> +		if (pci_fatal_error_pending(pdev, PCI_ERR_UNC_SURPDN))
> +			ctrl_info(ctrl, "Slot(%s): Ignoring Link %s event due to detected Fatal Error\n",
> +				  slot_name(slot), link ? "Up" : "Down");
> +		else {
> +			ctrl_info(ctrl, "Slot(%s): Link %s\n", slot_name(slot),
> +				  link ? "Up" : "Down");
> +			pciehp_queue_interrupt_event(slot, link ? INT_LINK_UP :
> +						     INT_LINK_DOWN);
> +		}

The above is still based on the event handling in v4.18, so the patch
doesn't apply to what's currently queued on Bjorn's pci/hotplug branch.

That said, a problem I see with the approach you're suggesting is that
recovery from the fatal error may fail in pcie_do_fatal_recovery().
The devices below the hotplug port will then have been removed but
internally in pciehp the slot will remain in ON_STATE and slot power
will remain enabled.  That feels wrong.

After re-reading all the e-mails we've exchanged since early July,
the approach Oza suggested in this e-mail seems the most sensible to me:
https://lkml.kernel.org/r/324f8cf2fe6f7bdc43ca8a646eea908d@codeaurora.org

He suggested skipping removal of devices in pcie_do_fatal_recovery()
for hotplug ports.

The rationale is that when a PCIe port raises a fatal error, that port
will normally not have the ability to remove its children from the
hierarchy unless it's a hotplug port.  Thus, if the port is a hotplug
port, pcie_do_fatal_recovery() should let pciehp handle removal of the
devices.  Only if it's not a hotplug port should pcie_do_fatal_recovery()
remove the devices.  My understanding is that after the error has been
cleared, the link should automatically come back up, is that correct?
pciehp will then bring up the slot on its own.

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v6 1/1] PCI: pciehp: Ignore link events when there is a fatal error pending
  2018-07-29 11:57   ` Lukas Wunner
@ 2018-07-29 16:44     ` Sinan Kaya
  2018-07-29 17:39       ` Lukas Wunner
  0 siblings, 1 reply; 8+ messages in thread
From: Sinan Kaya @ 2018-07-29 16:44 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Keith Busch, Oza Pawandeep

On 7/29/18, Lukas Wunner <lukas@wunner.de> wrote:
> On Sat, Jul 28, 2018 at 02:13:24AM -0700, Sinan Kaya wrote:
>> We need to figure out how to gracefully return inside hotplug driver
>> if link down happened and there is an error pending. Fatal error needs
>> to be serviced by AER/DPC drivers. Hotplug driver is observing link
>> events while AER/DPC drivers are performing link recovery today and
>> are causing confusion for the hotplug statemachine.
>>
>> 1. check if there is a fatal error pending in the device_status register
>> of the PCI Express capability on the root port.
>> 2. bail out from hotplug routine if this is the case.
>> 3. otherwise, existing behavior.
>>
>> If fatal error is pending and a fatal error service such as DPC or AER
>> is running, it is the responsibility of the fatal error service to
>> recover the link.
> [snip]
>> --- a/drivers/pci/hotplug/pciehp_hpc.c
>> +++ b/drivers/pci/hotplug/pciehp_hpc.c
>> @@ -612,10 +612,15 @@ static irqreturn_t pciehp_isr(int irq, void
>> *dev_id)
>>  	 * and cause the wrong event to queue.
>>  	 */
>>  	if (events & PCI_EXP_SLTSTA_DLLSC) {
>> -		ctrl_info(ctrl, "Slot(%s): Link %s\n", slot_name(slot),
>> -			  link ? "Up" : "Down");
>> -		pciehp_queue_interrupt_event(slot, link ? INT_LINK_UP :
>> -					     INT_LINK_DOWN);
>> +		if (pci_fatal_error_pending(pdev, PCI_ERR_UNC_SURPDN))
>> +			ctrl_info(ctrl, "Slot(%s): Ignoring Link %s event due to detected
>> Fatal Error\n",
>> +				  slot_name(slot), link ? "Up" : "Down");
>> +		else {
>> +			ctrl_info(ctrl, "Slot(%s): Link %s\n", slot_name(slot),
>> +				  link ? "Up" : "Down");
>> +			pciehp_queue_interrupt_event(slot, link ? INT_LINK_UP :
>> +						     INT_LINK_DOWN);
>> +		}
>
> The above is still based on the event handling in v4.18, so the patch
> doesn't apply to what's currently queued on Bjorn's pci/hotplug branch.
>
> That said, a problem I see with the approach you're suggesting is that
> recovery from the fatal error may fail in pcie_do_fatal_recovery().
> The devices below the hotplug port will then have been removed but
> internally in pciehp the slot will remain in ON_STATE and slot power
> will remain enabled.  That feels wrong.
>
> After re-reading all the e-mails we've exchanged since early July,
> the approach Oza suggested in this e-mail seems the most sensible to me:
> https://lkml.kernel.org/r/324f8cf2fe6f7bdc43ca8a646eea908d@codeaurora.org
>
> He suggested skipping removal of devices in pcie_do_fatal_recovery()
> for hotplug ports.
>
> The rationale is that when a PCIe port raises a fatal error, that port
> will normally not have the ability to remove its children from the
> hierarchy unless it's a hotplug port.  Thus, if the port is a hotplug
> port, pcie_do_fatal_recovery() should let pciehp handle removal of the
> devices.  Only if it's not a hotplug port should pcie_do_fatal_recovery()
> remove the devices.  My understanding is that after the error has been
> cleared, the link should automatically come back up, is that correct?
> pciehp will then bring up the slot on its own.
>

Sending to the list. If anybody knows how to send text email from
gmail via mobile app, I am interested to hear from you. Webmail
sucks...

I understand your slot power concerns. I don't have an answer at this moment.

Here is the issue with mixing pciehp link handler and fatal error
recovery paths.

1. fatal error happens
2. Pciehp ISR executes and turns off slot power
3. Fatal error recovery executes either secondary bus reset or dpc
status trigger via reset_link() function to recover the link.
4. Link doesn't recover because power is off.


> Thanks,
>
> Lukas
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v6 1/1] PCI: pciehp: Ignore link events when there is a fatal error pending
  2018-07-29 16:44     ` Sinan Kaya
@ 2018-07-29 17:39       ` Lukas Wunner
  2018-07-29 18:30         ` Sinan Kaya
  0 siblings, 1 reply; 8+ messages in thread
From: Lukas Wunner @ 2018-07-29 17:39 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Keith Busch, Oza Pawandeep

On Sun, Jul 29, 2018 at 09:44:50AM -0700, Sinan Kaya wrote:
> On 7/29/18, Lukas Wunner <lukas@wunner.de> wrote:
> > After re-reading all the e-mails we've exchanged since early July,
> > the approach Oza suggested in this e-mail seems the most sensible to me:
> > https://lkml.kernel.org/r/324f8cf2fe6f7bdc43ca8a646eea908d@codeaurora.org
> >
> > He suggested skipping removal of devices in pcie_do_fatal_recovery()
> > for hotplug ports.
> >
> > The rationale is that when a PCIe port raises a fatal error, that port
> > will normally not have the ability to remove its children from the
> > hierarchy unless it's a hotplug port.  Thus, if the port is a hotplug
> > port, pcie_do_fatal_recovery() should let pciehp handle removal of the
> > devices.  Only if it's not a hotplug port should pcie_do_fatal_recovery()
> > remove the devices.  My understanding is that after the error has been
> > cleared, the link should automatically come back up, is that correct?
> > pciehp will then bring up the slot on its own.
> >
> 
> I understand your slot power concerns. I don't have an answer at this moment.
> 
> Here is the issue with mixing pciehp link handler and fatal error
> recovery paths.
> 
> 1. fatal error happens
> 2. Pciehp ISR executes and turns off slot power
> 3. Fatal error recovery executes either secondary bus reset or dpc
>    status trigger via reset_link() function to recover the link.
> 4. Link doesn't recover because power is off.

With the new code on the pci/hotplug branch, the expected behavior is:

4. Having brought the slot down after the DLLSC event,
   pciehp_handle_presence_or_link_change() notices that the slot is
   occupied ("Slot(...): Card present") and immediately tries to
   bring the slot up again:
   Slot power is re-enabled, green LED is set to blinking.
   pciehp_check_link_status() then waits 1 second for the link to
   go up.  If the link does not come up within 1 second, an error
   message is logged but the procedure does not abort yet.  We then
   wait for 100 ms and poll for 1 second whether the vendor ID of
   slot 0 function 0 becomes readable.  Only if the link has not come
   up at this point is the procedure aborted.

So the question is:

a. Is DPC/AER able to recover from the error if slot power is disabled?
   Or do we need to synchronize with pciehp to keep slot power enabled
   so that the error can be handled?

b. How fast is DPC/AER able to recover from a fatal error?
   If it's more than 2.1 seconds, pciehp will not bring the slot up
   automatically and the user is required to bring it up manually via
   sysfs.  If we know that DPC/AER need a specific amount of time that
   is longer than 2.1 seconds, we can amend board_added() to poll until
   recovery from the fatal error (or use a waitqueue which is woken on
   recovery), with a sufficient timeout.

Does this work for you?

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v6 1/1] PCI: pciehp: Ignore link events when there is a fatal error pending
  2018-07-29 17:39       ` Lukas Wunner
@ 2018-07-29 18:30         ` Sinan Kaya
  2018-07-29 19:07           ` Lukas Wunner
  0 siblings, 1 reply; 8+ messages in thread
From: Sinan Kaya @ 2018-07-29 18:30 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Keith Busch, Oza Pawandeep

On 7/29/2018 10:39 AM, Lukas Wunner wrote:
>> ere is the issue with mixing pciehp link handler and fatal error
>> recovery paths.
>>

Let me put all use cases to the table:

use case A

1. DPC fatal error happens. Link goes down in HW by the DPC.
2. Pciehp ISR executes first and turns off slot power
3. DPC Fatal error recovery executes dpc
     status trigger via reset_link() function to recover the link.
4. Link doesn't recover because power is off.

use case B
1. AER fatal error happens. Link is up.
2. AER ISR executes first.
3. AER Fatal error recovery executes secondary bus reset aka warm
reset via AER error recovery in reset_link() to recovery the link
4. Link goes down.
5. Pciehp ISR executes and turns off slot power
6. AER link recovery fails because power is off.

use case C
1. AER fatal error happens. Link goes down after observing a fatal
error before observing ISR.
2. Pciehp ISR executes and turns off slot power
3. AER ISR executes.
4. AER Fatal error recovery executes secondary bus reset aka warm
reset via AER error recovery in reset_link() to recovery the link
5. AER link recovery fails because power is off.

 > With the new code on the pci/hotplug branch, the expected behavior is:
 >
> 4. Having brought the slot down after the DLLSC event,
>     pciehp_handle_presence_or_link_change() notices that the slot is
>     occupied ("Slot(...): Card present") and immediately tries to
>     bring the slot up again:
>     Slot power is re-enabled, green LED is set to blinking.
>     pciehp_check_link_status() then waits 1 second for the link to
>     go up.  If the link does not come up within 1 second, an error
>     message is logged but the procedure does not abort yet.  We then
>     wait for 100 ms and poll for 1 second whether the vendor ID of
>     slot 0 function 0 becomes readable.  Only if the link has not come
>     up at this point is the procedure aborted.

Thanks for the heads up. I didn't realize HP was trying to recover the
link as well.

This is the situation that I'm trying to avoid. AER and DPC resets are 
known as warm-resets.

HP link recovery is known as cold-reset via power-off and power-on command.

In the middle of a warm-reset operation(AER/DPC), we are:
1. turning off the slow power
2. performing a cold reset
causing Fatal Error recovery to fail.

> 
> So the question is:
> 
> a. Is DPC/AER able to recover from the error if slot power is disabled?
>     Or do we need to synchronize with pciehp to keep slot power enabled
>     so that the error can be handled?

Yes, slot power needs to be kept on.

> 
> b. How fast is DPC/AER able to recover from a fatal error?
>     If it's more than 2.1 seconds, pciehp will not bring the slot up
>     automatically and the user is required to bring it up manually via
>     sysfs.  If we know that DPC/AER need a specific amount of time that
>     is longer than 2.1 seconds, we can amend board_added() to poll until
>     recovery from the fatal error (or use a waitqueue which is woken on
>     recovery), with a sufficient timeout.

pciehp shouldn't attempt recovery.

If link goes down due to a DPC event, it should be recovered by DPC 
status trigger. Injecting a cold reset in the middle can cause a HW
lockup as it is an undefined behavior.

Similarly, If link goes down due to an AER secondary bus reset issue, it 
should be recovered by HW. Injecting a cold reset in the middle of a
secondary bus reset can cause a HW lockup as it is an undefined behavior.

> .
> Does this work for you?

Not much, but along the lines of your thinking, I can propose this
as an alternative.

Maybe, this helps:

1. HP ISR observes link down interrupt.
2. HP ISR checks that there is a fatal error pending, it doesn't touch
the link.
3. HP ISR waits until link recovery happens.
4. HP ISR calls the read vendor id function.

DPC link recovery is very quick (100ms at most). Secondary bus reset
recovery should be contained within 1 seconds for most cases but
spec allows a device to extend vendor id read as much as it wants via
CRS response. We poll up to an additional 60 seconds in read vendor
id function.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v6 1/1] PCI: pciehp: Ignore link events when there is a fatal error pending
  2018-07-29 18:30         ` Sinan Kaya
@ 2018-07-29 19:07           ` Lukas Wunner
  2018-07-29 19:21             ` Sinan Kaya
  0 siblings, 1 reply; 8+ messages in thread
From: Lukas Wunner @ 2018-07-29 19:07 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Keith Busch, Oza Pawandeep

On Sun, Jul 29, 2018 at 11:30:09AM -0700, Sinan Kaya wrote:
> Yes, slot power needs to be kept on.
> 
> pciehp shouldn't attempt recovery.
> 
> If link goes down due to a DPC event, it should be recovered by DPC status
> trigger. Injecting a cold reset in the middle can cause a HW
> lockup as it is an undefined behavior.
> 
> Similarly, If link goes down due to an AER secondary bus reset issue, it
> should be recovered by HW. Injecting a cold reset in the middle of a
> secondary bus reset can cause a HW lockup as it is an undefined behavior.

Thanks a lot for the explanation, understood now.

> Maybe, this helps:
> 
> 1. HP ISR observes link down interrupt.
> 2. HP ISR checks that there is a fatal error pending, it doesn't touch
> the link.
> 3. HP ISR waits until link recovery happens.
> 4. HP ISR calls the read vendor id function.
> 
> DPC link recovery is very quick (100ms at most). Secondary bus reset
> recovery should be contained within 1 seconds for most cases but
> spec allows a device to extend vendor id read as much as it wants via
> CRS response. We poll up to an additional 60 seconds in read vendor
> id function.

Yes, that proposal makes a lot of sense to me.  This should also work
regardless whether pciehp or DPC/AER react first to the Link Down.
Could you rebase your patch on the current pci/hotplug branch
and insert the procedure you've outlined above at the top of
pciehp_handle_presence_or_link_change() in pciehp_ctrl.c,
or put it in a helper that's called at the top of that function.

Your patch "[PATCH v6 1/1] PCI: pciehp: Ignore link events when there
is a fatal error pending" only checks once for a pending fatal error,
it should poll until either the fatal error is gone or a timeout is
hit.  If the fatal error is gone and the link is up, you can just return
from pciehp_handle_presence_or_link_change().  Else (in the timeout case)
fall back to the normal handling of a Link Down, i.e. let it bring down
the slot.

Please add a code comment in pciehp_handle_presence_or_link_change()
along the lines of

	/* If a fatal error is pending, wait for AER or DPC to handle it. */

The information in your e-mail that a cold reset would incorrectly
interfere with error recovery is a crucial piece of information that
should be included at least in the commit message.  (I was unaware
of that.)

If you have any further questions on pciehp, ask away.

Thanks!

Lukas

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v6 1/1] PCI: pciehp: Ignore link events when there is a fatal error pending
  2018-07-29 19:07           ` Lukas Wunner
@ 2018-07-29 19:21             ` Sinan Kaya
  0 siblings, 0 replies; 8+ messages in thread
From: Sinan Kaya @ 2018-07-29 19:21 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: linux-pci, Bjorn Helgaas, Mika Westerberg, Keith Busch, Oza Pawandeep

On 7/29/2018 12:07 PM, Lukas Wunner wrote:
> On Sun, Jul 29, 2018 at 11:30:09AM -0700, Sinan Kaya wrote:
>> Yes, slot power needs to be kept on.
>>
>> pciehp shouldn't attempt recovery.
>>
>> If link goes down due to a DPC event, it should be recovered by DPC status
>> trigger. Injecting a cold reset in the middle can cause a HW
>> lockup as it is an undefined behavior.
>>
>> Similarly, If link goes down due to an AER secondary bus reset issue, it
>> should be recovered by HW. Injecting a cold reset in the middle of a
>> secondary bus reset can cause a HW lockup as it is an undefined behavior.
> 
> Thanks a lot for the explanation, understood now.
> 
> 
>> Maybe, this helps:
>>
>> 1. HP ISR observes link down interrupt.
>> 2. HP ISR checks that there is a fatal error pending, it doesn't touch
>> the link.
>> 3. HP ISR waits until link recovery happens.
>> 4. HP ISR calls the read vendor id function.
>>
>> DPC link recovery is very quick (100ms at most). Secondary bus reset
>> recovery should be contained within 1 seconds for most cases but
>> spec allows a device to extend vendor id read as much as it wants via
>> CRS response. We poll up to an additional 60 seconds in read vendor
>> id function.
> 
> Yes, that proposal makes a lot of sense to me.  This should also work
> regardless whether pciehp or DPC/AER react first to the Link Down.
> Could you rebase your patch on the current pci/hotplug branch
> and insert the procedure you've outlined above at the top of
> pciehp_handle_presence_or_link_change() in pciehp_ctrl.c,
> or put it in a helper that's called at the top of that function.
> 
> Your patch "[PATCH v6 1/1] PCI: pciehp: Ignore link events when there
> is a fatal error pending" only checks once for a pending fatal error,
> it should poll until either the fatal error is gone or a timeout is
> hit.  If the fatal error is gone and the link is up, you can just return
> from pciehp_handle_presence_or_link_change().  Else (in the timeout case)
> fall back to the normal handling of a Link Down, i.e. let it bring down
> the slot.
> 
> Please add a code comment in pciehp_handle_presence_or_link_change()
> along the lines of
> 
> 	/* If a fatal error is pending, wait for AER or DPC to handle it. */
> 
> The information in your e-mail that a cold reset would incorrectly
> interfere with error recovery is a crucial piece of information that
> should be included at least in the commit message.  (I was unaware
> of that.)
> 
> If you have any further questions on pciehp, ask away.
>

Great, let me take one more iteration on this after rebasing to the
new branch.


> Thanks!
> 
> Lukas
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-07-29 20:52 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-28  9:13 [PATCH v6 0/1] PCI: Mask and unmask hotplug interrupts during reset Sinan Kaya
2018-07-28  9:13 ` [PATCH v6 1/1] PCI: pciehp: Ignore link events when there is a fatal error pending Sinan Kaya
2018-07-29 11:57   ` Lukas Wunner
2018-07-29 16:44     ` Sinan Kaya
2018-07-29 17:39       ` Lukas Wunner
2018-07-29 18:30         ` Sinan Kaya
2018-07-29 19:07           ` Lukas Wunner
2018-07-29 19:21             ` Sinan Kaya

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.