* [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
@ 2024-02-09  5:02 Nilay Shroff
  2024-02-22 21:00 ` Greg Joyce
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Nilay Shroff @ 2024-02-09  5:02 UTC (permalink / raw)
  To: kbusch, axboe, hch, sagi; +Cc: linux-nvme, linux-block, gjoyce, nilay

If the nvme subsystem reset causes the loss of communication to the nvme
adapter then EEH could potentially recover the adapter. The detection of
communication loss to the adapter only happens when the nvme driver
attempts to read an MMIO register.

The nvme subsystem reset command writes 0x4E564D65 to the NSSR register and
schedules an adapter reset. If the nvme subsystem reset caused the loss
of communication to the nvme adapter then either an IO timeout event or
the adapter reset handler could detect it. If the IO timeout event detects
the loss of communication then the EEH handler is able to recover the
communication to the adapter. This change was implemented in 651438bb0af5
(nvme-pci: Fix EEH failure on ppc). However, if the adapter communication
loss is detected in the nvme reset work handler then EEH is unable to
successfully finish the adapter recovery.

This patch ensures that,
- the nvme driver reset handler observes that the pci channel is offline
  after a failed MMIO read and avoids marking the controller state DEAD,
  thus giving the EEH handler a fair chance to recover the nvme adapter.

- if the nvme controller is already in RESETTING state and a pci channel
  frozen error is detected then the nvme driver pci-error-handler code sends
  the correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler
  so that the EEH handler can proceed with the pci slot reset.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>

[  131.415601] EEH: Recovering PHB#40-PE#10000
[  131.415619] EEH: PE location: N/A, PHB location: N/A
[  131.415623] EEH: Frozen PHB#40-PE#10000 detected
[  131.415627] EEH: Call Trace:
[  131.415629] EEH: [c000000000051078] __eeh_send_failure_event+0x7c/0x15c
[  131.415782] EEH: [c000000000049bdc] eeh_dev_check_failure.part.0+0x27c/0x6b0
[  131.415789] EEH: [c000000000cb665c] nvme_pci_reg_read32+0x78/0x9c
[  131.415802] EEH: [c000000000ca07f8] nvme_wait_ready+0xa8/0x18c
[  131.415814] EEH: [c000000000cb7070] nvme_dev_disable+0x368/0x40c
[  131.415823] EEH: [c000000000cb9970] nvme_reset_work+0x198/0x348
[  131.415830] EEH: [c00000000017b76c] process_one_work+0x1f0/0x4f4
[  131.415841] EEH: [c00000000017be2c] worker_thread+0x3bc/0x590
[  131.415846] EEH: [c00000000018a46c] kthread+0x138/0x140
[  131.415854] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18
[  131.415864] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[  131.415874] EEH: Notify device drivers to shutdown
[  131.415882] EEH: Beginning: 'error_detected(IO frozen)'
[  131.415888] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
[  131.415891] nvme nvme1: frozen state error detected, reset controller
[  131.515358] nvme 0040:01:00.0: enabling device (0000 -> 0002)
[  131.515778] nvme nvme1: Disabling device after reset failure: -19
[  131.555336] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect'
[  131.555343] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'disconnect'
[  131.555371] EEH: Unable to recover from failure from PHB#40-PE#10000.
[  131.555371] Please try reseating or replacing it
[  131.556296] EEH: of node=0040:01:00.0
[  131.556351] EEH: PCI device/vendor: 00251e0f
[  131.556421] EEH: PCI cmd/status register: 00100142
[  131.556428] EEH: PCI-E capabilities and status follow:
[  131.556678] EEH: PCI-E 00: 0002b010 10008fe3 00002910 00436044
[  131.556859] EEH: PCI-E 10: 10440000 00000000 00000000 00000000
[  131.556869] EEH: PCI-E 20: 00000000
[  131.556875] EEH: PCI-E AER capability register set follows:
[  131.557115] EEH: PCI-E AER 00: 14820001 00000000 00400000 00462030
[  131.557294] EEH: PCI-E AER 10: 00000000 0000e000 000002a0 00000000
[  131.557469] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[  131.557523] EEH: PCI-E AER 30: 00000000 00000000
[  131.558807] EEH: Beginning: 'error_detected(permanent failure)'
[  131.558815] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected(permanent failure)
[  131.558818] nvme nvme1: failure state error detected, request disconnect
[  131.558839] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect'
---
 drivers/nvme/host/pci.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c1d6357ec98a..a6ba46e727ba 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work)
  out_unlock:
 	mutex_unlock(&dev->shutdown_lock);
  out:
+	/*
+	 * If PCI recovery is ongoing then let it finish first
+	 */
+	if (pci_channel_offline(to_pci_dev(dev->dev))) {
+		dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
+		return;
+	}
+
 	/*
 	 * Set state to deleting now to avoid blocking nvme_wait_reset(), which
 	 * may be holding this pci_dev's device lock.
@@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
 	case pci_channel_io_frozen:
 		dev_warn(dev->ctrl.device,
 			"frozen state error detected, reset controller\n");
-		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
-			nvme_dev_disable(dev, true);
-			return PCI_ERS_RESULT_DISCONNECT;
+		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
+			if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+				nvme_dev_disable(dev, true);
+				return PCI_ERS_RESULT_DISCONNECT;
+			}
 		}
 		nvme_dev_disable(dev, false);
 		return PCI_ERS_RESULT_NEED_RESET;
-- 
2.43.0



* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-02-09  5:02 [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset Nilay Shroff
@ 2024-02-22 21:00 ` Greg Joyce
       [not found] ` <2c76725c-7bb6-4827-b45a-dbe1acbefba7@imap.linux.ibm.com>
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: Greg Joyce @ 2024-02-22 21:00 UTC (permalink / raw)
  To: Nilay Shroff, kbusch, axboe, hch, sagi, hare, dwagner, Wendy Xiong
  Cc: linux-nvme, linux-block

On Fri, 2024-02-09 at 10:32 +0530, Nilay Shroff wrote:
> If the nvme subsystem reset causes the loss of communication to the nvme
> adapter then EEH could potentially recover the adapter. The detection of
> communication loss to the adapter only happens when the nvme driver
> attempts to read an MMIO register.
> 
> The nvme subsystem reset command writes 0x4E564D65 to the NSSR register and
> schedules an adapter reset. If the nvme subsystem reset caused the loss
> of communication to the nvme adapter then either an IO timeout event or
> the adapter reset handler could detect it. If the IO timeout event detects
> the loss of communication then the EEH handler is able to recover the
> communication to the adapter. This change was implemented in 651438bb0af5
> (nvme-pci: Fix EEH failure on ppc). However, if the adapter communication
> loss is detected in the nvme reset work handler then EEH is unable to
> successfully finish the adapter recovery.
> 
> This patch ensures that,
> - the nvme driver reset handler observes that the pci channel is offline
>   after a failed MMIO read and avoids marking the controller state DEAD,
>   thus giving the EEH handler a fair chance to recover the nvme adapter.
> 
> - if the nvme controller is already in RESETTING state and a pci channel
>   frozen error is detected then the nvme driver pci-error-handler code sends
>   the correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler
>   so that the EEH handler can proceed with the pci slot reset.
> 
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>

Note that while in this case the issue was discovered via the Power EEH
handler, this is not a Power-specific issue. The problem and fix are
applicable to all architectures.
 

> 
> [  131.415601] EEH: Recovering PHB#40-PE#10000
> [  131.415619] EEH: PE location: N/A, PHB location: N/A
> [  131.415623] EEH: Frozen PHB#40-PE#10000 detected
> [  131.415627] EEH: Call Trace:
> [  131.415629] EEH: [c000000000051078] __eeh_send_failure_event+0x7c/0x15c
> [  131.415782] EEH: [c000000000049bdc] eeh_dev_check_failure.part.0+0x27c/0x6b0
> [  131.415789] EEH: [c000000000cb665c] nvme_pci_reg_read32+0x78/0x9c
> [  131.415802] EEH: [c000000000ca07f8] nvme_wait_ready+0xa8/0x18c
> [  131.415814] EEH: [c000000000cb7070] nvme_dev_disable+0x368/0x40c
> [  131.415823] EEH: [c000000000cb9970] nvme_reset_work+0x198/0x348
> [  131.415830] EEH: [c00000000017b76c] process_one_work+0x1f0/0x4f4
> [  131.415841] EEH: [c00000000017be2c] worker_thread+0x3bc/0x590
> [  131.415846] EEH: [c00000000018a46c] kthread+0x138/0x140
> [  131.415854] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18
> [  131.415864] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
> [  131.415874] EEH: Notify device drivers to shutdown
> [  131.415882] EEH: Beginning: 'error_detected(IO frozen)'
> [  131.415888] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
> [  131.415891] nvme nvme1: frozen state error detected, reset controller
> [  131.515358] nvme 0040:01:00.0: enabling device (0000 -> 0002)
> [  131.515778] nvme nvme1: Disabling device after reset failure: -19
> [  131.555336] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect'
> [  131.555343] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'disconnect'
> [  131.555371] EEH: Unable to recover from failure from PHB#40-PE#10000.
> [  131.555371] Please try reseating or replacing it
> [  131.556296] EEH: of node=0040:01:00.0
> [  131.556351] EEH: PCI device/vendor: 00251e0f
> [  131.556421] EEH: PCI cmd/status register: 00100142
> [  131.556428] EEH: PCI-E capabilities and status follow:
> [  131.556678] EEH: PCI-E 00: 0002b010 10008fe3 00002910 00436044
> [  131.556859] EEH: PCI-E 10: 10440000 00000000 00000000 00000000
> [  131.556869] EEH: PCI-E 20: 00000000
> [  131.556875] EEH: PCI-E AER capability register set follows:
> [  131.557115] EEH: PCI-E AER 00: 14820001 00000000 00400000 00462030
> [  131.557294] EEH: PCI-E AER 10: 00000000 0000e000 000002a0 00000000
> [  131.557469] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
> [  131.557523] EEH: PCI-E AER 30: 00000000 00000000
> [  131.558807] EEH: Beginning: 'error_detected(permanent failure)'
> [  131.558815] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected(permanent failure)
> [  131.558818] nvme nvme1: failure state error detected, request disconnect
> [  131.558839] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect'
> ---
>  drivers/nvme/host/pci.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index c1d6357ec98a..a6ba46e727ba 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work)
>   out_unlock:
>  	mutex_unlock(&dev->shutdown_lock);
>   out:
> +	/*
> +	 * If PCI recovery is ongoing then let it finish first
> +	 */
> +	if (pci_channel_offline(to_pci_dev(dev->dev))) {
> +		dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
> +		return;
> +	}
> +
>  	/*
>  	 * Set state to deleting now to avoid blocking nvme_wait_reset(), which
>  	 * may be holding this pci_dev's device lock.
> @@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>  	case pci_channel_io_frozen:
>  		dev_warn(dev->ctrl.device,
>  			"frozen state error detected, reset controller\n");
> -		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> -			nvme_dev_disable(dev, true);
> -			return PCI_ERS_RESULT_DISCONNECT;
> +		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
> +			if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> +				nvme_dev_disable(dev, true);
> +				return PCI_ERS_RESULT_DISCONNECT;
> +			}
>  		}
>  		nvme_dev_disable(dev, false);
>  		return PCI_ERS_RESULT_NEED_RESET;




* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
       [not found] ` <2c76725c-7bb6-4827-b45a-dbe1acbefba7@imap.linux.ibm.com>
@ 2024-02-27 18:14   ` Nilay Shroff
  0 siblings, 0 replies; 17+ messages in thread
From: Nilay Shroff @ 2024-02-27 18:14 UTC (permalink / raw)
  To: kbusch, axboe, hch, sagi
  Cc: linux-nvme, linux-block, gjoyce, Srimannarayana Murthy Maram

Hi,

A gentle ping...

Thanks,
--Nilay

On 2/16/24 18:07, Srimannarayana Murthy Maram wrote:
> Hi all,
> 
> Tested the patch with upstream kernel 6.8.0-rc4.
> 
> The issue was verified on IBM Power systems with a manual test case that has the following steps:
> 1. Note the nvme controllers for an nvme subsystem using
>     "nvme list-subsys"
> 2. Perform an nvme subsystem reset on each listed controller
>    under the nvme subsystem, one after the other, only after successful recovery.
> 
> Verified on a Power system with an NVMe device in normal and multipath (2 paths) configurations.
> 
> The provided patch successfully recovered the single controller (normal) and both controllers (multipath) listed under the nvme subsystem.
> 
> Tested-by: Maram Srimannarayana Murthy <msmurthy@linux.vnet.ibm.com>
> 
> Thank you,
> Maram Srimannarayana Murthy
> Sr. Test Engineer | IBM
> 
> On 2/9/24 10:32, Nilay Shroff wrote:
>> If the nvme subsystem reset causes the loss of communication to the nvme
>> adapter then EEH could potentially recover the adapter. The detection of
>> communication loss to the adapter only happens when the nvme driver
>> attempts to read an MMIO register.
>>
>> The nvme subsystem reset command writes 0x4E564D65 to the NSSR register and
>> schedules an adapter reset. If the nvme subsystem reset caused the loss
>> of communication to the nvme adapter then either an IO timeout event or
>> the adapter reset handler could detect it. If the IO timeout event detects
>> the loss of communication then the EEH handler is able to recover the
>> communication to the adapter. This change was implemented in 651438bb0af5
>> (nvme-pci: Fix EEH failure on ppc). However, if the adapter communication
>> loss is detected in the nvme reset work handler then EEH is unable to
>> successfully finish the adapter recovery.
>>
>> This patch ensures that,
>> - the nvme driver reset handler observes that the pci channel is offline
>>   after a failed MMIO read and avoids marking the controller state DEAD,
>>   thus giving the EEH handler a fair chance to recover the nvme adapter.
>>
>> - if the nvme controller is already in RESETTING state and a pci channel
>>   frozen error is detected then the nvme driver pci-error-handler code sends
>>   the correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler
>>   so that the EEH handler can proceed with the pci slot reset.
>>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>
>> [  131.415601] EEH: Recovering PHB#40-PE#10000
>> [  131.415619] EEH: PE location: N/A, PHB location: N/A
>> [  131.415623] EEH: Frozen PHB#40-PE#10000 detected
>> [  131.415627] EEH: Call Trace:
>> [  131.415629] EEH: [c000000000051078] __eeh_send_failure_event+0x7c/0x15c
>> [  131.415782] EEH: [c000000000049bdc] eeh_dev_check_failure.part.0+0x27c/0x6b0
>> [  131.415789] EEH: [c000000000cb665c] nvme_pci_reg_read32+0x78/0x9c
>> [  131.415802] EEH: [c000000000ca07f8] nvme_wait_ready+0xa8/0x18c
>> [  131.415814] EEH: [c000000000cb7070] nvme_dev_disable+0x368/0x40c
>> [  131.415823] EEH: [c000000000cb9970] nvme_reset_work+0x198/0x348
>> [  131.415830] EEH: [c00000000017b76c] process_one_work+0x1f0/0x4f4
>> [  131.415841] EEH: [c00000000017be2c] worker_thread+0x3bc/0x590
>> [  131.415846] EEH: [c00000000018a46c] kthread+0x138/0x140
>> [  131.415854] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18
>> [  131.415864] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
>> [  131.415874] EEH: Notify device drivers to shutdown
>> [  131.415882] EEH: Beginning: 'error_detected(IO frozen)'
>> [  131.415888] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
>> [  131.415891] nvme nvme1: frozen state error detected, reset controller
>> [  131.515358] nvme 0040:01:00.0: enabling device (0000 -> 0002)
>> [  131.515778] nvme nvme1: Disabling device after reset failure: -19
>> [  131.555336] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect'
>> [  131.555343] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'disconnect'
>> [  131.555371] EEH: Unable to recover from failure from PHB#40-PE#10000.
>> [  131.555371] Please try reseating or replacing it
>> [  131.556296] EEH: of node=0040:01:00.0
>> [  131.556351] EEH: PCI device/vendor: 00251e0f
>> [  131.556421] EEH: PCI cmd/status register: 00100142
>> [  131.556428] EEH: PCI-E capabilities and status follow:
>> [  131.556678] EEH: PCI-E 00: 0002b010 10008fe3 00002910 00436044
>> [  131.556859] EEH: PCI-E 10: 10440000 00000000 00000000 00000000
>> [  131.556869] EEH: PCI-E 20: 00000000
>> [  131.556875] EEH: PCI-E AER capability register set follows:
>> [  131.557115] EEH: PCI-E AER 00: 14820001 00000000 00400000 00462030
>> [  131.557294] EEH: PCI-E AER 10: 00000000 0000e000 000002a0 00000000
>> [  131.557469] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
>> [  131.557523] EEH: PCI-E AER 30: 00000000 00000000
>> [  131.558807] EEH: Beginning: 'error_detected(permanent failure)'
>> [  131.558815] PCI 0040:01:00.0#10000: EEH: Invoking nvme->error_detected(permanent failure)
>> [  131.558818] nvme nvme1: failure state error detected, request disconnect
>> [  131.558839] PCI 0040:01:00.0#10000: EEH: nvme driver reports: 'disconnect'
>> ---
>>   drivers/nvme/host/pci.c | 16 +++++++++++++---
>>   1 file changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index c1d6357ec98a..a6ba46e727ba 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work)
>>    out_unlock:
>>       mutex_unlock(&dev->shutdown_lock);
>>    out:
>> +    /*
>> +     * If PCI recovery is ongoing then let it finish first
>> +     */
>> +    if (pci_channel_offline(to_pci_dev(dev->dev))) {
>> +        dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
>> +        return;
>> +    }
>> +
>>       /*
>>        * Set state to deleting now to avoid blocking nvme_wait_reset(), which
>>        * may be holding this pci_dev's device lock.
>> @@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>>       case pci_channel_io_frozen:
>>           dev_warn(dev->ctrl.device,
>>               "frozen state error detected, reset controller\n");
>> -        if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>> -            nvme_dev_disable(dev, true);
>> -            return PCI_ERS_RESULT_DISCONNECT;
>> +        if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
>> +            if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>> +                nvme_dev_disable(dev, true);
>> +                return PCI_ERS_RESULT_DISCONNECT;
>> +            }
>>           }
>>           nvme_dev_disable(dev, false);
>>           return PCI_ERS_RESULT_NEED_RESET;


* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-02-09  5:02 [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset Nilay Shroff
  2024-02-22 21:00 ` Greg Joyce
       [not found] ` <2c76725c-7bb6-4827-b45a-dbe1acbefba7@imap.linux.ibm.com>
@ 2024-02-27 18:29 ` Keith Busch
  2024-02-28 11:19   ` Nilay Shroff
  2024-03-06 11:20   ` Nilay Shroff
  2024-03-08 15:41 ` Keith Busch
  3 siblings, 2 replies; 17+ messages in thread
From: Keith Busch @ 2024-02-27 18:29 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: axboe, hch, sagi, linux-nvme, linux-block, gjoyce

On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
> If the nvme subsystem reset causes the loss of communication to the nvme
> adapter then EEH could potentially recover the adapter. The detection of
> communication loss to the adapter only happens when the nvme driver
> attempts to read an MMIO register.
> 
> The nvme subsystem reset command writes 0x4E564D65 to the NSSR register and
> schedules an adapter reset. If the nvme subsystem reset caused the loss
> of communication to the nvme adapter then either an IO timeout event or
> the adapter reset handler could detect it. If the IO timeout event detects
> the loss of communication then the EEH handler is able to recover the
> communication to the adapter. This change was implemented in 651438bb0af5
> (nvme-pci: Fix EEH failure on ppc). However, if the adapter communication
> loss is detected in the nvme reset work handler then EEH is unable to
> successfully finish the adapter recovery.
> 
> This patch ensures that,
> - the nvme driver reset handler observes that the pci channel is offline
>   after a failed MMIO read and avoids marking the controller state DEAD,
>   thus giving the EEH handler a fair chance to recover the nvme adapter.
> 
> - if the nvme controller is already in RESETTING state and a pci channel
>   frozen error is detected then the nvme driver pci-error-handler code sends
>   the correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler
>   so that the EEH handler can proceed with the pci slot reset.

A subsystem reset takes the link down. I'm pretty sure the proper way to
recover from it requires pcie hotplug support.


* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-02-27 18:29 ` Keith Busch
@ 2024-02-28 11:19   ` Nilay Shroff
  2024-02-29 12:27     ` Nilay Shroff
  2024-03-06 11:20   ` Nilay Shroff
  1 sibling, 1 reply; 17+ messages in thread
From: Nilay Shroff @ 2024-02-28 11:19 UTC (permalink / raw)
  To: Keith Busch; +Cc: axboe, hch, sagi, linux-nvme, linux-block, gjoyce



On 2/27/24 23:59, Keith Busch wrote:
> On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
>> If the nvme subsystem reset causes the loss of communication to the nvme
>> adapter then EEH could potentially recover the adapter. The detection of
>> communication loss to the adapter only happens when the nvme driver
>> attempts to read an MMIO register.
>>
>> The nvme subsystem reset command writes 0x4E564D65 to the NSSR register and
>> schedules an adapter reset. If the nvme subsystem reset caused the loss
>> of communication to the nvme adapter then either an IO timeout event or
>> the adapter reset handler could detect it. If the IO timeout event detects
>> the loss of communication then the EEH handler is able to recover the
>> communication to the adapter. This change was implemented in 651438bb0af5
>> (nvme-pci: Fix EEH failure on ppc). However, if the adapter communication
>> loss is detected in the nvme reset work handler then EEH is unable to
>> successfully finish the adapter recovery.
>>
>> This patch ensures that,
>> - the nvme driver reset handler observes that the pci channel is offline
>>   after a failed MMIO read and avoids marking the controller state DEAD,
>>   thus giving the EEH handler a fair chance to recover the nvme adapter.
>>
>> - if the nvme controller is already in RESETTING state and a pci channel
>>   frozen error is detected then the nvme driver pci-error-handler code sends
>>   the correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler
>>   so that the EEH handler can proceed with the pci slot reset.
> 
> A subsystem reset takes the link down. I'm pretty sure the proper way to
> recover from it requires pcie hotplug support.
> 
Yes you're correct. We require pcie hotplugging to recover. However, the powerpc EEH
handler is able to recover the pcie adapter without physically removing and
re-inserting the adapter, or in other words, it can reset the adapter without
hotplug activity. In fact, powerpc EEH can isolate the pcie slot and reset it
(i.e. reset the PCI device by holding the PCI #RST line high for two seconds),
followed by setting up the device config space (the base address registers
(BARs), latency timer, cache line size, interrupt line, and so on).

You may find more information about EEH recovery here: 
https://www.kernel.org/doc/Documentation/powerpc/eeh-pci-error-recovery.txt

Typically, when a pcie error is detected and EEH is able to recover the device,
the EEH handler code goes through the below sequence (assuming the driver is EEH aware):

eeh_handle_normal_event()
  eeh_set_channel_state()-> set state to pci_channel_io_frozen 
     eeh_report_error() 
       nvme_error_detected() -> channel state "pci_channel_io_frozen"; returns PCI_ERS_RESULT_NEED_RESET
         eeh_slot_reset() -> recovery successful 
           nvme_slot_reset() -> returns PCI_ERS_RESULT_RECOVERED
             eeh_set_channel_state()-> set state to pci_channel_io_normal
               nvme_error_resume()

In case a pcie error is detected and EEH is unable to recover the device,
the EEH handler code goes through the below sequence:

eeh_handle_normal_event()
  eeh_set_channel_state()-> set state to pci_channel_io_frozen
    eeh_report_error()
      nvme_error_detected() -> channel state pci_channel_io_frozen; returns PCI_ERS_RESULT_NEED_RESET
        eeh_slot_reset() -> recovery failed
          eeh_report_failure()
            nvme_error_detected()-> channel state pci_channel_io_perm_failure; returns PCI_ERS_RESULT_DISCONNECT
              eeh_set_channel_state()-> set state to pci_channel_io_perm_failure
                nvme_remove()
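
For reference, the driver participates in this generic recovery flow through the
standard PCI error-recovery callbacks. A minimal sketch of the wiring, simplified
from drivers/nvme/host/pci.c (the exact callback set may differ between kernel
versions):

static const struct pci_error_handlers nvme_err_handler = {
        .error_detected = nvme_error_detected, /* called from eeh_report_error()/eeh_report_failure() */
        .slot_reset     = nvme_slot_reset,     /* called from eeh_slot_reset() after the PE reset */
        .resume         = nvme_error_resume,   /* called once recovery has succeeded */
};

static struct pci_driver nvme_driver = {
        .name           = "nvme",
        /* .id_table, .probe, .remove, .shutdown, ... */
        .err_handler    = &nvme_err_handler,
};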

                                           
If we execute the command "nvme subsystem-reset ..." and adapter communication is
lost, then in the current code (under nvme_reset_work()) we simply disable the device
and mark the controller DEAD. However, we may have a chance to recover the controller
if the driver is EEH aware and EEH recovery is underway. We already handle one such
case in nvme_timeout(). So this patch ensures that if we fall through nvme_reset_work()
post subsystem-reset and the EEH recovery is in progress, then we give the EEH
mechanism a chance to recover the adapter. If the EEH recovery is unsuccessful, then
we'd anyway fall through the code path I mentioned above, where we invoke nvme_remove()
at the end and delete the erring controller.

With the proposed patch, we find that EEH recovery is successful post subsystem-reset. 
Please find below the relevant output:
# lspci 
0524:28:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CM7 2.5" (rev 01)

# nvme list-subsys
nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
               iopolicy=numa
\
 +- nvme0 pcie 0524:28:00.0 live

# nvme subsystem-reset /dev/nvme0

# nvme list-subsys
nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
               iopolicy=numa
\
 +- nvme0 pcie 0524:28:00.0 resetting

[10556.034082] EEH: Recovering PHB#524-PE#280000
[10556.034108] EEH: PE location: N/A, PHB location: N/A
[10556.034112] EEH: Frozen PHB#524-PE#280000 detected
[10556.034115] EEH: Call Trace:
[10556.034117] EEH: [c000000000051068] __eeh_send_failure_event+0x7c/0x15c
[10556.034304] EEH: [c000000000049bcc] eeh_dev_check_failure.part.0+0x27c/0x6b0
[10556.034310] EEH: [c008000004753d3c] nvme_pci_reg_read32+0x80/0xac [nvme]
[10556.034319] EEH: [c0080000045f365c] nvme_wait_ready+0xa4/0x18c [nvme_core]
[10556.034333] EEH: [c008000004754750] nvme_dev_disable+0x370/0x41c [nvme]
[10556.034338] EEH: [c008000004757184] nvme_reset_work+0x1f4/0x3cc [nvme]
[10556.034344] EEH: [c00000000017bb8c] process_one_work+0x1f0/0x4f4
[10556.034350] EEH: [c00000000017c24c] worker_thread+0x3bc/0x590
[10556.034355] EEH: [c00000000018a87c] kthread+0x138/0x140
[10556.034358] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18
[10556.034363] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[10556.034368] EEH: Notify device drivers to shutdown
[10556.034371] EEH: Beginning: 'error_detected(IO frozen)'
[10556.034376] PCI 0524:28:00.0#280000: EEH: Invoking nvme->error_detected(IO frozen)
[10556.034379] nvme nvme0: frozen state error detected, reset controller
[10556.102654] nvme 0524:28:00.0: enabling device (0000 -> 0002)
[10556.103171] nvme nvme0: PCI recovery is ongoing so let it finish
[10556.142532] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'need reset'
[10556.142535] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[...]
[...]
[10556.148172] EEH: Reset without hotplug activity
[10558.298672] EEH: Beginning: 'slot_reset'
[10558.298692] PCI 0524:28:00.0#280000: EEH: Invoking nvme->slot_reset()
[10558.298696] nvme nvme0: restart after slot reset
[10558.301925] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'recovered'
[10558.301928] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
[10558.301939] EEH: Notify device driver to resume
[10558.301944] EEH: Beginning: 'resume'
[10558.301947] PCI 0524:28:00.0#280000: EEH: Invoking nvme->resume()
[10558.331051] nvme nvme0: Shutdown timeout set to 10 seconds
[10558.356679] nvme nvme0: 16/0/0 default/read/poll queues
[10558.357026] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'none'
[10558.357028] EEH: Finished:'resume'
[10558.357035] EEH: Recovery successful.

# nvme list-subsys
nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
               iopolicy=numa
\
 +- nvme0 pcie 0524:28:00.0 live


Thanks,
--Nilay


* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-02-28 11:19   ` Nilay Shroff
@ 2024-02-29 12:27     ` Nilay Shroff
  0 siblings, 0 replies; 17+ messages in thread
From: Nilay Shroff @ 2024-02-29 12:27 UTC (permalink / raw)
  To: Keith Busch
  Cc: axboe, hch, sagi, linux-nvme, linux-block, gjoyce,
	Srimannarayana Murthy Maram


Hi Keith,

On 2/28/24 16:49, Nilay Shroff wrote:
> 
> 
> On 2/27/24 23:59, Keith Busch wrote:
>> On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
>>> If the nvme subsystem reset causes the loss of communication to the nvme
>>> adapter then EEH could potentially recover the adapter. The detection of
>>> communication loss to the adapter only happens when the nvme driver
>>> attempts to read an MMIO register.
>>>
>>> The nvme subsystem reset command writes 0x4E564D65 to the NSSR register and
>>> schedules an adapter reset. If the nvme subsystem reset caused the loss
>>> of communication to the nvme adapter then either an IO timeout event or
>>> the adapter reset handler could detect it. If the IO timeout event detects
>>> the loss of communication then the EEH handler is able to recover the
>>> communication to the adapter. This change was implemented in 651438bb0af5
>>> (nvme-pci: Fix EEH failure on ppc). However, if the adapter communication
>>> loss is detected in the nvme reset work handler then EEH is unable to
>>> successfully finish the adapter recovery.
>>>
>>> This patch ensures that,
>>> - the nvme driver reset handler observes that the pci channel is offline
>>>   after a failed MMIO read and avoids marking the controller state DEAD,
>>>   thus giving the EEH handler a fair chance to recover the nvme adapter.
>>>
>>> - if the nvme controller is already in RESETTING state and a pci channel
>>>   frozen error is detected then the nvme driver pci-error-handler code sends
>>>   the correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler
>>>   so that the EEH handler can proceed with the pci slot reset.
>>
>> A subsystem reset takes the link down. I'm pretty sure the proper way to
>> recover from it requires pcie hotplug support.
>>
> Yes you're correct. We require pcie hotplugging to recover. However, the powerpc EEH
> handler is able to recover the pcie adapter without physically removing and
> re-inserting the adapter, or in other words, it can reset the adapter without
> hotplug activity. In fact, powerpc EEH can isolate the pcie slot and reset it
> (i.e. reset the PCI device by holding the PCI #RST line high for two seconds),
> followed by setting up the device config space (the base address registers
> (BARs), latency timer, cache line size, interrupt line, and so on).
> 
> You may find more information about EEH recovery here: 
> https://www.kernel.org/doc/Documentation/powerpc/eeh-pci-error-recovery.txt
> 
> Typically, when a pcie error is detected and EEH is able to recover the device,
> the EEH handler code goes through the below sequence (assuming the driver is EEH aware):
> 
> eeh_handle_normal_event()
>   eeh_set_channel_state()-> set state to pci_channel_io_frozen 
>      eeh_report_error() 
>        nvme_error_detected() -> channel state "pci_channel_io_frozen"; returns PCI_ERS_RESULT_NEED_RESET
>          eeh_slot_reset() -> recovery successful 
>            nvme_slot_reset() -> returns PCI_ERS_RESULT_RECOVERED
>              eeh_set_channel_state()-> set state to pci_channel_io_normal
>                nvme_error_resume()
> 
> In case a pcie error is detected and EEH is unable to recover the device,
> the EEH handler code goes through the below sequence:
> 
> eeh_handle_normal_event()
>   eeh_set_channel_state()-> set state to pci_channel_io_frozen
>     eeh_report_error()
>       nvme_error_detected() -> channel state pci_channel_io_frozen; returns PCI_ERS_RESULT_NEED_RESET
>         eeh_slot_reset() -> recovery failed
>           eeh_report_failure()
>             nvme_error_detected()-> channel state pci_channel_io_perm_failure; returns PCI_ERS_RESULT_DISCONNECT
>               eeh_set_channel_state()-> set state to pci_channel_io_perm_failure
>                 nvme_remove()
> 
>                                            
> If we execute the command "nvme subsystem-reset ..." and adapter communication is
> lost, then in the current code (under nvme_reset_work()) we simply disable the device
> and mark the controller DEAD. However, we may have a chance to recover the controller
> if the driver is EEH aware and EEH recovery is underway. We already handle one such
> case in nvme_timeout(). So this patch ensures that if we fall through nvme_reset_work()
> post subsystem-reset and the EEH recovery is in progress, then we give the EEH
> mechanism a chance to recover the adapter. If the EEH recovery is unsuccessful, then
> we'd anyway fall through the code path I mentioned above, where we invoke nvme_remove()
> at the end and delete the erring controller.
> 

BTW, a similar issue was earlier fixed in 651438bb0af5 (nvme-pci: Fix EEH failure on ppc).
That fix was needed because the controller health check polling was removed in
b2a0eb1a0ac72869 (nvme-pci: Remove watchdog timer). In fact, today we may be able to recover
the NVMe adapter if a subsystem-reset or any other PCI error occurs while some
I/O request is in flight. The recovery is possible because the in-flight I/O request would
eventually time out, and nvme_timeout() has special code (added in 651438bb0af5) that
gives EEH a chance to recover the adapter.
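
For context, the check that 651438bb0af5 added to nvme_timeout() is roughly the
following (paraphrased; the surrounding code may have shifted around since then):

        /*
         * If PCI error recovery is already in progress, resetting the
         * controller from the timeout handler would break that recovery,
         * so extend the timeout and let EEH/AER finish first.
         */
        if (pci_channel_offline(to_pci_dev(dev->dev)))
                return BLK_EH_RESET_TIMER;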

However, later, in 1e866afd4bcdd (nvme: ensure subsystem reset is single threaded), the
nvme subsystem reset code was reworked, so now when the user executes the subsystem-reset
command, the kernel first writes 0x4E564D65 to the nvme NSSR register and then schedules
the adapter reset. It's quite possible that when subsystem-reset is executed there is no
I/O in flight, and hence we may never hit nvme_timeout(). Later, when the adapter reset
code (under nvme_reset_work()) starts executing, it accesses MMIO registers. Hence, IMO,
nvme_reset_work() would also need changes similar to those implemented under nvme_timeout()
so that EEH recovery could be possible.
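
To illustrate the sequence described above, the post-1e866afd4bcd subsystem-reset
path looks roughly like the sketch below (paraphrased from the nvme core helpers
as I read them in mainline; names and details are approximate):

/* Sketch: write the NSSR magic value, then schedule the controller reset,
 * which later runs nvme_reset_work() and touches MMIO registers. */
static int nvme_subsystem_reset_sketch(struct nvme_ctrl *ctrl)
{
        int ret;

        if (!ctrl->subsystem)
                return -ENOTTY;
        if (!nvme_wait_reset(ctrl))     /* move to RESETTING, or bail if impossible */
                return -EBUSY;

        ret = ctrl->ops->reg_write32(ctrl, NVME_REG_NSSR, 0x4E564D65);
        if (ret)
                return ret;

        return nvme_try_sched_reset(ctrl);      /* queues nvme_reset_work() */
}

If no I/O is in flight at this point, the first MMIO access after the link drops
happens inside nvme_reset_work() rather than nvme_timeout(), which is exactly the
window this patch covers.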

> With the proposed patch, we find that EEH recovery is successful post subsystem-reset. 
> Please find below the relevant output:
> # lspci 
> 0524:28:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CM7 2.5" (rev 01)
> 
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
>                hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
>                iopolicy=numa
> \
>  +- nvme0 pcie 0524:28:00.0 live
> 
> # nvme subsystem-reset /dev/nvme0
> 
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
>                hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
>                iopolicy=numa
> \
>  +- nvme0 pcie 0524:28:00.0 resetting
> 
> [10556.034082] EEH: Recovering PHB#524-PE#280000
> [10556.034108] EEH: PE location: N/A, PHB location: N/A
> [10556.034112] EEH: Frozen PHB#524-PE#280000 detected
> [10556.034115] EEH: Call Trace:
> [10556.034117] EEH: [c000000000051068] __eeh_send_failure_event+0x7c/0x15c
> [10556.034304] EEH: [c000000000049bcc] eeh_dev_check_failure.part.0+0x27c/0x6b0
> [10556.034310] EEH: [c008000004753d3c] nvme_pci_reg_read32+0x80/0xac [nvme]
> [10556.034319] EEH: [c0080000045f365c] nvme_wait_ready+0xa4/0x18c [nvme_core]
> [10556.034333] EEH: [c008000004754750] nvme_dev_disable+0x370/0x41c [nvme]
> [10556.034338] EEH: [c008000004757184] nvme_reset_work+0x1f4/0x3cc [nvme]
> [10556.034344] EEH: [c00000000017bb8c] process_one_work+0x1f0/0x4f4
> [10556.034350] EEH: [c00000000017c24c] worker_thread+0x3bc/0x590
> [10556.034355] EEH: [c00000000018a87c] kthread+0x138/0x140
> [10556.034358] EEH: [c00000000000dd58] start_kernel_thread+0x14/0x18
> [10556.034363] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
> [10556.034368] EEH: Notify device drivers to shutdown
> [10556.034371] EEH: Beginning: 'error_detected(IO frozen)'
> [10556.034376] PCI 0524:28:00.0#280000: EEH: Invoking nvme->error_detected(IO frozen)
> [10556.034379] nvme nvme0: frozen state error detected, reset controller
> [10556.102654] nvme 0524:28:00.0: enabling device (0000 -> 0002)
> [10556.103171] nvme nvme0: PCI recovery is ongoing so let it finish
> [10556.142532] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'need reset'
> [10556.142535] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
> [...]
> [...]
> [10556.148172] EEH: Reset without hotplug activity
> [10558.298672] EEH: Beginning: 'slot_reset'
> [10558.298692] PCI 0524:28:00.0#280000: EEH: Invoking nvme->slot_reset()
> [10558.298696] nvme nvme0: restart after slot reset
> [10558.301925] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'recovered'
> [10558.301928] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
> [10558.301939] EEH: Notify device driver to resume
> [10558.301944] EEH: Beginning: 'resume'
> [10558.301947] PCI 0524:28:00.0#280000: EEH: Invoking nvme->resume()
> [10558.331051] nvme nvme0: Shutdown timeout set to 10 seconds
> [10558.356679] nvme nvme0: 16/0/0 default/read/poll queues
> [10558.357026] PCI 0524:28:00.0#280000: EEH: nvme driver reports: 'none'
> [10558.357028] EEH: Finished:'resume'
> [10558.357035] EEH: Recovery successful.
> 
> # nvme list-subsys
> nvme-subsys0 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:7DQ0A01206N3
>                hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
>                iopolicy=numa
> \
>  +- nvme0 pcie 0524:28:00.0 live
> 
> 
Thanks,
--Nilay


* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-02-27 18:29 ` Keith Busch
  2024-02-28 11:19   ` Nilay Shroff
@ 2024-03-06 11:20   ` Nilay Shroff
  2024-03-06 15:19     ` Keith Busch
  1 sibling, 1 reply; 17+ messages in thread
From: Nilay Shroff @ 2024-03-06 11:20 UTC (permalink / raw)
  To: Keith Busch, hch; +Cc: axboe, sagi, linux-nvme, linux-block, gjoyce

Hi Keith and Christoph,

On 2/27/24 23:59, Keith Busch wrote:
> On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
>> If the nvme subsystem reset causes the loss of communication to the nvme
>> adapter then EEH could potentially recover the adapter. The detection of
>> communication loss to the adapter only happens when the nvme driver
>> attempts to read an MMIO register.
>>
>> The nvme subsystem reset command writes 0x4E564D65 to the NSSR register and
>> schedules an adapter reset. If the nvme subsystem reset caused the loss
>> of communication to the nvme adapter then either an IO timeout event or
>> the adapter reset handler could detect it. If the IO timeout event detects
>> the loss of communication then the EEH handler is able to recover the
>> communication to the adapter. This change was implemented in 651438bb0af5
>> (nvme-pci: Fix EEH failure on ppc). However, if the adapter communication
>> loss is detected in the nvme reset work handler then EEH is unable to
>> successfully finish the adapter recovery.
>>
>> This patch ensures that,
>> - the nvme driver reset handler observes that the pci channel is offline
>>   after a failed MMIO read and avoids marking the controller state DEAD,
>>   thus giving the EEH handler a fair chance to recover the nvme adapter.
>>
>> - if the nvme controller is already in RESETTING state and a pci channel
>>   frozen error is detected then the nvme driver pci-error-handler code sends
>>   the correct error code (PCI_ERS_RESULT_NEED_RESET) back to the EEH handler
>>   so that the EEH handler can proceed with the pci slot reset.
> 
> A subsystem reset takes the link down. I'm pretty sure the proper way to
> recover from it requires pcie hotplug support.

This was working earlier, in kernel version 6.0.0: we were able to recover the
NVMe pcie adapter on powerpc after an nvme subsystem reset, provided some IO was
in flight when the subsystem reset happened. However, starting with kernel version
6.1.0 this is broken.
I've found the offending commit 1e866afd4bcd (nvme: ensure subsystem reset is
single threaded) causing this issue on kernel version 6.1.0 and above. So this
seems to be a regression and the proposed patch helps fix this bug.

Please find below logs captured for both working and non-working cases:

Working case (kernel version 6.0.0):
-----------------------------------
# uname -r
6.0.0

# nvme list-subsys
nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
               iopolicy=numa
\
 +- nvme0 pcie 0018:01:00.0 live

# nvme subsystem-reset /dev/nvme0

# dmesg
<snip>
<snip>

[ 3215.658378] EEH: Recovering PHB#18-PE#10000
[ 3215.658401] EEH: PE location: N/A, PHB location: N/A
[ 3215.658406] EEH: Frozen PHB#18-PE#10000 detected
[ 3215.658409] EEH: Call Trace:
[ 3215.658411] EEH: [c00000000005130c] __eeh_send_failure_event+0x7c/0x160
[ 3215.658577] EEH: [c00000000004a104] eeh_dev_check_failure.part.0+0x254/0x670
[ 3215.658583] EEH: [c0080000044e61bc] nvme_timeout+0x254/0x4f0 [nvme]
[ 3215.658591] EEH: [c00000000078d840] blk_mq_check_expired+0xa0/0x130
[ 3215.658602] EEH: [c00000000079a118] bt_iter+0xf8/0x140
[ 3215.658609] EEH: [c00000000079b29c] blk_mq_queue_tag_busy_iter+0x3cc/0x720
[ 3215.658620] EEH: [c00000000078fe74] blk_mq_timeout_work+0x84/0x1c0
[ 3215.658633] EEH: [c000000000173b08] process_one_work+0x2a8/0x570
[ 3215.658644] EEH: [c000000000173e68] worker_thread+0x98/0x5e0
[ 3215.658655] EEH: [c000000000181504] kthread+0x124/0x130
[ 3215.658666] EEH: [c00000000000cbd4] ret_from_kernel_thread+0x5c/0x64
[ 3215.658672] EEH: This PCI device has failed 5 times in the last hour and will be permanently disabled after 5 failures.
[ 3215.658677] EEH: Notify device drivers to shutdown
[ 3215.658681] EEH: Beginning: 'error_detected(IO frozen)'
[ 3215.658688] PCI 0018:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
[ 3215.658692] nvme nvme0: frozen state error detected, reset controller
[ 3215.788089] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'need reset'
[ 3215.788092] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
<snip>
<snip>
[ 3215.790666] EEH: Reset without hotplug activity
[ 3218.078715] EEH: Beginning: 'slot_reset'
[ 3218.078729] PCI 0018:01:00.0#10000: EEH: Invoking nvme->slot_reset()
[ 3218.078734] nvme nvme0: restart after slot reset
[ 3218.081088] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'recovered'
[ 3218.081090] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
[ 3218.081099] EEH: Notify device driver to resume
[ 3218.081101] EEH: Beginning: 'resume'
<snip>
[ 3218.161027] EEH: Finished:'resume'
[ 3218.161038] EEH: Recovery successful.

# nvme list-subsys
nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
               iopolicy=numa
\
 +- nvme0 pcie 0018:01:00.0 live

Non-working case (kernel version 6.1.0):
------------------------------------
# uname -r
6.1.0

# nvme list-subsys
nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
               iopolicy=numa
\
 +- nvme0 pcie 0018:01:00.0 live

# nvme subsystem-reset /dev/nvme0

#dmesg
[  177.578828] EEH: Recovering PHB#18-PE#10000
[  177.578852] EEH: PE location: N/A, PHB location: N/A
[  177.578858] EEH: Frozen PHB#18-PE#10000 detected
[  177.578869] EEH: Call Trace:
[  177.578872] EEH: [c0000000000510bc] __eeh_send_failure_event+0x7c/0x160
[  177.579206] EEH: [c000000000049eb4] eeh_dev_check_failure.part.0+0x254/0x670
[  177.579212] EEH: [c008000004c261cc] nvme_timeout+0x254/0x4e0 [nvme]
[  177.579221] EEH: [c00000000079cb00] blk_mq_check_expired+0xa0/0x130
[  177.579226] EEH: [c0000000007a9628] bt_iter+0xf8/0x140
[  177.579231] EEH: [c0000000007aa79c] blk_mq_queue_tag_busy_iter+0x3bc/0x6e0
[  177.579237] EEH: [c00000000079f324] blk_mq_timeout_work+0x84/0x1c0
[  177.579241] EEH: [c000000000174a28] process_one_work+0x2a8/0x570
[  177.579247] EEH: [c000000000174d88] worker_thread+0x98/0x5e0
[  177.579253] EEH: [c000000000182454] kthread+0x124/0x130
[  177.579257] EEH: [c00000000000cddc] ret_from_kernel_thread+0x5c/0x64
[  177.579263] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[  177.579269] EEH: Notify device drivers to shutdown
[  177.579272] EEH: Beginning: 'error_detected(IO frozen)'
[  177.579276] PCI 0018:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
[  177.579279] nvme nvme0: frozen state error detected, reset controller
[  177.658746] nvme 0018:01:00.0: enabling device (0000 -> 0002)
[  177.658967] nvme 0018:01:00.0: iommu: 64-bit OK but direct DMA is limited by 800000800000000
[  177.658982] nvme 0018:01:00.0: iommu: 64-bit OK but direct DMA is limited by 800000800000000
[  177.659059] nvme nvme0: Removing after probe failure status: -19
[  177.698719] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'need reset'
[  177.698723] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
<snip>
<snip>
[  179.999828] EEH: Beginning: 'slot_reset'
[  179.999840] PCI 0018:01:00.0#10000: EEH: no driver
[  179.999842] EEH: Finished:'slot_reset' with aggregate recovery state:'none'
[  179.999848] EEH: Notify device driver to resume
[  179.999850] EEH: Beginning: 'resume'
[  179.999853] PCI 0018:01:00.0#10000: EEH: no driver
<snip>

# nvme list-subsys
<empty>

Thanks,
--Nilay


* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-03-06 11:20   ` Nilay Shroff
@ 2024-03-06 15:19     ` Keith Busch
  0 siblings, 0 replies; 17+ messages in thread
From: Keith Busch @ 2024-03-06 15:19 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: hch, axboe, sagi, linux-nvme, linux-block, gjoyce

On Wed, Mar 06, 2024 at 04:50:10PM +0530, Nilay Shroff wrote:
> Hi Keith and Christoph,

Sorry for the delay, been very busy recently. I'll revisit this this
week.


* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-02-09  5:02 [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset Nilay Shroff
                   ` (2 preceding siblings ...)
  2024-02-27 18:29 ` Keith Busch
@ 2024-03-08 15:41 ` Keith Busch
  2024-03-09 14:29   ` Nilay Shroff
  3 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2024-03-08 15:41 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: axboe, hch, sagi, linux-nvme, linux-block, gjoyce

On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
> @@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work)
>   out_unlock:
>  	mutex_unlock(&dev->shutdown_lock);
>   out:
> +	/*
> +	 * If PCI recovery is ongoing then let it finish first
> +	 */
> +	if (pci_channel_offline(to_pci_dev(dev->dev))) {
> +		dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
> +		return;
> +	}
> +
>  	/*
>  	 * Set state to deleting now to avoid blocking nvme_wait_reset(), which
>  	 * may be holding this pci_dev's device lock.
> @@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>  	case pci_channel_io_frozen:
>  		dev_warn(dev->ctrl.device,
>  			"frozen state error detected, reset controller\n");
> -		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> -			nvme_dev_disable(dev, true);
> -			return PCI_ERS_RESULT_DISCONNECT;
> +		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
> +			if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> +				nvme_dev_disable(dev, true);
> +				return PCI_ERS_RESULT_DISCONNECT;
> +			}
>  		}
>  		nvme_dev_disable(dev, false);
>  		return PCI_ERS_RESULT_NEED_RESET;

I get what you're trying to do, but it looks racy. The reset_work may
finish before pci sets channel offline, or the error handling work
happens to see RESETTING state, but then transitions to CONNECTING state
after and deadlocks on the '.resume()' side. You are counting on a very
specific sequence tied to the PCIe error handling module, and maybe you
are able to count on that sequence for your platform in this unique
scenario, but these link errors could happen anytime.

And nvme subsystem reset is just odd, it's not clear how it was intended
to be handled. It takes the links down so seems like it requires
re-enumeration from a pcie hotplug driver, and that's kind of how it was
expected to work here, but your platform has a special way to contain
the link event and bring things back up the way they were before. And
the fact you *require* IO to be in flight just so the timeout handler
can dispatch a non-posted transaction 30 seconds later to trigger EEH is
also odd. Why can't EEH just detect the link down event directly?

This driver unfortunately doesn't handle errors during a reset well.
Trying to handle that has been problematic, so the driver just bails if
anything goes wrong at this critical initialization point. Maybe we need
to address the reset/initialization failure handling more generically
and delegate the teardown or retry decision to something else. Messing
with that is pretty fragile right now, though.

Or you could just re-enumerate the slot.

I don't know, sorry my message is not really helping much to get this
fixed.


* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-03-08 15:41 ` Keith Busch
@ 2024-03-09 14:29   ` Nilay Shroff
  2024-03-09 15:44     ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Nilay Shroff @ 2024-03-09 14:29 UTC (permalink / raw)
  To: linux-nvme



On 3/8/24 21:11, Keith Busch wrote:
> On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
>> @@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work)
>>   out_unlock:
>>  	mutex_unlock(&dev->shutdown_lock);
>>   out:
>> +	/*
>> +	 * If PCI recovery is ongoing then let it finish first
>> +	 */
>> +	if (pci_channel_offline(to_pci_dev(dev->dev))) {
>> +		dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
>> +		return;
>> +	}
>> +
>>  	/*
>>  	 * Set state to deleting now to avoid blocking nvme_wait_reset(), which
>>  	 * may be holding this pci_dev's device lock.
>> @@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>>  	case pci_channel_io_frozen:
>>  		dev_warn(dev->ctrl.device,
>>  			"frozen state error detected, reset controller\n");
>> -		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>> -			nvme_dev_disable(dev, true);
>> -			return PCI_ERS_RESULT_DISCONNECT;
>> +		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
>> +			if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>> +				nvme_dev_disable(dev, true);
>> +				return PCI_ERS_RESULT_DISCONNECT;
>> +			}
>>  		}
>>  		nvme_dev_disable(dev, false);
>>  		return PCI_ERS_RESULT_NEED_RESET;
> 
> I get what you're trying to do, but it looks racy. The reset_work may
> finish before pci sets channel offline, or the error handling work
> happens to see RESETTING state, but then transitions to CONNECTING state
> after and deadlocks on the '.resume()' side. You are counting on a very
> specific sequence tied to the PCIe error handling module, and maybe you
> are able to count on that sequence for your platform in this unique
> scenario, but these link errors could happen anytime.
> 
I am not sure about the deadlock on the '.resume()' side you mentioned above.
Did you mean that the deadlock occurs due to someone holding this pci_dev's device lock?
Or that the deadlock occurs because the flush_work() from nvme_error_resume() would
never return?

If the reset_work finishes before pci sets the channel offline and nvme_timeout()
detects the pci channel offline, then we would fall through the below sequence of events:
eeh_event_handler()
  ->nvme_error_detected() => set ctrl state to RESETTING
    ->nvme_slot_reset() => it schedules reset_work
      ->nvme_error_resume() => it waits until reset_work finishes
      "Device recovers"

If the error handling work happens to see the RESETTING state (it means that reset_work
has been scheduled or might have started running) then the controller state can't
transition to the CONNECTING state. In this case, the reset_work would not finish, as the
reset_work would try accessing the pci iomem and that would fail before setting
the controller state to CONNECTING. The reset_work invokes nvme_dev_disable() and
nvme_pci_enable() (before setting the controller state to CONNECTING) and both these
functions access pci iomem space.
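
For reference, the '.resume()' side referred to above is tiny in nvme-pci; as far
as I can tell from mainline it just flushes reset_work (sketch below, details may
differ by kernel version):

static void nvme_error_resume(struct pci_dev *pdev)
{
        struct nvme_dev *dev = pci_get_drvdata(pdev);

        /* Do not report recovery complete until any scheduled controller
         * reset has finished. */
        flush_work(&dev->ctrl.reset_work);
}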

> And nvme subsystem reset is just odd, it's not clear how it was intended
> to be handled. It takes the links down so seems like it requires
> re-enumeration from a pcie hotplug driver, and that's kind of how it was
> expected to work here, but your platform has a special way to contain
> the link event and bring things back up the way they were before. And
> the fact you *require* IO to be in flight just so the timeout handler
> can dispatch a non-posted transaction 30 seconds later to trigger EEH is
> also odd. Why can't EEH just detect the link down event directly?
I think EEH could detect the link down event; however, the eeh recovery
mechanism piggy-backs on the PCI hotplug infrastructure so that device drivers
do not need to be specifically modified to support EEH recovery on the ppc
architecture. Hence the eeh recovery is triggered when the driver tries to
access pci IO memory space, and that starts the kernel's generic pci error
recovery.
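
As an example of that trigger path, the register read itself is an ordinary
readl() in the driver; on powerpc with EEH enabled the I/O accessors check for an
all-ones result and call into the EEH failure detection, which is what starts the
generic recovery. Sketch below (nvme_pci_reg_read32() is the driver helper seen in
the traces; the comment describes my understanding of the powerpc side):

static int nvme_pci_reg_read32(struct nvme_ctrl *ctrl, u32 off, u32 *val)
{
        /*
         * On a frozen PE this readl() returns all 1s; the powerpc EEH-aware
         * accessors then invoke eeh_dev_check_failure(), which queues the
         * recovery event seen at the top of the traces in this thread.
         */
        *val = readl(to_nvme_dev(ctrl)->bar + off);
        return 0;
}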

> 
> This driver unfortunately doesn't handle errors during a reset well.
> Trying to handle that has been problematic, so the driver just bails if
> anything goes wrong at this critical initialization point. Maybe we need
> to address the reset/initialization failure handling more generically
> and delegate the teardown or retry decision to something else. Messing
> with that is pretty fragile right now, though.
> 
> Or you could just re-enumerate the slot.
Just another thought: how about serializing reset_work and the error handling work?
> 
> I don't know, sorry my message is not really helping much to get this
> fixed.
Yeah, I know this is fragile right now and so it would require 
detailed review and discussion.

Thanks,
--Nilay



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-03-09 14:29   ` Nilay Shroff
@ 2024-03-09 15:44     ` Keith Busch
  2024-03-09 19:05       ` Nilay Shroff
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2024-03-09 15:44 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: linux-nvme

On Sat, Mar 09, 2024 at 07:59:11PM +0530, Nilay Shroff wrote:
> On 3/8/24 21:11, Keith Busch wrote:
> > On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
> >> @@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work)
> >>   out_unlock:
> >>  	mutex_unlock(&dev->shutdown_lock);
> >>   out:
> >> +	/*
> >> +	 * If PCI recovery is ongoing then let it finish first
> >> +	 */
> >> +	if (pci_channel_offline(to_pci_dev(dev->dev))) {
> >> +		dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
> >> +		return;
> >> +	}
> >> +
> >>  	/*
> >>  	 * Set state to deleting now to avoid blocking nvme_wait_reset(), which
> >>  	 * may be holding this pci_dev's device lock.
> >> @@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
> >>  	case pci_channel_io_frozen:
> >>  		dev_warn(dev->ctrl.device,
> >>  			"frozen state error detected, reset controller\n");
> >> -		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> >> -			nvme_dev_disable(dev, true);
> >> -			return PCI_ERS_RESULT_DISCONNECT;
> >> +		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
> >> +			if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> >> +				nvme_dev_disable(dev, true);
> >> +				return PCI_ERS_RESULT_DISCONNECT;
> >> +			}
> >>  		}
> >>  		nvme_dev_disable(dev, false);
> >>  		return PCI_ERS_RESULT_NEED_RESET;
> > 
> > I get what you're trying to do, but it looks racy. The reset_work may
> > finish before pci sets channel offline, or the error handling work
> > happens to see RESETTING state, but then transitions to CONNECTING state
> > after and deadlocks on the '.resume()' side. You are counting on a very
> > specific sequence tied to the PCIe error handling module, and maybe you
> > are able to count on that sequence for your platform in this unique
> > scenario, but these link errors could happen anytime.
> > 
> I am not sure about the deadlock on the '.resume()' side you mentioned above.
> Did you mean that the deadlock occurs due to someone holding this pci_dev's device lock?
> Or that the deadlock occurs because the flush_work() from nvme_error_resume() would never
> return?

Your patch may observe a ctrl in "RESETTING" state from
error_detected(), then disable the controller, which quiesces the admin
queue. Meanwhile, reset_work may proceed to CONNECTING state and try
nvme_submit_sync_cmd(), which blocks forever because no one is going to
unquiesce that admin queue.
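
Spelled out, the problematic interleaving is roughly the following (a simplified
sketch: the calls are the real driver entry points, the bodies and context are
paraphrased):

	/* CPU A: nvme_error_detected(), ctrl already observed in RESETTING */
	nvme_dev_disable(dev, false);	/* quiesces the admin queue; nothing will
					 * unquiesce it until slot reset / resume */

	/* CPU B: nvme_reset_work(), racing with the above */
	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING);
	nvme_submit_sync_cmd(dev->ctrl.admin_q, &cmd, NULL, 0);
					/* blocks forever: the request sits on the
					 * quiesced admin queue */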


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-03-09 15:44     ` Keith Busch
@ 2024-03-09 19:05       ` Nilay Shroff
  2024-03-11  4:41         ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Nilay Shroff @ 2024-03-09 19:05 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, axboe, hch, sagi, linux-block, gjoyce



On 3/9/24 21:14, Keith Busch wrote:
> On Sat, Mar 09, 2024 at 07:59:11PM +0530, Nilay Shroff wrote:
>> On 3/8/24 21:11, Keith Busch wrote:
>>> On Fri, Feb 09, 2024 at 10:32:16AM +0530, Nilay Shroff wrote:
>>>> @@ -2776,6 +2776,14 @@ static void nvme_reset_work(struct work_struct *work)
>>>>   out_unlock:
>>>>  	mutex_unlock(&dev->shutdown_lock);
>>>>   out:
>>>> +	/*
>>>> +	 * If PCI recovery is ongoing then let it finish first
>>>> +	 */
>>>> +	if (pci_channel_offline(to_pci_dev(dev->dev))) {
>>>> +		dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
>>>> +		return;
>>>> +	}
>>>> +
>>>>  	/*
>>>>  	 * Set state to deleting now to avoid blocking nvme_wait_reset(), which
>>>>  	 * may be holding this pci_dev's device lock.
>>>> @@ -3295,9 +3303,11 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>>>>  	case pci_channel_io_frozen:
>>>>  		dev_warn(dev->ctrl.device,
>>>>  			"frozen state error detected, reset controller\n");
>>>> -		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>>>> -			nvme_dev_disable(dev, true);
>>>> -			return PCI_ERS_RESULT_DISCONNECT;
>>>> +		if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
>>>> +			if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>>>> +				nvme_dev_disable(dev, true);
>>>> +				return PCI_ERS_RESULT_DISCONNECT;
>>>> +			}
>>>>  		}
>>>>  		nvme_dev_disable(dev, false);
>>>>  		return PCI_ERS_RESULT_NEED_RESET;
>>>
>>> I get what you're trying to do, but it looks racy. The reset_work may
>>> finish before pci sets channel offline, or the error handling work
>>> happens to see RESETTING state, but then transitions to CONNECTING state
>>> after and deadlocks on the '.resume()' side. You are counting on a very
>>> specific sequence tied to the PCIe error handling module, and maybe you
>>> are able to count on that sequence for your platform in this unique
>>> scenario, but these link errors could happen anytime.
>>>
>> I am not sure about the deadlock on the '.resume()' side you mentioned above.
>> Did you mean that the deadlock occurs due to someone holding this pci_dev's device lock?
>> Or that the deadlock occurs because the flush_work() from nvme_error_resume() would never
>> return?
> 
> Your patch may observe a ctrl in "RESETTING" state from
> error_detected(), then disable the controller, which quiesces the admin
> queue. Meanwhile, reset_work may proceed to CONNECTING state and try
> nvme_submit_sync_cmd(), which blocks forever because no one is going to
> unquiesce that admin queue.
> 
OK I think I got your point. However, it seems that even without my patch
the above mentioned deadlock could still be possible. 
Without my patch, if error_detected() observes a ctrl in "RESETTING" state then
it still invokes nvme_dev_disable(). The only difference with my patch is that
error_detected() returns PCI_ERS_RESULT_NEED_RESET instead of PCI_ERS_RESULT_DISCONNECT.

Regarding the deadlock, it appears to me that reset_work races with nvme_dev_disable(),
and we may want to extend the shutdown_lock in reset_work so that nvme_dev_disable()
can't interfere with the admin queue while reset_work accesses it. I think
we can fix this case. I would send PATCH v2 with this fix for review; however, please let
me know if you have any other concerns before I spin a new patch.
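
Roughly what I have in mind is the following (just a sketch of the idea, not a tested
patch; shutdown_lock and the helpers shown already exist in pci.c, and the exact region
to cover is the open question):

	/* in nvme_reset_work(): keep shutdown_lock held across the admin queue
	 * bring-up and the early admin commands, so that a concurrent
	 * nvme_dev_disable() cannot quiesce the admin queue underneath us */
	mutex_lock(&dev->shutdown_lock);
	result = nvme_pci_enable(dev);
	if (result)
		goto out_unlock;
	/* ... admin queue setup and the sync admin commands issued during init ... */
	mutex_unlock(&dev->shutdown_lock);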

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-03-09 19:05       ` Nilay Shroff
@ 2024-03-11  4:41         ` Keith Busch
  2024-03-11 12:58           ` Nilay Shroff
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2024-03-11  4:41 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: linux-nvme, axboe, hch, sagi, linux-block, gjoyce

On Sun, Mar 10, 2024 at 12:35:06AM +0530, Nilay Shroff wrote:
> On 3/9/24 21:14, Keith Busch wrote:
> > Your patch may observe a ctrl in "RESETTING" state from
> > error_detected(), then disable the controller, which quiesces the admin
> > queue. Meanwhile, reset_work may proceed to CONNECTING state and try
> > nvme_submit_sync_cmd(), which blocks forever because no one is going to
> > unquiesce that admin queue.
> > 
> OK I think I got your point. However, it seems that even without my patch
> the above mentioned deadlock could still be possible. 

I sure hope not. The current design should guarantee forward progress on
initialization-failed devices.

> Without my patch, if error_detected() observes a ctrl in "RESETTING" state then
> it still invokes nvme_dev_disable(). The only difference with my patch is that
> error_detected() returns PCI_ERS_RESULT_NEED_RESET instead of PCI_ERS_RESULT_DISCONNECT.

There's one more subtle difference: that condition disables with the
'shutdown' parameter set to 'true' which accomplishes a couple things:
all entered requests are flushed to their demise via the final
unquiesce, and all request_queue's are killed which forces error returns
for all new request allocations. No thread will be left waiting for
something that won't happen.
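
As a rough illustration of that difference (this just paraphrases the behaviour
described above as comments; it is not the literal nvme_dev_disable() code):

	nvme_dev_disable(dev, true);	/* shutdown: the final unquiesce flushes every
					 * entered request to completion or error, and
					 * the request_queues are killed so new
					 * allocations fail fast instead of blocking */

	nvme_dev_disable(dev, false);	/* reset path: the queues are left quiesced and
					 * stay that way until a later reset/resume
					 * unquiesces them, so anything submitted in the
					 * meantime waits */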

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-03-11  4:41         ` Keith Busch
@ 2024-03-11 12:58           ` Nilay Shroff
  2024-03-12 14:30             ` Keith Busch
  0 siblings, 1 reply; 17+ messages in thread
From: Nilay Shroff @ 2024-03-11 12:58 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, axboe, hch, sagi, linux-block, gjoyce



On 3/11/24 10:11, Keith Busch wrote:
> On Sun, Mar 10, 2024 at 12:35:06AM +0530, Nilay Shroff wrote:
>> On 3/9/24 21:14, Keith Busch wrote:
>>> Your patch may observe a ctrl in "RESETTING" state from
>>> error_detected(), then disable the controller, which quiesces the admin
>>> queue. Meanwhile, reset_work may proceed to CONNECTING state and try
>>> nvme_submit_sync_cmd(), which blocks forever because no one is going to
>>> unquiesce that admin queue.
>>>
>> OK I think I got your point. However, it seems that even without my patch
>> the above mentioned deadlock could still be possible. 
> 
> I sure hope not. The current design should guarantee forward progress on
> initialization-failed devices.
> 
>> Without my patch, if error_detected() observes a ctrl in "RESETTING" state then
>> it still invokes nvme_dev_disable(). The only difference with my patch is that
>> error_detected() returns PCI_ERS_RESULT_NEED_RESET instead of PCI_ERS_RESULT_DISCONNECT.
> 
> There's one more subtle difference: that condition disables with the
> 'shutdown' parameter set to 'true' which accomplishes a couple things:
> all entered requests are flushed to their demise via the final
> unquiesce, and all request_queue's are killed which forces error returns
> for all new request allocations. No thread will be left waiting for
> something that won't happen.
> 
Aaargh, I didn't notice that subtle difference. I got all your points.
After thinking about it for a while (as you suggested), it seems that we need to
contain the race between the reset_work and error_detected code paths, as both could
run in parallel on different cpus when pci recovery initiates.
So I'm thinking: can we hold off the error_detected() code path until reset_work is
finished? Particularly, in the error_detected() function, if we fall through to the
pci_channel_io_frozen case and the ctrl state is already RESETTING (which means that
reset_work must be running), then hold off invoking nvme_dev_disable(dev, false) until
reset_work is finished. The changes would be something like below:

@@ -3295,10 +3304,13 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
        case pci_channel_io_frozen:
                dev_warn(dev->ctrl.device,
                        "frozen state error detected, reset controller\n");
-               if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
-                       nvme_dev_disable(dev, true);
-                       return PCI_ERS_RESULT_DISCONNECT;
+               if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
+                       if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+                               nvme_dev_disable(dev, true);
+                               return PCI_ERS_RESULT_DISCONNECT;
+                       }
                }
+               flush_work(&dev->ctrl.reset_work);
                nvme_dev_disable(dev, false);
                return PCI_ERS_RESULT_NEED_RESET;
        case pci_channel_io_perm_failure:

The flush_work() would ensure that we don't disable the ctrl while reset_work
is running. If reset_work is *not* currently running then flush_work() should
return immediately. Moreover, if reset_work is scheduled or starts running after
flush_work() returns, then reset_work should not be able to get up to the CONNECTING
state because pci recovery is in progress, and so it should fail early.

On the reset_work side, other than detecting pci error recovery, I think we also
need one more change: in case the ctrl state is set to CONNECTING and we
detect pci error recovery in progress, then before returning from reset_work
we set the ctrl state to RESETTING so that error_detected() can make forward progress.
The changes would be something like below:

@@ -2776,6 +2776,16 @@ static void nvme_reset_work(struct work_struct *work)
  out_unlock:
        mutex_unlock(&dev->shutdown_lock);
  out:
+       /*
+        * If PCI recovery is ongoing then let it finish first
+        */
+       if (pci_channel_offline(to_pci_dev(dev->dev))) {
+               dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
+               if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING)
+                       WRITE_ONCE(dev->ctrl.state, NVME_CTRL_RESETTING);
+               return;
+       }
+
        /*
         * Set state to deleting now to avoid blocking nvme_wait_reset(), which
         * may be holding this pci_dev's device lock.

Please let me know if you see anything wrong with the above changes.

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-03-11 12:58           ` Nilay Shroff
@ 2024-03-12 14:30             ` Keith Busch
  2024-03-13 11:59               ` Nilay Shroff
  0 siblings, 1 reply; 17+ messages in thread
From: Keith Busch @ 2024-03-12 14:30 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: linux-nvme, axboe, hch, sagi, linux-block, gjoyce

On Mon, Mar 11, 2024 at 06:28:21PM +0530, Nilay Shroff wrote:
> @@ -3295,10 +3304,13 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>         case pci_channel_io_frozen:
>                 dev_warn(dev->ctrl.device,
>                         "frozen state error detected, reset controller\n");
> -               if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> -                       nvme_dev_disable(dev, true);
> -                       return PCI_ERS_RESULT_DISCONNECT;
> +               if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
> +                       if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
> +                               nvme_dev_disable(dev, true);
> +                               return PCI_ERS_RESULT_DISCONNECT;
> +                       }
>                 }
> +               flush_work(&dev->ctrl.reset_work);

I was messing with a similar idea. I wasn't sure if EEH calls the error
handler inline with the error, in which case this would try to flush the
work within the same work, which obviously doesn't work. As long as it's
called from a different thread, then this should be fine.
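
If that ever did become a concern, one way to guard against it would be the generic
workqueue helper current_work(); this is only a hedged sketch for illustration, not
something being proposed for this patch:

	/* only safe to flush when we are not running inside reset_work itself */
	if (current_work() != &dev->ctrl.reset_work)
		flush_work(&dev->ctrl.reset_work);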

>                 nvme_dev_disable(dev, false);
>                 return PCI_ERS_RESULT_NEED_RESET;
>         case pci_channel_io_perm_failure:
> 
> The flush_work() would ensure that we don't disable the ctrl while reset_work
> is running. If reset_work is *not* currently running then flush_work() should
> return immediately. Moreover, if reset_work is scheduled or starts running after
> flush_work() returns, then reset_work should not be able to get up to the CONNECTING
> state because pci recovery is in progress, and so it should fail early.
> 
> On the reset_work side, other than detecting pci error recovery, I think we also
> need one more change: in case the ctrl state is set to CONNECTING and we
> detect pci error recovery in progress, then before returning from reset_work
> we set the ctrl state to RESETTING so that error_detected() can make forward progress.
> The changes would be something like below:
> 
> @@ -2776,6 +2776,16 @@ static void nvme_reset_work(struct work_struct *work)
>   out_unlock:
>         mutex_unlock(&dev->shutdown_lock);
>   out:
> +       /*
> +        * If PCI recovery is ongoing then let it finish first
> +        */
> +       if (pci_channel_offline(to_pci_dev(dev->dev))) {
> +               dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
> +               if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING)
> +                       WRITE_ONCE(dev->ctrl.state, NVME_CTRL_RESETTING);

This may break the state machine, like if the device was hot removed
during all this error handling. This will force the state back to
RESETTING when it should be DEAD.

I think what you need is to just allow a controller to reset from a
connecting state. Have to be careful that wouldn't break any other
expectations, though.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-03-12 14:30             ` Keith Busch
@ 2024-03-13 11:59               ` Nilay Shroff
  2024-03-22  5:02                 ` Nilay Shroff
  0 siblings, 1 reply; 17+ messages in thread
From: Nilay Shroff @ 2024-03-13 11:59 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, axboe, hch, sagi, linux-block, gjoyce



On 3/12/24 20:00, Keith Busch wrote:
> On Mon, Mar 11, 2024 at 06:28:21PM +0530, Nilay Shroff wrote:
>> @@ -3295,10 +3304,13 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
>>         case pci_channel_io_frozen:
>>                 dev_warn(dev->ctrl.device,
>>                         "frozen state error detected, reset controller\n");
>> -               if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>> -                       nvme_dev_disable(dev, true);
>> -                       return PCI_ERS_RESULT_DISCONNECT;
>> +               if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING) {
>> +                       if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
>> +                               nvme_dev_disable(dev, true);
>> +                               return PCI_ERS_RESULT_DISCONNECT;
>> +                       }
>>                 }
>> +               flush_work(&dev->ctrl.reset_work);
> 
> I was messing with a similar idea. I wasn't sure if EEH calls the error
> handler inline with the error, in which case this would try to flush the
> work within the same work, which obviously doesn't work. As long as it's
> called from a different thread, then this should be fine.
The EEH recovery happens from a different thread, so flush_work() should
work here as expected.

>> @@ -2776,6 +2776,16 @@ static void nvme_reset_work(struct work_struct *work)
>>   out_unlock:
>>         mutex_unlock(&dev->shutdown_lock);
>>   out:
>> +       /*
>> +        * If PCI recovery is ongoing then let it finish first
>> +        */
>> +       if (pci_channel_offline(to_pci_dev(dev->dev))) {
>> +               dev_warn(dev->ctrl.device, "PCI recovery is ongoing so let it finish\n");
>> +               if (nvme_ctrl_state(&dev->ctrl) != NVME_CTRL_RESETTING)
>> +                       WRITE_ONCE(dev->ctrl.state, NVME_CTRL_RESETTING);
> 
> This may break the state machine, like if the device was hot removed
> during all this error handling. This will force the state back to
> RESETTING when it should be DEAD.
Agreed, we shouldn't force the ctrl state to RESETTING here.
> 
> I think what you need is to just allow a controller to reset from a
> connecting state. Have to be careful that wouldn't break any other
> expectations, though.
Yeah, OK, got your point. So I have reworked the ctrl state machine as you
suggested and tested the changes. The updated state machine code is shown
below:

@@ -546,10 +546,11 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
                break;
        case NVME_CTRL_RESETTING:
                switch (old_state) {
                case NVME_CTRL_NEW:
                case NVME_CTRL_LIVE:
+               case NVME_CTRL_CONNECTING:
                        changed = true;
                        fallthrough;
                default:
                        break;
                }

And the accordingly updated reset_work function is shown below. Here we ensure that
even though pci error recovery is in progress, if we couldn't move the ctrl state
to RESETTING then we let reset_work make forward progress and set the ctrl state
to DEAD.

@@ -2774,10 +2774,19 @@ static void nvme_reset_work(struct work_struct *work)
        return;
 
  out_unlock:
        mutex_unlock(&dev->shutdown_lock);
  out:
+       /*
+        * If PCI recovery is ongoing then let it finish first
+        */
+       if (pci_channel_offline(to_pci_dev(dev->dev))) {
+               if (nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
+                       dev_warn(dev->ctrl.device, "PCI recovery is ongoing, let it finish\n");
+                       return;
+               }
+       }
        /*
         * Set state to deleting now to avoid blocking nvme_wait_reset(), which
         * may be holding this pci_dev's device lock.
         */
        dev_warn(dev->ctrl.device, "Disabling device after reset failure: %d\n",


Now I have also included in my test the hot removal of the NVMe adapter while EEH
recovery is in progress, and the EEH recovery code handles this case well: when EEH
recovery is in progress and we hot-remove the adapter being recovered, the EEH handler
stops the recovery and sets the PCI channel state to "pci_channel_io_perm_failure".


Collected the logs of this case, (shown below):
-----------------------------------------------
# nvme list-subsys 
nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
               iopolicy=numa

# lspci 
0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X

# dmesg
[  561.639102] EEH: Recovering PHB#18-PE#10000
[  561.639120] EEH: PE location: N/A, PHB location: N/A
[  561.639128] EEH: Frozen PHB#18-PE#10000 detected
<snip>
<snip>
[  561.639428] EEH: This PCI device has failed 2 times in the last hour and will be permanently disabled after 5 failures.
[  561.639441] EEH: Notify device drivers to shutdown
[  561.639450] EEH: Beginning: 'error_detected(IO frozen)'
[  561.639458] PCI 0018:01:00.0#10000: EEH: Invoking nvme->error_detected(IO frozen)
[  561.639462] nvme nvme0: frozen state error detected, reset controller
[  561.719078] nvme 0018:01:00.0: enabling device (0000 -> 0002)
[  561.719318] nvme nvme0: PCI recovery is ongoing so let it finish

<!!!! WHILE EEH RECOVERY IS IN PROGRESS WE HOT REMOVE THE NVMe ADAPTER !!!!>

[  563.850328] rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
<snip>
[  571.879092] PCI 0018:01:00.0#10000: EEH: nvme driver reports: 'need reset'
[  571.879097] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
<snip>
<snip>
[  571.881761] EEH: Reset without hotplug activity
[  574.039807] EEH: PHB#18-PE#10000: Slot inactive after reset: 0x2 (attempt 1)
[  574.309091] EEH: Failure 2 resetting PHB#18-PE#10000 (attempt 2)
[  574.309091] 
[  574.579094] EEH: Failure 2 resetting PHB#18-PE#10000 (attempt 3)
[  574.579094] 
[  574.579101] eeh_handle_normal_event: Cannot reset, err=-5
[  574.579104] EEH: Unable to recover from failure from PHB#18-PE#10000.
[  574.579104] Please try reseating or replacing it
<snip>
<snip>
[  574.581314] EEH: Beginning: 'error_detected(permanent failure)'
[  574.581320] PCI 0018:01:00.0#10000: EEH: no device

# lspci
<empty>

# nvme list-subsys
<empty>


Another case tested: when reset_work is ongoing after the subsystem-reset command
and the user immediately hot-removes the NVMe adapter, then EEH recovery is *not*
triggered and the ctrl makes forward progress to the "DEAD" state.

Collected the logs of this case, (shown below):
-----------------------------------------------
# nvme list-subsys 
nvme-subsys0 - NQN=nqn.1994-11.com.samsung:nvme:PM1735:2.5-inch:S6EUNA0R500358
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:12b49f6e-0276-4746-b10c-56815b7e6dc2
               iopolicy=numa

# lspci 
0018:01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X

# nvme subsystem-reset /dev/nvme0

<!!!! HOT REMOVE THE NVMe ADAPTER !!!!>

# dmesg
[ 9967.143886] eeh_handle_normal_event: Cannot find PCI bus for PHB#18-PE#10000
[ 9967.224078] eeh_handle_normal_event: Cannot find PCI bus for PHB#18-PE#10000
<snip>
[ 9967.223858] nvme 0018:01:00.0: enabling device (0000 -> 0002)
[ 9967.224140] nvme nvme0: Disabling device after reset failure: -19

# lspci
<empty>

# nvme list-subsys
<empty>


Please let me know if the above changes look good to you. If they do, I will
spin a new version of the patch and send it for review.

Thanks,
--Nilay



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset
  2024-03-13 11:59               ` Nilay Shroff
@ 2024-03-22  5:02                 ` Nilay Shroff
  0 siblings, 0 replies; 17+ messages in thread
From: Nilay Shroff @ 2024-03-22  5:02 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme, axboe, hch, sagi, linux-block, gjoyce

Hi Keith,

A gentle ping. I don't know whether you got a chance to review the last email on this subject.
Please let me know your feedback/thoughts.

Thanks,
--Nilay

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-03-22  5:02 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-09  5:02 [PATCH RESEND] nvme-pci: Fix EEH failure on ppc after subsystem reset Nilay Shroff
2024-02-22 21:00 ` Greg Joyce
     [not found] ` <2c76725c-7bb6-4827-b45a-dbe1acbefba7@imap.linux.ibm.com>
2024-02-27 18:14   ` Nilay Shroff
2024-02-27 18:29 ` Keith Busch
2024-02-28 11:19   ` Nilay Shroff
2024-02-29 12:27     ` Nilay Shroff
2024-03-06 11:20   ` Nilay Shroff
2024-03-06 15:19     ` Keith Busch
2024-03-08 15:41 ` Keith Busch
2024-03-09 14:29   ` Nilay Shroff
2024-03-09 15:44     ` Keith Busch
2024-03-09 19:05       ` Nilay Shroff
2024-03-11  4:41         ` Keith Busch
2024-03-11 12:58           ` Nilay Shroff
2024-03-12 14:30             ` Keith Busch
2024-03-13 11:59               ` Nilay Shroff
2024-03-22  5:02                 ` Nilay Shroff
