* [PATCH v4 1/1] nvme: handle persistent internal error AER from NVMe controller
@ 2022-06-08 18:52 Michael Kelley
  2022-06-09  0:22 ` Chaitanya Kulkarni
  2022-06-13 18:08 ` Christoph Hellwig
  0 siblings, 2 replies; 6+ messages in thread
From: Michael Kelley @ 2022-06-08 18:52 UTC (permalink / raw)
  To: kbusch, axboe, hch, sagi, linux-nvme, linux-kernel
  Cc: mikelley, caroline.subramoney, riwurd, nathan.obr

In the NVM Express Revision 1.4 spec, Figure 145 describes possible
values for an AER with event type "Error" (value 000b). For a
Persistent Internal Error (value 03h), the host should perform a
controller reset.

Add support for this error using code that already exists for
doing a controller reset. As part of this support, introduce
two utility functions for parsing the AER type and subtype.

This new support was tested in a lab environment where we can
generate the persistent internal error on demand, and observe
both the Linux side and NVMe controller side to see that the
controller reset has been done.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
---
Changes since v3:
* Removed call to nvme_should_reset() and dropped the original
  patch 1/2 that moved nvme_should_reset() from pci.c to core.c
  [Christoph Hellwig]

Changes since v2:
* Instead of reading CSTS, use a constant value as input to
  nvme_should_reset() [Keith Busch]
* Introduce helper functions for parsing the AER result fields
  [Keith Busch]

 drivers/nvme/host/core.c | 31 +++++++++++++++++++++++++++++--
 include/linux/nvme.h     |  4 ++++
 2 files changed, 33 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 72f7c95..bb8c91e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -4502,9 +4502,19 @@ static void nvme_fw_act_work(struct work_struct *work)
 	nvme_get_fw_slot_info(ctrl);
 }
 
+static u32 nvme_aer_type(u32 result)
+{
+	return result & 0x7;
+}
+
+static u32 nvme_aer_subtype(u32 result)
+{
+	return (result & 0xff00) >> 8;
+}
+
 static void nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
 {
-	u32 aer_notice_type = (result & 0xff00) >> 8;
+	u32 aer_notice_type = nvme_aer_subtype(result);
 
 	trace_nvme_async_event(ctrl, aer_notice_type);
 
@@ -4537,11 +4547,19 @@ static void nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
 	}
 }
 
+static void nvme_handle_aer_persistent_error(struct nvme_ctrl *ctrl)
+{
+	trace_nvme_async_event(ctrl, NVME_AER_ERROR);
+	dev_warn(ctrl->device, "resetting controller due to AER\n");
+	nvme_reset_ctrl(ctrl);
+}
+
 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		volatile union nvme_result *res)
 {
 	u32 result = le32_to_cpu(res->u32);
-	u32 aer_type = result & 0x07;
+	u32 aer_type = nvme_aer_type(result);
+	u32 aer_subtype = nvme_aer_subtype(result);
 
 	if (le16_to_cpu(status) >> 1 != NVME_SC_SUCCESS)
 		return;
@@ -4551,6 +4569,15 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		nvme_handle_aen_notice(ctrl, result);
 		break;
 	case NVME_AER_ERROR:
+		/*
+		 * For a persistent internal error, don't run async_event_work
+		 * to submit a new AER. The controller reset will do it.
+		 */
+		if (aer_subtype == NVME_AER_ERROR_PERSIST_INT_ERR) {
+			nvme_handle_aer_persistent_error(ctrl);
+			return;
+		}
+		fallthrough;
 	case NVME_AER_SMART:
 	case NVME_AER_CSS:
 	case NVME_AER_VS:
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 29ec3e3..8ced243 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -712,6 +712,10 @@ enum {
 };
 
 enum {
+	NVME_AER_ERROR_PERSIST_INT_ERR	= 0x03,
+};
+
+enum {
 	NVME_AER_NOTICE_NS_CHANGED	= 0x00,
 	NVME_AER_NOTICE_FW_ACT_STARTING = 0x01,
 	NVME_AER_NOTICE_ANA		= 0x03,
-- 
1.8.3.1
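
The decoding that the two new helpers perform can be tried outside the kernel. The following is a minimal user-space sketch, not the driver code: the `demo_` names and the `demo_aer_needs_reset()` predicate are hypothetical and exist only for illustration. It assumes the layout used in the patch, with the event type in bits 2:0 and the subtype in bits 15:8 of the completion result, and it takes the "Error" type value (000b) and the Persistent Internal Error value (03h) from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants; the values follow the commit message above. */
enum {
	DEMO_AER_ERROR			= 0x0,	/* event type "Error" (000b) */
	DEMO_AER_ERROR_PERSIST_INT_ERR	= 0x03,	/* Persistent Internal Error */
};

/* Event type: bits 2:0 of the AER completion result. */
static uint32_t demo_aer_type(uint32_t result)
{
	return result & 0x7;
}

/* Event subtype: bits 15:8 of the AER completion result. */
static uint32_t demo_aer_subtype(uint32_t result)
{
	return (result & 0xff00) >> 8;
}

/*
 * True only when both fields match: type "Error" and subtype
 * "Persistent Internal Error". The subtype alone is not enough,
 * because other event types reuse the value 0x03.
 */
static bool demo_aer_needs_reset(uint32_t result)
{
	return demo_aer_type(result) == DEMO_AER_ERROR &&
	       demo_aer_subtype(result) == DEMO_AER_ERROR_PERSIST_INT_ERR;
}
```

For example, a result of 0x00000300 decodes to type 0 and subtype 3, so the predicate is true. A result of 0x00000302 has the same subtype but type 2, so it is false, which is why the patch checks the subtype only inside the NVME_AER_ERROR case.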



* Re: [PATCH v4 1/1] nvme: handle persistent internal error AER from NVMe controller
  2022-06-08 18:52 [PATCH v4 1/1] nvme: handle persistent internal error AER from NVMe controller Michael Kelley
@ 2022-06-09  0:22 ` Chaitanya Kulkarni
  2022-06-09  0:28   ` Chaitanya Kulkarni
  2022-06-13 18:08 ` Christoph Hellwig
  1 sibling, 1 reply; 6+ messages in thread
From: Chaitanya Kulkarni @ 2022-06-09  0:22 UTC (permalink / raw)
  To: Michael Kelley, kbusch, axboe, hch, sagi, linux-nvme, linux-kernel
  Cc: caroline.subramoney, riwurd, nathan.obr

On 6/8/22 11:52, Michael Kelley wrote:
> In the NVM Express Revision 1.4 spec, Figure 145 describes possible
> values for an AER with event type "Error" (value 000b). For a
> Persistent Internal Error (value 03h), the host should perform a
> controller reset.
> 
> Add support for this error using code that already exists for
> doing a controller reset. As part of this support, introduce
> two utility functions for parsing the AER type and subtype.
> 
> This new support was tested in a lab environment where we can
> generate the persistent internal error on demand, and observe
> both the Linux side and NVMe controller side to see that the
> controller reset has been done.
> 
> Signed-off-by: Michael Kelley <mikelley@microsoft.com>
> ---


Looks good. Thanks a lot for testing this. Please consider
writing a test case for it in blktests under the nvme
category, so that it gets tested by everyone else.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck




* Re: [PATCH v4 1/1] nvme: handle persistent internal error AER from NVMe controller
  2022-06-09  0:22 ` Chaitanya Kulkarni
@ 2022-06-09  0:28   ` Chaitanya Kulkarni
  2022-06-09  1:30     ` Michael Kelley (LINUX)
  0 siblings, 1 reply; 6+ messages in thread
From: Chaitanya Kulkarni @ 2022-06-09  0:28 UTC (permalink / raw)
  To: Michael Kelley
  Cc: caroline.subramoney, linux-nvme, linux-kernel, axboe, riwurd,
	nathan.obr, sagi, kbusch, hch

On 6/8/22 17:22, Chaitanya Kulkarni wrote:
> On 6/8/22 11:52, Michael Kelley wrote:
>> In the NVM Express Revision 1.4 spec, Figure 145 describes possible
>> values for an AER with event type "Error" (value 000b). For a
>> Persistent Internal Error (value 03h), the host should perform a
>> controller reset.
>>
>> Add support for this error using code that already exists for
>> doing a controller reset. As part of this support, introduce
>> two utility functions for parsing the AER type and subtype.
>>
>> This new support was tested in a lab environment where we can
>> generate the persistent internal error on demand, and observe
>> both the Linux side and NVMe controller side to see that the
>> controller reset has been done.
>>
>>

Can you please clarify which transports you have tested,
such as RDMA, TCP, or PCIe?

-ck




* RE: [PATCH v4 1/1] nvme: handle persistent internal error AER from NVMe controller
  2022-06-09  0:28   ` Chaitanya Kulkarni
@ 2022-06-09  1:30     ` Michael Kelley (LINUX)
  2022-06-09  5:06       ` Chaitanya Kulkarni
  0 siblings, 1 reply; 6+ messages in thread
From: Michael Kelley (LINUX) @ 2022-06-09  1:30 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: Caroline Subramoney, linux-nvme, linux-kernel, axboe,
	Richard Wurdack, Nathan Obr, sagi, kbusch, hch

From: Chaitanya Kulkarni <chaitanyak@nvidia.com>
> 
> On 6/8/22 17:22, Chaitanya Kulkarni wrote:
> > On 6/8/22 11:52, Michael Kelley wrote:
> >> In the NVM Express Revision 1.4 spec, Figure 145 describes possible
> >> values for an AER with event type "Error" (value 000b). For a
> >> Persistent Internal Error (value 03h), the host should perform a
> >> controller reset.
> >>
> >> Add support for this error using code that already exists for
> >> doing a controller reset. As part of this support, introduce
> >> two utility functions for parsing the AER type and subtype.
> >>
> >> This new support was tested in a lab environment where we can
> >> generate the persistent internal error on demand, and observe
> >> both the Linux side and NVMe controller side to see that the
> >> controller reset has been done.
> >>
> >>
> 
> Can you please clarify which transports you have tested,
> such as RDMA, TCP, or PCIe?
> 

I've tested PCIe only -- that's all I have access to.  I can tweak
the commit message to be more specific.

Michael


* Re: [PATCH v4 1/1] nvme: handle persistent internal error AER from NVMe controller
  2022-06-09  1:30     ` Michael Kelley (LINUX)
@ 2022-06-09  5:06       ` Chaitanya Kulkarni
  0 siblings, 0 replies; 6+ messages in thread
From: Chaitanya Kulkarni @ 2022-06-09  5:06 UTC (permalink / raw)
  To: Michael Kelley (LINUX)
  Cc: Caroline Subramoney, linux-nvme, linux-kernel, axboe,
	Richard Wurdack, Nathan Obr, sagi, kbusch, hch

On 6/8/2022 6:30 PM, Michael Kelley (LINUX) wrote:
> From: Chaitanya Kulkarni <chaitanyak@nvidia.com>
>>
>> On 6/8/22 17:22, Chaitanya Kulkarni wrote:
>>> On 6/8/22 11:52, Michael Kelley wrote:
>>>> In the NVM Express Revision 1.4 spec, Figure 145 describes possible
>>>> values for an AER with event type "Error" (value 000b). For a
>>>> Persistent Internal Error (value 03h), the host should perform a
>>>> controller reset.
>>>>
>>>> Add support for this error using code that already exists for
>>>> doing a controller reset. As part of this support, introduce
>>>> two utility functions for parsing the AER type and subtype.
>>>>
>>>> This new support was tested in a lab environment where we can
>>>> generate the persistent internal error on demand, and observe
>>>> both the Linux side and NVMe controller side to see that the
>>>> controller reset has been done.
>>>>
>>>>
>>
>> Can you please clarify which transports you have tested,
>> such as RDMA, TCP, or PCIe?
>>
> 
> I've tested PCIe only -- that's all I have access to.  I can tweak
> the commit message to be more specific.
> 
> Michael

It's okay, we have it documented now. Thanks again.

-ck




* Re: [PATCH v4 1/1] nvme: handle persistent internal error AER from NVMe controller
  2022-06-08 18:52 [PATCH v4 1/1] nvme: handle persistent internal error AER from NVMe controller Michael Kelley
  2022-06-09  0:22 ` Chaitanya Kulkarni
@ 2022-06-13 18:08 ` Christoph Hellwig
  1 sibling, 0 replies; 6+ messages in thread
From: Christoph Hellwig @ 2022-06-13 18:08 UTC (permalink / raw)
  To: Michael Kelley
  Cc: kbusch, axboe, hch, sagi, linux-nvme, linux-kernel,
	caroline.subramoney, riwurd, nathan.obr

Thanks,

applied to nvme-5.20.

