[bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests

All of lore.kernel.org
 help / color / mirror / Atom feed

* [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests
@ 2022-04-03 14:28 Yi Zhang
  2022-04-04 16:39 ` Alan Adamson
  2022-04-05 20:00 ` Jason A. Donenfeld
  0 siblings, 2 replies; 31+ messages in thread
From: Yi Zhang @ 2022-04-03 14:28 UTC (permalink / raw)
  To: open list:NVM EXPRESS DRIVER; +Cc: Sagi Grimberg, alan.adamson

Hello

I found this error log during blktests[1] with the latest
linux-block/for-next, seems it was introduced after commit[2], is that
expected?

[1]
[   96.931911] run blktests nvme/004 at 2022-04-02 20:53:16
[   97.024693] loop: module loaded
[   97.029319] loop0: detected capacity change from 0 to 2097152
[   97.039641] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[   97.068847] nvmet: creating nvm controller 1 for subsystem
blktests-subsystem-1 for NQN
nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0035-4b10-8044-b9c04f463333.
[   97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR
[   97.090108] nvme nvme0: creating 32 I/O queues.
[   97.098283] nvme nvme0: new ctrl: "blktests-subsystem-1"
[   98.136900] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"

[2]
commit bd83fe6f2cd2133beaac7c423fd36c3515048fc8
Author: Alan Adamson <alan.adamson@oracle.com>
Date:   Thu Feb 3 00:11:53 2022 -0800

    nvme: add verbose error logging

-- 
Best Regards,
  Yi Zhang



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests
  2022-04-03 14:28 [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests Yi Zhang
@ 2022-04-04 16:39 ` Alan Adamson
  2022-04-04 17:02   ` Keith Busch
  2022-04-05 20:00 ` Jason A. Donenfeld
  1 sibling, 1 reply; 31+ messages in thread
From: Alan Adamson @ 2022-04-04 16:39 UTC (permalink / raw)
  To: Yi Zhang; +Cc: open list:NVM EXPRESS DRIVER, Sagi Grimberg


[   97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR

An error from the device (status=2/Invalid Field) when an Identify (0x6) command was issued. Prior to the patch,
the nvme driver didn’t display the error.

Alan


> On Apr 3, 2022, at 7:28 AM, Yi Zhang <yi.zhang@redhat.com> wrote:
> 
> Hello
> 
> I found this error log during blktests[1] with the latest
> linux-block/for-next, seems it was introduced after commit[2], is that
> expected?
> 
> [1]
> [   96.931911] run blktests nvme/004 at 2022-04-02 20:53:16
> [   97.024693] loop: module loaded
> [   97.029319] loop0: detected capacity change from 0 to 2097152
> [   97.039641] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> [   97.068847] nvmet: creating nvm controller 1 for subsystem
> blktests-subsystem-1 for NQN
> nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0035-4b10-8044-b9c04f463333.
> [   97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR
> [   97.090108] nvme nvme0: creating 32 I/O queues.
> [   97.098283] nvme nvme0: new ctrl: "blktests-subsystem-1"
> [   98.136900] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
> 
> [2]
> commit bd83fe6f2cd2133beaac7c423fd36c3515048fc8
> Author: Alan Adamson <alan.adamson@oracle.com>
> Date:   Thu Feb 3 00:11:53 2022 -0800
> 
>    nvme: add verbose error logging
> 
> -- 
> Best Regards,
>  Yi Zhang
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests
  2022-04-04 16:39 ` Alan Adamson
@ 2022-04-04 17:02   ` Keith Busch
  2022-04-04 19:34     ` Jonathan Derrick
  0 siblings, 1 reply; 31+ messages in thread
From: Keith Busch @ 2022-04-04 17:02 UTC (permalink / raw)
  To: Alan Adamson; +Cc: Yi Zhang, open list:NVM EXPRESS DRIVER, Sagi Grimberg

On Mon, Apr 04, 2022 at 04:39:06PM +0000, Alan Adamson wrote:
> 
> [   97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR
> 
> An error from the device (status=2/Invalid Field) when an Identify (0x6) command was issued. Prior to the patch,
> the nvme driver didn’t display the error.
 
And it's harmless. The driver is querying an optional identification and it's
okay if the device doesn't support it, but the driver doesn't know if the
device supports until it tries it.
 
> > On Apr 3, 2022, at 7:28 AM, Yi Zhang <yi.zhang@redhat.com> wrote:
> > 
> > Hello
> > 
> > I found this error log during blktests[1] with the latest
> > linux-block/for-next, seems it was introduced after commit[2], is that
> > expected?
> > 
> > [1]
> > [   96.931911] run blktests nvme/004 at 2022-04-02 20:53:16
> > [   97.024693] loop: module loaded
> > [   97.029319] loop0: detected capacity change from 0 to 2097152
> > [   97.039641] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> > [   97.068847] nvmet: creating nvm controller 1 for subsystem
> > blktests-subsystem-1 for NQN
> > nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0035-4b10-8044-b9c04f463333.
> > [   97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR
> > [   97.090108] nvme nvme0: creating 32 I/O queues.
> > [   97.098283] nvme nvme0: new ctrl: "blktests-subsystem-1"
> > [   98.136900] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
> > 
> > [2]
> > commit bd83fe6f2cd2133beaac7c423fd36c3515048fc8
> > Author: Alan Adamson <alan.adamson@oracle.com>
> > Date:   Thu Feb 3 00:11:53 2022 -0800
> > 
> >    nvme: add verbose error logging
> > 
> > -- 
> > Best Regards,
> >  Yi Zhang
> > 
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests
  2022-04-04 17:02   ` Keith Busch
@ 2022-04-04 19:34     ` Jonathan Derrick
  2022-04-04 20:30       ` Keith Busch
  0 siblings, 1 reply; 31+ messages in thread
From: Jonathan Derrick @ 2022-04-04 19:34 UTC (permalink / raw)
  To: linux-nvme



On 4/4/2022 11:02 AM, Keith Busch wrote:
> On Mon, Apr 04, 2022 at 04:39:06PM +0000, Alan Adamson wrote:
>>
>> [   97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR
>>
>> An error from the device (status=2/Invalid Field) when an Identify (0x6) command was issued. Prior to the patch,
>> the nvme driver didn’t display the error.
>   
> And it's harmless. The driver is querying an optional identification and it's
> okay if the device doesn't support it, but the driver doesn't know if the
> device supports until it tries it.
I've had to field bug reports like this before, where it says error in 
dmesg but is actually harmless.

Maybe we could suppress the error and add a harmless print?
Eg, nvme0: blah blah command set not supported


>   
>>> On Apr 3, 2022, at 7:28 AM, Yi Zhang <yi.zhang@redhat.com> wrote:
>>>
>>> Hello
>>>
>>> I found this error log during blktests[1] with the latest
>>> linux-block/for-next, seems it was introduced after commit[2], is that
>>> expected?
>>>
>>> [1]
>>> [   96.931911] run blktests nvme/004 at 2022-04-02 20:53:16
>>> [   97.024693] loop: module loaded
>>> [   97.029319] loop0: detected capacity change from 0 to 2097152
>>> [   97.039641] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> [   97.068847] nvmet: creating nvm controller 1 for subsystem
>>> blktests-subsystem-1 for NQN
>>> nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0035-4b10-8044-b9c04f463333.
>>> [   97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR
>>> [   97.090108] nvme nvme0: creating 32 I/O queues.
>>> [   97.098283] nvme nvme0: new ctrl: "blktests-subsystem-1"
>>> [   98.136900] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
>>>
>>> [2]
>>> commit bd83fe6f2cd2133beaac7c423fd36c3515048fc8
>>> Author: Alan Adamson <alan.adamson@oracle.com>
>>> Date:   Thu Feb 3 00:11:53 2022 -0800
>>>
>>>     nvme: add verbose error logging
>>>
>>> -- 
>>> Best Regards,
>>>   Yi Zhang
>>>
>>
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests
  2022-04-04 19:34     ` Jonathan Derrick
@ 2022-04-04 20:30       ` Keith Busch
  2022-04-05  0:43         ` Jonathan Derrick
  2022-04-05  6:14         ` Christoph Hellwig
  0 siblings, 2 replies; 31+ messages in thread
From: Keith Busch @ 2022-04-04 20:30 UTC (permalink / raw)
  To: Jonathan Derrick; +Cc: linux-nvme

On Mon, Apr 04, 2022 at 01:34:38PM -0600, Jonathan Derrick wrote:
> 
> 
> On 4/4/2022 11:02 AM, Keith Busch wrote:
> > On Mon, Apr 04, 2022 at 04:39:06PM +0000, Alan Adamson wrote:
> > > 
> > > [   97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR
> > > 
> > > An error from the device (status=2/Invalid Field) when an Identify (0x6) command was issued. Prior to the patch,
> > > the nvme driver didn’t display the error.
> > And it's harmless. The driver is querying an optional identification and it's
> > okay if the device doesn't support it, but the driver doesn't know if the
> > device supports until it tries it.
> I've had to field bug reports like this before, where it says error in dmesg
> but is actually harmless.

Even before the kernel message was added, some implementations trigger error
log events, which alerts smartd. :(
 
> Maybe we could suppress the error and add a harmless print?
> Eg, nvme0: blah blah command set not supported

The new print in the completion handler is pretty generic. I don't think it can
readily tell the difference from a harmless error. Maybe pr_err is too high?

Or maybe since enough people have been concerned about *this* specific
identify, maybe it should be restricted to 2.0 devices where it's mandatory. I
was reluctant to do that at first since the initial device I tested was 1.4,
but it was a prototype and we should be fine without the non-mdts limits
anyway.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests
  2022-04-04 20:30       ` Keith Busch
@ 2022-04-05  0:43         ` Jonathan Derrick
  2022-04-06 17:34           ` Keith Busch
  2022-04-05  6:14         ` Christoph Hellwig
  1 sibling, 1 reply; 31+ messages in thread
From: Jonathan Derrick @ 2022-04-05  0:43 UTC (permalink / raw)
  To: Keith Busch; +Cc: linux-nvme



On 4/4/2022 2:30 PM, Keith Busch wrote:
> On Mon, Apr 04, 2022 at 01:34:38PM -0600, Jonathan Derrick wrote:
>>
>>
>> On 4/4/2022 11:02 AM, Keith Busch wrote:
>>> On Mon, Apr 04, 2022 at 04:39:06PM +0000, Alan Adamson wrote:
>>>>
>>>> [   97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR
>>>>
>>>> An error from the device (status=2/Invalid Field) when an Identify (0x6) command was issued. Prior to the patch,
>>>> the nvme driver didn’t display the error.
>>> And it's harmless. The driver is querying an optional identification and it's
>>> okay if the device doesn't support it, but the driver doesn't know if the
>>> device supports until it tries it.
>> I've had to field bug reports like this before, where it says error in dmesg
>> but is actually harmless.
> 
> Even before the kernel message was added, some implementations trigger error
> log events, which alerts smartd. :(
I haven't had too many smartd reports come my way, but 'error' in dmesg 
is a typical one. Usually due a language barrier issue or uneducated AE. 
But I suspect a lay-user without an AE might resort to kernel bz due to 
things like this.

>   
>> Maybe we could suppress the error and add a harmless print?
>> Eg, nvme0: blah blah command set not supported
> 
> The new print in the completion handler is pretty generic. I don't think it can
> readily tell the difference from a harmless error. Maybe pr_err is too high?
Could we pack RQF_FAILED in the req flags and reduce the severity in 
nvme_log_error?

> 
> Or maybe since enough people have been concerned about *this* specific
> identify, maybe it should be restricted to 2.0 devices where it's mandatory. I
> was reluctant to do that at first since the initial device I tested was 1.4,
> but it was a prototype and we should be fine without the non-mdts limits
> anyway.
Well I'm not sure about that. I'm not honestly sure what this specific 
identify is, but (I know you know) NVMe being forward compatible allows 
eg, 1.4 compliant devices/targets to support 2.0 features.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests
  2022-04-04 20:30       ` Keith Busch
  2022-04-05  0:43         ` Jonathan Derrick
@ 2022-04-05  6:14         ` Christoph Hellwig
  2022-04-05 18:21           ` Jonathan Derrick
  2022-04-05 19:09           ` Alan Adamson
  1 sibling, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2022-04-05  6:14 UTC (permalink / raw)
  To: Keith Busch; +Cc: Jonathan Derrick, linux-nvme

On Mon, Apr 04, 2022 at 02:30:12PM -0600, Keith Busch wrote:
> > Eg, nvme0: blah blah command set not supported
> 
> The new print in the completion handler is pretty generic. I don't think it can
> readily tell the difference from a harmless error. Maybe pr_err is too high?
> 
> Or maybe since enough people have been concerned about *this* specific
> identify, maybe it should be restricted to 2.0 devices where it's mandatory. I
> was reluctant to do that at first since the initial device I tested was 1.4,
> but it was a prototype and we should be fine without the non-mdts limits
> anyway.

What SCSI does is to add RQF_QUIET to all internal passthrough commands,
and then skips printing the SCSI specific error messages in addition
if that flag is set.

This would be the nvme version of that:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7e07dd69262a7..9346cd4cf5820 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -366,7 +366,8 @@ static inline void nvme_end_req(struct request *req)
 {
 	blk_status_t status = nvme_error_status(nvme_req(req)->status);
 
-	if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS))
+	if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS &&
+		     !(req->rq_flags & RQF_QUIET)))
 		nvme_log_error(req);
 	nvme_end_req_zoned(req);
 	nvme_trace_bio_complete(req);
@@ -648,6 +649,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
 	cmd->common.flags &= ~NVME_CMD_SGL_ALL;
 
 	req->cmd_flags |= REQ_FAILFAST_DRIVER;
+	req->rq_flags |= RQF_QUIET;
 	if (req->mq_hctx->type == HCTX_TYPE_POLL)
 		req->cmd_flags |= REQ_POLLED;
 	nvme_clear_nvme_request(req);


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests
  2022-04-05  6:14         ` Christoph Hellwig
@ 2022-04-05 18:21           ` Jonathan Derrick
  2022-04-05 20:51             ` Alan Adamson
  2022-04-05 19:09           ` Alan Adamson
  1 sibling, 1 reply; 31+ messages in thread
From: Jonathan Derrick @ 2022-04-05 18:21 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch; +Cc: linux-nvme



On 4/5/2022 12:14 AM, Christoph Hellwig wrote:
> On Mon, Apr 04, 2022 at 02:30:12PM -0600, Keith Busch wrote:
>>> Eg, nvme0: blah blah command set not supported
>>
>> The new print in the completion handler is pretty generic. I don't think it can
>> readily tell the difference from a harmless error. Maybe pr_err is too high?
>>
>> Or maybe since enough people have been concerned about *this* specific
>> identify, maybe it should be restricted to 2.0 devices where it's mandatory. I
>> was reluctant to do that at first since the initial device I tested was 1.4,
>> but it was a prototype and we should be fine without the non-mdts limits
>> anyway.
> 
> What SCSI does is to add RQF_QUIET to all internal passthrough commands,
> and then skips printing the SCSI specific error messages in addition
> if that flag is set.
> 
> This would be the nvme version of that:
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 7e07dd69262a7..9346cd4cf5820 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -366,7 +366,8 @@ static inline void nvme_end_req(struct request *req)
>  {
>  	blk_status_t status = nvme_error_status(nvme_req(req)->status);
>  
> -	if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS))
> +	if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS &&
> +		     !(req->rq_flags & RQF_QUIET)))
>  		nvme_log_error(req);
>  	nvme_end_req_zoned(req);
>  	nvme_trace_bio_complete(req);
> @@ -648,6 +649,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
>  	cmd->common.flags &= ~NVME_CMD_SGL_ALL;
>  
>  	req->cmd_flags |= REQ_FAILFAST_DRIVER;
> +	req->rq_flags |= RQF_QUIET;
>  	if (req->mq_hctx->type == HCTX_TYPE_POLL)
>  		req->cmd_flags |= REQ_POLLED;
>  	nvme_clear_nvme_request(req);


That's good too.
How about this so it's limited to debug loglevels:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f204c6f78b5b..871ad2421284 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -303,9 +303,10 @@ static void nvme_log_error(struct request *req)
 {
        struct nvme_ns *ns = req->q->queuedata;
        struct nvme_request *nr = nvme_req(req);
+       int level = req->rq_flags & RQF_QUIET ? KERN_DEBUG : KERN_ERR;

        if (ns) {
-               pr_err_ratelimited("%s: %s(0x%x) @ LBA %llu, %llu blocks, %s (sct 0x%x / sc 0x%x) %s%s\n",
+               printk_ratelimited(level "%s: %s(0x%x) @ LBA %llu, %llu blocks, %s (sct 0x%x / sc 0x%x) %s%s\n",
                       ns->disk ? ns->disk->disk_name : "?",
                       nvme_get_opcode_str(nr->cmd->common.opcode),
                       nr->cmd->common.opcode,
@@ -319,7 +320,7 @@ static void nvme_log_error(struct request *req)
                return;
        }

-       pr_err_ratelimited("%s: %s(0x%x), %s (sct 0x%x / sc 0x%x) %s%s\n",
+       printk_ratelimited(level "%s: %s(0x%x), %s (sct 0x%x / sc 0x%x) %s%s\n",
                           dev_name(nr->ctrl->device),
                           nvme_get_admin_opcode_str(nr->cmd->common.opcode),
                           nr->cmd->common.opcode,
@@ -651,6 +652,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
        cmd->common.flags &= ~NVME_CMD_SGL_ALL;

        req->cmd_flags |= REQ_FAILFAST_DRIVER;
+       req->rq_flags |= RQF_QUIET;
        if (req->mq_hctx->type == HCTX_TYPE_POLL)
                req->cmd_flags |= REQ_POLLED;
        nvme_clear_nvme_request(req);


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests
  2022-04-05  6:14         ` Christoph Hellwig
  2022-04-05 18:21           ` Jonathan Derrick
@ 2022-04-05 19:09           ` Alan Adamson
  1 sibling, 0 replies; 31+ messages in thread
From: Alan Adamson @ 2022-04-05 19:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Keith Busch, Jonathan Derrick, linux-nvme



> On Apr 4, 2022, at 11:14 PM, Christoph Hellwig <hch@infradead.org> wrote:
> 
> On Mon, Apr 04, 2022 at 02:30:12PM -0600, Keith Busch wrote:
>>> Eg, nvme0: blah blah command set not supported
>> 
>> The new print in the completion handler is pretty generic. I don't think it can
>> readily tell the difference from a harmless error. Maybe pr_err is too high?
>> 
>> Or maybe since enough people have been concerned about *this* specific
>> identify, maybe it should be restricted to 2.0 devices where it's mandatory. I
>> was reluctant to do that at first since the initial device I tested was 1.4,
>> but it was a prototype and we should be fine without the non-mdts limits
>> anyway.
> 
> What SCSI does is to add RQF_QUIET to all internal passthrough commands,
> and then skips printing the SCSI specific error messages in addition
> if that flag is set.
> 
> This would be the nvme version of that:
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 7e07dd69262a7..9346cd4cf5820 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -366,7 +366,8 @@ static inline void nvme_end_req(struct request *req)
> {
> 	blk_status_t status = nvme_error_status(nvme_req(req)->status);
> 
> -	if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS))
> +	if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS &&
> +		     !(req->rq_flags & RQF_QUIET)))
> 		nvme_log_error(req);
> 	nvme_end_req_zoned(req);
> 	nvme_trace_bio_complete(req);
> @@ -648,6 +649,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
> 	cmd->common.flags &= ~NVME_CMD_SGL_ALL;
> 
> 	req->cmd_flags |= REQ_FAILFAST_DRIVER;
> +	req->rq_flags |= RQF_QUIET;
> 	if (req->mq_hctx->type == HCTX_TYPE_POLL)
> 		req->cmd_flags |= REQ_POLLED;
> 	nvme_clear_nvme_request(req);
> 

I tested the change with an instrumented qemu emulated device.

Without the change:

# dmesg | grep -i nvme
[    3.610944] nvme nvme0: pci function 0000:00:04.0
[    3.685467] nvme0: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) DNR
[    3.687850] nvme nvme0: 1/0/0 default/read/poll queues


With the change:

# dmesg | grep -i nvme
[    3.437014] nvme nvme0: pci function 0000:00:04.0
[    3.506572] nvme nvme0: 1/0/0 default/read/poll queues


Would you like me to submit the patch?

Alan






^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests
  2022-04-03 14:28 [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests Yi Zhang
  2022-04-04 16:39 ` Alan Adamson
@ 2022-04-05 20:00 ` Jason A. Donenfeld
  2022-04-05 20:48   ` Alan Adamson
  2022-06-09  8:20   ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] Jason A. Donenfeld
  1 sibling, 2 replies; 31+ messages in thread
From: Jason A. Donenfeld @ 2022-04-05 20:00 UTC (permalink / raw)
  To: Yi Zhang, alan.adamson
  Cc: open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, linux-kernel

Hi Alan,

I too am seeing this. Tracking it down to the same commit, I decided to
enable NVME_VERBOSE_ERRORS to get some more information. Now on boot and
everytime I wake up from sleep, I see:

[   89.098578] nvme nvme0: Shutdown timeout set to 8 seconds
[   89.098683] nvme0: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE
[   89.119363] nvme nvme0: 16/0/0 default/read/poll queues

With that middle line in red.

Question is: is this actually an error? If not, maybe it shouldn't be
printed as a KERN_ERR. And if it's printed as a KERN_INFO, maybe it
should only do so when CONFIG_NVME_VERBOSE_ERRORS=y? Or do you think
there is actually some other diagnostic value in having this print
always?

Using a Samsung SSD 970 EVO Plus 2TB, firmware version 2B2QEXM7, in case
that's useful info.

I also noticed a ~2 second boot delay on 5.18-rc1:

[    0.917631] pstore: Using crash dump compression: deflate
[    0.917807] Key type encrypted registered
[    0.951840] ACPI: battery: Slot [BAT0] (battery present)
[    3.146765] nvme nvme0: Shutdown timeout set to 8 seconds
[    3.146918] nvme0: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE
[    3.188852] nvme nvme0: 16/0/0 default/read/poll queues
[    3.198163]  nvme0n1: p1 p2
[    3.199554] Freeing unused kernel image (initmem) memory: 12952K

I haven't looked into it much, but I assume it's also NVMe related? Or
maybe the vconsole is just initializing faster so I see text where
before I didn't. Not sure.

Regards,
Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests
  2022-04-05 20:00 ` Jason A. Donenfeld
@ 2022-04-05 20:48   ` Alan Adamson
  2022-06-09  8:20   ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] Jason A. Donenfeld
  1 sibling, 0 replies; 31+ messages in thread
From: Alan Adamson @ 2022-04-05 20:48 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Yi Zhang, open list:NVM EXPRESS DRIVER, Sagi Grimberg, linux-kernel

Hi Jason,

It’s a harmless error, but we are looking at suppressing it.

Alan

> On Apr 5, 2022, at 1:00 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> 
> Hi Alan,
> 
> I too am seeing this. Tracking it down to the same commit, I decided to
> enable NVME_VERBOSE_ERRORS to get some more information. Now on boot and
> everytime I wake up from sleep, I see:
> 
> [   89.098578] nvme nvme0: Shutdown timeout set to 8 seconds
> [   89.098683] nvme0: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE
> [   89.119363] nvme nvme0: 16/0/0 default/read/poll queues
> 
> With that middle line in red.
> 
> Question is: is this actually an error? If not, maybe it shouldn't be
> printed as a KERN_ERR. And if it's printed as a KERN_INFO, maybe it
> should only do so when CONFIG_NVME_VERBOSE_ERRORS=y? Or do you think
> there is actually some other diagnostic value in having this print
> always?
> 
> Using a Samsung SSD 970 EVO Plus 2TB, firmware version 2B2QEXM7, in case
> that's useful info.
> 
> I also noticed a ~2 second boot delay on 5.18-rc1:
> 
> [    0.917631] pstore: Using crash dump compression: deflate
> [    0.917807] Key type encrypted registered
> [    0.951840] ACPI: battery: Slot [BAT0] (battery present)
> [    3.146765] nvme nvme0: Shutdown timeout set to 8 seconds
> [    3.146918] nvme0: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE
> [    3.188852] nvme nvme0: 16/0/0 default/read/poll queues
> [    3.198163]  nvme0n1: p1 p2
> [    3.199554] Freeing unused kernel image (initmem) memory: 12952K
> 
> I haven't looked into it much, but I assume it's also NVMe related? Or
> maybe the vconsole is just initializing faster so I see text where
> before I didn't. Not sure.
> 
> Regards,
> Jason


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests
  2022-04-05 18:21           ` Jonathan Derrick
@ 2022-04-05 20:51             ` Alan Adamson
  2022-04-05 21:56               ` Jonathan Derrick
  0 siblings, 1 reply; 31+ messages in thread
From: Alan Adamson @ 2022-04-05 20:51 UTC (permalink / raw)
  To: Jonathan Derrick; +Cc: Christoph Hellwig, Keith Busch, linux-nvme



> On Apr 5, 2022, at 11:21 AM, Jonathan Derrick <jonathan.derrick@linux.dev> wrote:
> 
> 
> 
> On 4/5/2022 12:14 AM, Christoph Hellwig wrote:
>> On Mon, Apr 04, 2022 at 02:30:12PM -0600, Keith Busch wrote:
>>>> Eg, nvme0: blah blah command set not supported
>>> 
>>> The new print in the completion handler is pretty generic. I don't think it can
>>> readily tell the difference from a harmless error. Maybe pr_err is too high?
>>> 
>>> Or maybe since enough people have been concerned about *this* specific
>>> identify, maybe it should be restricted to 2.0 devices where it's mandatory. I
>>> was reluctant to do that at first since the initial device I tested was 1.4,
>>> but it was a prototype and we should be fine without the non-mdts limits
>>> anyway.
>> 
>> What SCSI does is to add RQF_QUIET to all internal passthrough commands,
>> and then skips printing the SCSI specific error messages in addition
>> if that flag is set.
>> 
>> This would be the nvme version of that:
>> 
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 7e07dd69262a7..9346cd4cf5820 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -366,7 +366,8 @@ static inline void nvme_end_req(struct request *req)
>> {
>> 	blk_status_t status = nvme_error_status(nvme_req(req)->status);
>> 
>> -	if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS))
>> +	if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS &&
>> +		     !(req->rq_flags & RQF_QUIET)))
>> 		nvme_log_error(req);
>> 	nvme_end_req_zoned(req);
>> 	nvme_trace_bio_complete(req);
>> @@ -648,6 +649,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
>> 	cmd->common.flags &= ~NVME_CMD_SGL_ALL;
>> 
>> 	req->cmd_flags |= REQ_FAILFAST_DRIVER;
>> +	req->rq_flags |= RQF_QUIET;
>> 	if (req->mq_hctx->type == HCTX_TYPE_POLL)
>> 		req->cmd_flags |= REQ_POLLED;
>> 	nvme_clear_nvme_request(req);
> 
> 
> That's good too.
> How about this so it's limited to debug loglevels:

I don’t think we want to limit it to debug loglevels.  The main purpose of the patch was to allow for debugging issues of live customer systems.

Alan


> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f204c6f78b5b..871ad2421284 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -303,9 +303,10 @@ static void nvme_log_error(struct request *req)
> {
>        struct nvme_ns *ns = req->q->queuedata;
>        struct nvme_request *nr = nvme_req(req);
> +       int level = req->rq_flags & RQF_QUIET ? KERN_DEBUG : KERN_ERR;
> 
>        if (ns) {
> -               pr_err_ratelimited("%s: %s(0x%x) @ LBA %llu, %llu blocks, %s (sct 0x%x / sc 0x%x) %s%s\n",
> +               printk_ratelimited(level "%s: %s(0x%x) @ LBA %llu, %llu blocks, %s (sct 0x%x / sc 0x%x) %s%s\n",
>                       ns->disk ? ns->disk->disk_name : "?",
>                       nvme_get_opcode_str(nr->cmd->common.opcode),
>                       nr->cmd->common.opcode,
> @@ -319,7 +320,7 @@ static void nvme_log_error(struct request *req)
>                return;
>        }
> 
> -       pr_err_ratelimited("%s: %s(0x%x), %s (sct 0x%x / sc 0x%x) %s%s\n",
> +       printk_ratelimited(level "%s: %s(0x%x), %s (sct 0x%x / sc 0x%x) %s%s\n",
>                           dev_name(nr->ctrl->device),
>                           nvme_get_admin_opcode_str(nr->cmd->common.opcode),
>                           nr->cmd->common.opcode,
> @@ -651,6 +652,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
>        cmd->common.flags &= ~NVME_CMD_SGL_ALL;
> 
>        req->cmd_flags |= REQ_FAILFAST_DRIVER;
> +       req->rq_flags |= RQF_QUIET;
>        if (req->mq_hctx->type == HCTX_TYPE_POLL)
>                req->cmd_flags |= REQ_POLLED;
>        nvme_clear_nvme_request(req);
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests
  2022-04-05 20:51             ` Alan Adamson
@ 2022-04-05 21:56               ` Jonathan Derrick
  0 siblings, 0 replies; 31+ messages in thread
From: Jonathan Derrick @ 2022-04-05 21:56 UTC (permalink / raw)
  To: Alan Adamson; +Cc: Christoph Hellwig, Keith Busch, linux-nvme



On 4/5/2022 2:51 PM, Alan Adamson wrote:
> 
> 
>> On Apr 5, 2022, at 11:21 AM, Jonathan Derrick <jonathan.derrick@linux.dev> wrote:
>>
>>
>>
>> On 4/5/2022 12:14 AM, Christoph Hellwig wrote:
>>> On Mon, Apr 04, 2022 at 02:30:12PM -0600, Keith Busch wrote:
>>>>> Eg, nvme0: blah blah command set not supported
>>>>
>>>> The new print in the completion handler is pretty generic. I don't think it can
>>>> readily tell the difference from a harmless error. Maybe pr_err is too high?
>>>>
>>>> Or maybe since enough people have been concerned about *this* specific
>>>> identify, maybe it should be restricted to 2.0 devices where it's mandatory. I
>>>> was reluctant to do that at first since the initial device I tested was 1.4,
>>>> but it was a prototype and we should be fine without the non-mdts limits
>>>> anyway.
>>>
>>> What SCSI does is to add RQF_QUIET to all internal passthrough commands,
>>> and then skips printing the SCSI specific error messages in addition
>>> if that flag is set.
>>>
>>> This would be the nvme version of that:
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 7e07dd69262a7..9346cd4cf5820 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -366,7 +366,8 @@ static inline void nvme_end_req(struct request *req)
>>> {
>>> 	blk_status_t status = nvme_error_status(nvme_req(req)->status);
>>>
>>> -	if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS))
>>> +	if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS &&
>>> +		     !(req->rq_flags & RQF_QUIET)))
>>> 		nvme_log_error(req);
>>> 	nvme_end_req_zoned(req);
>>> 	nvme_trace_bio_complete(req);
>>> @@ -648,6 +649,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
>>> 	cmd->common.flags &= ~NVME_CMD_SGL_ALL;
>>>
>>> 	req->cmd_flags |= REQ_FAILFAST_DRIVER;
>>> +	req->rq_flags |= RQF_QUIET;
>>> 	if (req->mq_hctx->type == HCTX_TYPE_POLL)
>>> 		req->cmd_flags |= REQ_POLLED;
>>> 	nvme_clear_nvme_request(req);
>>
>>
>> That's good too.
>> How about this so it's limited to debug loglevels:
> 
> I don’t think we want to limit it to debug loglevels.  The main purpose of the patch was to allow for debugging issues of live customer systems.
> 
> Alan

I guess it's your preference if you want that print to be accessible 
during debug or if you don't want that print at all.


> 
> 
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index f204c6f78b5b..871ad2421284 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -303,9 +303,10 @@ static void nvme_log_error(struct request *req)
>> {
>>         struct nvme_ns *ns = req->q->queuedata;
>>         struct nvme_request *nr = nvme_req(req);
>> +       int level = req->rq_flags & RQF_QUIET ? KERN_DEBUG : KERN_ERR;
>>
>>         if (ns) {
>> -               pr_err_ratelimited("%s: %s(0x%x) @ LBA %llu, %llu blocks, %s (sct 0x%x / sc 0x%x) %s%s\n",
>> +               printk_ratelimited(level "%s: %s(0x%x) @ LBA %llu, %llu blocks, %s (sct 0x%x / sc 0x%x) %s%s\n",
>>                        ns->disk ? ns->disk->disk_name : "?",
>>                        nvme_get_opcode_str(nr->cmd->common.opcode),
>>                        nr->cmd->common.opcode,
>> @@ -319,7 +320,7 @@ static void nvme_log_error(struct request *req)
>>                 return;
>>         }
>>
>> -       pr_err_ratelimited("%s: %s(0x%x), %s (sct 0x%x / sc 0x%x) %s%s\n",
>> +       printk_ratelimited(level "%s: %s(0x%x), %s (sct 0x%x / sc 0x%x) %s%s\n",
>>                            dev_name(nr->ctrl->device),
>>                            nvme_get_admin_opcode_str(nr->cmd->common.opcode),
>>                            nr->cmd->common.opcode,
>> @@ -651,6 +652,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd)
>>         cmd->common.flags &= ~NVME_CMD_SGL_ALL;
>>
>>         req->cmd_flags |= REQ_FAILFAST_DRIVER;
>> +       req->rq_flags |= RQF_QUIET;
>>         if (req->mq_hctx->type == HCTX_TYPE_POLL)
>>                 req->cmd_flags |= REQ_POLLED;
>>         nvme_clear_nvme_request(req);
>>
> 


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests
  2022-04-05  0:43         ` Jonathan Derrick
@ 2022-04-06 17:34           ` Keith Busch
  0 siblings, 0 replies; 31+ messages in thread
From: Keith Busch @ 2022-04-06 17:34 UTC (permalink / raw)
  To: Jonathan Derrick; +Cc: linux-nvme

On Mon, Apr 04, 2022 at 06:43:01PM -0600, Jonathan Derrick wrote:
> > Or maybe since enough people have been concerned about *this* specific
> > identify, maybe it should be restricted to 2.0 devices where it's mandatory. I
> > was reluctant to do that at first since the initial device I tested was 1.4,
> > but it was a prototype and we should be fine without the non-mdts limits
> > anyway.
> Well I'm not sure about that. I'm not honestly sure what this specific
> identify is, but (I know you know) NVMe being forward compatible allows eg,
> 1.4 compliant devices/targets to support 2.0 features.

The command triggering the error is the NVM CSI Identify Controller (CNS=6),
which provides the recommended limits for commands that don't transfer block
data (write zeroes, verify, dsm, etc..). The driver sets it up in
nvme_init_non_mdts_limits().

Here are two other user reports of this triggering smartd warnings:

 http://lists.infradead.org/pipermail/linux-nvme/2021-July/028009.html
 https://bugzilla.kernel.org/show_bug.cgi?id=215763

The identification only provides recommended limits, so it's not all that bad
if we skip checking it for <2.0 versions that don't require it.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-04-05 20:00 ` Jason A. Donenfeld
  2022-04-05 20:48   ` Alan Adamson
@ 2022-06-09  8:20   ` Jason A. Donenfeld
  2022-06-09  8:34     ` Jason A. Donenfeld
  1 sibling, 1 reply; 31+ messages in thread
From: Jason A. Donenfeld @ 2022-06-09  8:20 UTC (permalink / raw)
  To: Yi Zhang, alan.adamson; +Cc: open list:NVM EXPRESS DRIVER, Sagi Grimberg, LKML

Hi folks,

On Tue, Apr 5, 2022 at 10:00 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Using a Samsung SSD 970 EVO Plus 2TB, firmware version 2B2QEXM7, in case
> that's useful info.
>
> I also noticed a ~2 second boot delay on 5.18-rc1:

Just FYI, I am still seeing this delay in 5.19-rc1.

Boot lines from 5.17:

[    0.882680] nvme nvme1: missing or invalid SUBNQN field.
[    0.882719] nvme nvme1: Shutdown timeout set to 10 seconds
[    0.885227] nvme nvme1: 8/0/0 default/read/poll queues
[    0.887910]  nvme1n1: p1 p2 p3
[    0.888317] nvme nvme0: missing or invalid SUBNQN field.
[    0.888361] nvme nvme0: Shutdown timeout set to 8 seconds
[    0.906301] nvme nvme0: 16/0/0 default/read/poll queues
[    0.910087]  nvme0n1: p1 p2

Boot lines from 5.18 & 5.19:

[    0.846827] nvme nvme1: missing or invalid SUBNQN field.
[    0.846857] nvme nvme1: Shutdown timeout set to 10 seconds
[    0.849043] nvme nvme1: 8/0/0 default/read/poll queues
[    0.851595]  nvme1n1: p1 p2 p3
[    3.226962] nvme nvme0: Shutdown timeout set to 8 seconds
[    3.253890] nvme nvme0: 16/0/0 default/read/poll queues
[    3.263778]  nvme0n1: p1 p2

The Samsung 970 EVO Plus has a ~2 second delay that wasn't there in 5.17.

Any idea what's going on?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-09  8:20   ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] Jason A. Donenfeld
@ 2022-06-09  8:34     ` Jason A. Donenfeld
  2022-06-09  8:40       ` [PATCH] Revert "nvme-pci: add quirks for Samsung X5 SSDs" Jason A. Donenfeld
  2022-06-09  9:32       ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] R, Monish Kumar
  0 siblings, 2 replies; 31+ messages in thread
From: Jason A. Donenfeld @ 2022-06-09  8:34 UTC (permalink / raw)
  To: monish.kumar.r
  Cc: open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML,
	Yi Zhang, Keith Busch, axboe, Christoph Hellwig, abhijeet.rao

Hey again,

Figured it out. 2.3 seconds to be exact... It looks like this is caused by:

bc360b0b1611 ("nvme-pci: add quirks for Samsung X5 SSDs")
https://lore.kernel.org/all/20220316075449.18906-1-monish.kumar.r@intel.com/

This commit doesn't have any justification and got applied without
much discussion. Perhaps Monish could supply some more info about why
this is needed here? FTR, I have no issues on my system when reverting
that. Perhaps it should be reverted. (I can send a revert commit for
that if necessary.)

Looking further, however, the PCIe ID is said to be for a "Samsung
X5", which Google says is a portable thunderbolt drive. Is the PCIe ID
correct? On my system, this is the PCIe ID of a Samsung 970 EVO Plus.
Is it possible that Monish copied and pasted the wrong PCIe ID? Or has
Samsung *reused* the same PCIe ID on both devices? In which case, we'd
need some additional data for that quirk to avoid the delay.

Also note that this (potentially errant) commit has been backported to stable.

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH] Revert "nvme-pci: add quirks for Samsung X5 SSDs"
  2022-06-09  8:34     ` Jason A. Donenfeld
@ 2022-06-09  8:40       ` Jason A. Donenfeld
  2022-06-09  9:32       ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] R, Monish Kumar
  1 sibling, 0 replies; 31+ messages in thread
From: Jason A. Donenfeld @ 2022-06-09  8:40 UTC (permalink / raw)
  To: open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML,
	Yi Zhang, Keith Busch, axboe, Christoph Hellwig, abhijeet.rao,
	monish.kumar.r
  Cc: Jason A. Donenfeld, stable

This reverts commit bc360b0b1611566e1bd47384daf49af6a1c51837.

This matches the hardware identifier of the Samsung 970 EVO Plus, a very
popular internal laptop NVMe drive, which is not the Samsung X5, an
external thunderbolt drive. In particular, this causes:

1) a 2.3 second boot time delay; and
2) disabling of deep power saving states.

So just revert this until whatever funny business can be worked out
regarding Samsung's PCI IDs.

Fixes: bc360b0b1611 ("nvme-pci: add quirks for Samsung X5 SSDs")
Cc: stable@vger.kernel.org
Cc: Monish Kumar R <monish.kumar.r@intel.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 drivers/nvme/host/pci.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 48f4f6eb877b..47b9e3e0ea5a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3483,10 +3483,7 @@ static const struct pci_device_id nvme_id_table[] = {
 				NVME_QUIRK_128_BYTES_SQES |
 				NVME_QUIRK_SHARED_TAGS |
 				NVME_QUIRK_SKIP_CID_GEN },
-	{ PCI_DEVICE(0x144d, 0xa808),   /* Samsung X5 */
-		.driver_data =  NVME_QUIRK_DELAY_BEFORE_CHK_RDY|
-				NVME_QUIRK_NO_DEEPEST_PS |
-				NVME_QUIRK_IGNORE_DEV_SUBNQN, },
+
 	{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) },
 	{ 0, }
 };
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* RE: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-09  8:34     ` Jason A. Donenfeld
  2022-06-09  8:40       ` [PATCH] Revert "nvme-pci: add quirks for Samsung X5 SSDs" Jason A. Donenfeld
@ 2022-06-09  9:32       ` R, Monish Kumar
  2022-06-09  9:38         ` Jason A. Donenfeld
  1 sibling, 1 reply; 31+ messages in thread
From: R, Monish Kumar @ 2022-06-09  9:32 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML,
	Yi Zhang, Keith Busch, axboe, Christoph Hellwig, Rao, Abhijeet

Hi Jason,

I would like to provide justification for this Samsung X5 SSD fix added.
We were facing SSD enumeration issue after cold / warm reboot with device 
connected ends up with probe failures.

When I debug on this issue, I could find that this device was not enumerating 
once the system got booted. Moreover, we were facing this enumeration issue
specific to this device. 

Based on analysis, due to deep power state of the device fails to enumerate.
So, added the following quirks as a workaround fixe and it helps to enumerate the device after cold/warm reboot. If new Samsung X5 SSD's are working fine as expected, we can remove those 
fix. 

Regarding the PCI-Id's, I have confirmed from the logs and it shows as vendor ID : 0x144d 
device ID : 0xa808. I am not sure about why Samsung 970 EVO Plus have the same PCI-Ids.

Logs for reference : 
After connecting Samsung X5 SSD.

lspci
04:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983

dmesg
Line 1478: <6>[  112.838998] pci 0000:04:00.0: [144d:a808] type 00 class 0x010802
Line 1479: <6>[  112.845765] pci 0000:04:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
Line 1480: <6>[  112.853715] pci 0000:04:00.0: 8.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x4 link at 0000:00:07.0 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)
Line 1481: <6>[  112.870536] pci 0000:04:00.0: Adding to iommu group 22
Line 1498: <6>[  113.019698] pci 0000:04:00.0: BAR 0: assigned [mem 0x83000000-0x83003fff 64bit]

Regards,
Monish Kumar R

-----Original Message-----
From: Jason A. Donenfeld <Jason@zx2c4.com> 
Sent: 09 June 2022 14:04
To: R, Monish Kumar <monish.kumar.r@intel.com>
Cc: open list:NVM EXPRESS DRIVER <linux-nvme@lists.infradead.org>; Sagi Grimberg <sagi@grimberg.me>; alan.adamson@oracle.com; LKML <linux-kernel@vger.kernel.org>; Yi Zhang <yi.zhang@redhat.com>; Keith Busch <kbusch@kernel.org>; axboe@fb.com; Christoph Hellwig <hch@lst.de>; Rao, Abhijeet <abhijeet.rao@intel.com>
Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

Hey again,

Figured it out. 2.3 seconds to be exact... It looks like this is caused by:

bc360b0b1611 ("nvme-pci: add quirks for Samsung X5 SSDs") https://lore.kernel.org/all/20220316075449.18906-1-monish.kumar.r@intel.com/

This commit doesn't have any justification and got applied without much discussion. Perhaps Monish could supply some more info about why this is needed here? FTR, I have no issues on my system when reverting that. Perhaps it should be reverted. (I can send a revert commit for that if necessary.)

Looking further, however, the PCIe ID is said to be for a "Samsung X5", which Google says is a portable thunderbolt drive. Is the PCIe ID correct? On my system, this is the PCIe ID of a Samsung 970 EVO Plus.
Is it possible that Monish copied and pasted the wrong PCIe ID? Or has Samsung *reused* the same PCIe ID on both devices? In which case, we'd need some additional data for that quirk to avoid the delay.

Also note that this (potentially errant) commit has been backported to stable.

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-09  9:32       ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] R, Monish Kumar
@ 2022-06-09  9:38         ` Jason A. Donenfeld
  2022-06-10  6:14           ` Christoph Hellwig
  0 siblings, 1 reply; 31+ messages in thread
From: Jason A. Donenfeld @ 2022-06-09  9:38 UTC (permalink / raw)
  To: R, Monish Kumar
  Cc: open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML,
	Yi Zhang, Keith Busch, axboe, Christoph Hellwig, Rao, Abhijeet

Hi Monish,

On Thu, Jun 09, 2022 at 09:32:02AM +0000, R, Monish Kumar wrote:
> Hi Jason,
> 
> I would like to provide justification for this Samsung X5 SSD fix added.
> We were facing SSD enumeration issue after cold / warm reboot with device 
> connected ends up with probe failures.
> 
> When I debug on this issue, I could find that this device was not enumerating 
> once the system got booted. Moreover, we were facing this enumeration issue
> specific to this device. 
> 
> Based on analysis, due to deep power state of the device fails to enumerate.
> So, added the following quirks as a workaround fixe and it helps to enumerate the device after cold/warm reboot. If new Samsung X5 SSD's are working fine as expected, we can remove those 
> fix. 

FWIW, all of that should have been in the commit message. Also, "based
on analysis" - what analysis exactly? I have no way of thinking more
about the issue at hand other than, "Monish said things are like this in
a lab".

In any case, I believe the 970 ID predates that of the X5, and
destroying battery on those laptops and introducing boot time delays
isn't really okay. So let's just revert this until somebody can work out
better how to differentiate drives that need a quirk from drives that
don't need a quirk.

I sent this in: https://lore.kernel.org/lkml/20220609084051.4445-1-Jason@zx2c4.com/

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-09  9:38         ` Jason A. Donenfeld
@ 2022-06-10  6:14           ` Christoph Hellwig
  2022-06-10  9:19             ` Jason A. Donenfeld
  2022-06-10 12:05             ` Pankaj Raghav
  0 siblings, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2022-06-10  6:14 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: R, Monish Kumar, open list:NVM EXPRESS DRIVER, Sagi Grimberg,
	alan.adamson, LKML, Yi Zhang, Keith Busch, axboe,
	Christoph Hellwig, Rao, Abhijeet

On Thu, Jun 09, 2022 at 11:38:47AM +0200, Jason A. Donenfeld wrote:
> FWIW, all of that should have been in the commit message. Also, "based
> on analysis" - what analysis exactly? I have no way of thinking more
> about the issue at hand other than, "Monish said things are like this in
> a lab".

Please calm down a bit.  His report is at least as good as your new
report here..

> In any case, I believe the 970 ID predates that of the X5, and

Huh?

The 970 seems to actually be very slightly newer than the X5.  What
I suspect is that they actually are the same m.2 SSD or at least a
very similar one and Samsung decided to ship it in the thunderbolt
attached versions first.  Maybe one of the Samsung folks here can
confirm.

That leaves us with two plausible theories:

 - the problems could be due to an earlier firmware version or
   ASIC stepping
 - the problems are due to the thunderbolt attachment

Monish and Jason, can you please send me the output of nvme id-ctrl
/dev/nvmeX (where /dev/nvmeX is the actual device number)?

Monish, can you check if you are using the latest available firmware
and if not update it and check if you still need the quirks.

> destroying battery on those laptops and introducing boot time delays
> isn't really okay. So let's just revert this until somebody can work out
> better how to differentiate drives that need a quirk from drives that
> don't need a quirk.

While I'd really like to fix those issue, they are less severe than
not being able to use a device at all.  And just as a reminder: if you
want to get anything please be nice to people and try work with them
productively.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-10  6:14           ` Christoph Hellwig
@ 2022-06-10  9:19             ` Jason A. Donenfeld
  2022-06-13  6:36               ` R, Monish Kumar
  2022-06-13 13:55               ` Christoph Hellwig
  2022-06-10 12:05             ` Pankaj Raghav
  1 sibling, 2 replies; 31+ messages in thread
From: Jason A. Donenfeld @ 2022-06-10  9:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: R, Monish Kumar, open list:NVM EXPRESS DRIVER, Sagi Grimberg,
	alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Rao, Abhijeet

Hi Christoph,

On Fri, Jun 10, 2022 at 08:14:49AM +0200, Christoph Hellwig wrote:
> That leaves us with two plausible theories:
> 
>  - the problems could be due to an earlier firmware version or
>    ASIC stepping
>  - the problems are due to the thunderbolt attachment

Right, that seems like the set of variance we're dealing with. If it's a
firmware version issue, then we revert because people can update? Or can
we quirk firmware version numbers too? If it's ASIC stepping, I guess we
need to quirk that. And likewise thunderbolt, but that seems more
awkward to quirk around, because afaik, it all just appears as PCIe?

> Monish and Jason, can you please send me the output of nvme id-ctrl
> /dev/nvmeX (where /dev/nvmeX is the actual device number)?

NVME Identify Controller:
vid       : 0x144d
ssvid     : 0x144d
sn        : <redacted>
mn        : Samsung SSD 970 EVO Plus 2TB            
fr        : 2B2QEXM7
rab       : 2
ieee      : 002538
cmic      : 0
mdts      : 9
cntlid    : 0x4
ver       : 0x10300
rtd3r     : 0x30d40
rtd3e     : 0x7a1200
oaes      : 0
ctratt    : 0
rrls      : 0
cntrltype : 0
fguid     : 
crdt1     : 0
crdt2     : 0
crdt3     : 0
nvmsr     : 0
vwci      : 0
mec       : 0
oacs      : 0x17
acl       : 7
aerl      : 3
frmw      : 0x16
lpa       : 0x3
elpe      : 63
npss      : 4
avscc     : 0x1
apsta     : 0x1
wctemp    : 358
cctemp    : 358
mtfa      : 0
hmpre     : 0
hmmin     : 0
tnvmcap   : 2000398934016
unvmcap   : 0
rpmbs     : 0
edstt     : 35
dsto      : 0
fwug      : 0
kas       : 0
hctma     : 0x1
mntmt     : 356
mxtmt     : 358
sanicap   : 0
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
endgidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
pels      : 0
domainid  : 0
megcap    : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x5f
fuses     : 0
fna       : 0x5
vwc       : 0x1
awun      : 1023
awupf     : 0
icsvscc     : 1
nwpc      : 0
acwu      : 0
ocfs      : 0
sgls      : 0
mnan      : 0
maxdna    : 0
maxcna    : 0
subnqn    : 
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
fcatt     : 0
msdbd     : 0
ofcs      : 0
ps    0 : mp:7.50W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:5.90W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.60W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0700W non-operational enlat:210 exlat:1200 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-10  6:14           ` Christoph Hellwig
  2022-06-10  9:19             ` Jason A. Donenfeld
@ 2022-06-10 12:05             ` Pankaj Raghav
  1 sibling, 0 replies; 31+ messages in thread
From: Pankaj Raghav @ 2022-06-10 12:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason A. Donenfeld, R, Monish Kumar,
	open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML,
	Yi Zhang, Keith Busch, axboe, Rao, Abhijeet, Pankaj Raghav,
	Javier González

On Fri, Jun 10, 2022 at 08:14:49AM +0200, Christoph Hellwig wrote:
> The 970 seems to actually be very slightly newer than the X5.  What
> I suspect is that they actually are the same m.2 SSD or at least a
> very similar one and Samsung decided to ship it in the thunderbolt
> attached versions first.  Maybe one of the Samsung folks here can
> confirm.
> 
> That leaves us with two plausible theories:
> 
>  - the problems could be due to an earlier firmware version or
>    ASIC stepping
>  - the problems are due to the thunderbolt attachment
> 
I have forwarded this report internally within Samsung and I will post an
update once I have more information about this issue.

Cheers,
Pankaj

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-10  9:19             ` Jason A. Donenfeld
@ 2022-06-13  6:36               ` R, Monish Kumar
  2022-06-13 12:20                 ` Jason A. Donenfeld
  2022-06-13 13:55               ` Christoph Hellwig
  1 sibling, 1 reply; 31+ messages in thread
From: R, Monish Kumar @ 2022-06-13  6:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: open list:NVM EXPRESS DRIVER, Jason A. Donenfeld, Sagi Grimberg,
	alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Rao, Abhijeet

Hi Christoph,

Please see the below nvme id-ctrl response of Samsung X5 SSD.

NVME Identify Controller:
vid       : 0x144d
ssvid     : 0x144d
sn        : <redacted>
mn        : Samsung Portable SSD X5
fr        : 1P3QEXE7
rab       : 2
ieee      : 002538
cmic      : 0
mdts      : 9
cntlid    : 4
ver       : 10300
rtd3r     : 30d40
rtd3e     : 7a1200
oaes      : 0
ctratt    : 0
rrls      : 0
oacs      : 0x7
acl       : 7
aerl      : 3
frmw      : 0x16
lpa       : 0x3
elpe      : 63
npss      : 4
avscc     : 0x1
apsta     : 0x1
wctemp    : 329
cctemp    : 330
mtfa      : 0
hmpre     : 0
hmmin     : 0
tnvmcap   : 500107862016
unvmcap   : 0
rpmbs     : 0
edstt     : 0
dsto      : 0
fwug      : 0
kas       : 0
hctma     : 0
mntmt     : 0
mxtmt     : 0
sanicap   : 0
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x1f
fuses     : 0
fna       : 0x5
vwc       : 0x1
awun      : 127
awupf     : 0
nvscc     : 1
acwu      : 0
sgls      : 0
subnqn    :
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
ctrattr   : 0
msdbd     : 0
ps    0 : mp:6.20W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:4.30W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:2.10W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0400W non-operational enlat:210 exlat:1200 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

Regards,
Monish Kumar R

-----Original Message-----
From: Jason A. Donenfeld <Jason@zx2c4.com> 
Sent: 10 June 2022 14:50
To: Christoph Hellwig <hch@lst.de>
Cc: R, Monish Kumar <monish.kumar.r@intel.com>; open list:NVM EXPRESS DRIVER <linux-nvme@lists.infradead.org>; Sagi Grimberg <sagi@grimberg.me>; alan.adamson@oracle.com; LKML <linux-kernel@vger.kernel.org>; Yi Zhang <yi.zhang@redhat.com>; Keith Busch <kbusch@kernel.org>; axboe@fb.com; Rao, Abhijeet <abhijeet.rao@intel.com>
Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

Hi Christoph,

On Fri, Jun 10, 2022 at 08:14:49AM +0200, Christoph Hellwig wrote:
> That leaves us with two plausible theories:
> 
>  - the problems could be due to an earlier firmware version or
>    ASIC stepping
>  - the problems are due to the thunderbolt attachment

Right, that seems like the set of variance we're dealing with. If it's a firmware version issue, then we revert because people can update? Or can we quirk firmware version numbers too? If it's ASIC stepping, I guess we need to quirk that. And likewise thunderbolt, but that seems more awkward to quirk around, because afaik, it all just appears as PCIe?

> Monish and Jason, can you please send me the output of nvme id-ctrl 
> /dev/nvmeX (where /dev/nvmeX is the actual device number)?

NVME Identify Controller:
vid       : 0x144d
ssvid     : 0x144d
sn        : <redacted>
mn        : Samsung SSD 970 EVO Plus 2TB            
fr        : 2B2QEXM7
rab       : 2
ieee      : 002538
cmic      : 0
mdts      : 9
cntlid    : 0x4
ver       : 0x10300
rtd3r     : 0x30d40
rtd3e     : 0x7a1200
oaes      : 0
ctratt    : 0
rrls      : 0
cntrltype : 0
fguid     : 
crdt1     : 0
crdt2     : 0
crdt3     : 0
nvmsr     : 0
vwci      : 0
mec       : 0
oacs      : 0x17
acl       : 7
aerl      : 3
frmw      : 0x16
lpa       : 0x3
elpe      : 63
npss      : 4
avscc     : 0x1
apsta     : 0x1
wctemp    : 358
cctemp    : 358
mtfa      : 0
hmpre     : 0
hmmin     : 0
tnvmcap   : 2000398934016
unvmcap   : 0
rpmbs     : 0
edstt     : 35
dsto      : 0
fwug      : 0
kas       : 0
hctma     : 0x1
mntmt     : 356
mxtmt     : 358
sanicap   : 0
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
endgidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
pels      : 0
domainid  : 0
megcap    : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x5f
fuses     : 0
fna       : 0x5
vwc       : 0x1
awun      : 1023
awupf     : 0
icsvscc     : 1
nwpc      : 0
acwu      : 0
ocfs      : 0
sgls      : 0
mnan      : 0
maxdna    : 0
maxcna    : 0
subnqn    : 
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
fcatt     : 0
msdbd     : 0
ofcs      : 0
ps    0 : mp:7.50W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:5.90W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.60W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0700W non-operational enlat:210 exlat:1200 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-13  6:36               ` R, Monish Kumar
@ 2022-06-13 12:20                 ` Jason A. Donenfeld
  2022-06-13 13:02                   ` R, Monish Kumar
  0 siblings, 1 reply; 31+ messages in thread
From: Jason A. Donenfeld @ 2022-06-13 12:20 UTC (permalink / raw)
  To: R, Monish Kumar
  Cc: Christoph Hellwig, open list:NVM EXPRESS DRIVER, Sagi Grimberg,
	alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Rao, Abhijeet

On Mon, Jun 13, 2022 at 06:36:44AM +0000, R, Monish Kumar wrote:
> mn        : Samsung Portable SSD X5
> fr        : 1P3QEXE7

Isn't the latest firmware for the QEXE7 line 2B2QEXE7?

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-13 12:20                 ` Jason A. Donenfeld
@ 2022-06-13 13:02                   ` R, Monish Kumar
  0 siblings, 0 replies; 31+ messages in thread
From: R, Monish Kumar @ 2022-06-13 13:02 UTC (permalink / raw)
  To: Jason A. Donenfeld, Pankaj Raghav
  Cc: Christoph Hellwig, open list:NVM EXPRESS DRIVER, Sagi Grimberg,
	alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Rao, Abhijeet,
	p.raghav

Jason,

I am not sure, whether this running firmware (fr: 1P3QEXE7) was the latest one.
Moreover, this device is a newer one and we bought couple of months back.

When I am tried to update SSD firmware using Samsung Portable SSD Software, 
it failed to communicate with their Samsung servers.

Let @Pankaj Raghav <pankydev8@gmail.com> can get back on this and confirm 
about this firmware version, as he reported this issue internally to Samsung. 

Regards,
Monish Kumar R

-----Original Message-----
From: Jason A. Donenfeld <Jason@zx2c4.com> 
Sent: 13 June 2022 17:51
To: R, Monish Kumar <monish.kumar.r@intel.com>
Cc: Christoph Hellwig <hch@lst.de>; open list:NVM EXPRESS DRIVER <linux-nvme@lists.infradead.org>; Sagi Grimberg <sagi@grimberg.me>; alan.adamson@oracle.com; LKML <linux-kernel@vger.kernel.org>; Yi Zhang <yi.zhang@redhat.com>; Keith Busch <kbusch@kernel.org>; axboe@fb.com; Rao, Abhijeet <abhijeet.rao@intel.com>
Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]

On Mon, Jun 13, 2022 at 06:36:44AM +0000, R, Monish Kumar wrote:
> mn        : Samsung Portable SSD X5
> fr        : 1P3QEXE7

Isn't the latest firmware for the QEXE7 line 2B2QEXE7?

Jason

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-10  9:19             ` Jason A. Donenfeld
  2022-06-13  6:36               ` R, Monish Kumar
@ 2022-06-13 13:55               ` Christoph Hellwig
  2022-06-15 10:27                 ` Pankaj Raghav
  1 sibling, 1 reply; 31+ messages in thread
From: Christoph Hellwig @ 2022-06-13 13:55 UTC (permalink / raw)
  To: Jason A. Donenfeld
  Cc: Christoph Hellwig, R, Monish Kumar, open list:NVM EXPRESS DRIVER,
	Sagi Grimberg, alan.adamson, LKML, Yi Zhang, Keith Busch, axboe,
	Rao, Abhijeet

On Fri, Jun 10, 2022 at 11:19:31AM +0200, Jason A. Donenfeld wrote:
> Right, that seems like the set of variance we're dealing with. If it's a
> firmware version issue, then we revert because people can update? Or can
> we quirk firmware version numbers too?

We can quirk on firmware version and model number as well.  Those quirks
need to go into the core nvme module and not just the PCI driver, though.

> If it's ASIC stepping, I guess we
> need to quirk that. And likewise thunderbolt, but that seems more
> awkward to quirk around, because afaik, it all just appears as PCIe?

It all appears as PCIe, but the pci_dev has an is_thunderbolt flag.

Thanks to both of you for the information.  I'd like to wait until the
end of the week or so if we can hear something from Samsung, and if we
don't we'll have to quirk based on the model number.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-13 13:55               ` Christoph Hellwig
@ 2022-06-15 10:27                 ` Pankaj Raghav
  2022-06-15 11:31                   ` Christoph Hellwig
  0 siblings, 1 reply; 31+ messages in thread
From: Pankaj Raghav @ 2022-06-15 10:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason A. Donenfeld, R, Monish Kumar,
	open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML,
	Yi Zhang, Keith Busch, axboe, Rao, Abhijeet

Hi Christoph,
On Mon, Jun 13, 2022 at 03:55:49PM +0200, Christoph Hellwig wrote:
> It all appears as PCIe, but the pci_dev has an is_thunderbolt flag.
> 
> Thanks to both of you for the information.  I'd like to wait until the
> end of the week or so if we can hear something from Samsung, and if we
> don't we'll have to quirk based on the model number.
Our FW team has started looking into the issue. They said they will try
to come up with a solution before 5.20. If not, we can add this quirk
based on the FW ver. and a proper solution can be added by 5.21.
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-15 10:27                 ` Pankaj Raghav
@ 2022-06-15 11:31                   ` Christoph Hellwig
  2022-06-20  3:36                     ` onenowy
  2022-06-20  4:34                     ` SungHwan Jung
  0 siblings, 2 replies; 31+ messages in thread
From: Christoph Hellwig @ 2022-06-15 11:31 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Christoph Hellwig, Jason A. Donenfeld, R, Monish Kumar,
	open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML,
	Yi Zhang, Keith Busch, axboe, Rao, Abhijeet

On Wed, Jun 15, 2022 at 12:27:57PM +0200, Pankaj Raghav wrote:
> Hi Christoph,
> On Mon, Jun 13, 2022 at 03:55:49PM +0200, Christoph Hellwig wrote:
> > It all appears as PCIe, but the pci_dev has an is_thunderbolt flag.
> > 
> > Thanks to both of you for the information.  I'd like to wait until the
> > end of the week or so if we can hear something from Samsung, and if we
> > don't we'll have to quirk based on the model number.
> Our FW team has started looking into the issue. They said they will try
> to come up with a solution before 5.20. If not, we can add this quirk
> based on the FW ver. and a proper solution can be added by 5.21.

I don't think we can wait for 5.20 - the "offending" commit is in
5.19-rc and -stable.  So I'll plan to prepare a patch based on the model
number for now, still hoping we can come up with something better
eventually.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-15 11:31                   ` Christoph Hellwig
@ 2022-06-20  3:36                     ` onenowy
  2022-06-20  7:09                       ` Christoph Hellwig
  2022-06-20  4:34                     ` SungHwan Jung
  1 sibling, 1 reply; 31+ messages in thread
From: onenowy @ 2022-06-20  3:36 UTC (permalink / raw)
  To: hch
  Cc: Jason, abhijeet.rao, alan.adamson, axboe, kbusch, linux-kernel,
	linux-nvme, monish.kumar.r, pankydev8, sagi, yi.zhang

> I don't think we can wait for 5.20 - the "offending" commit is in
> 5.19-rc and -stable.  So I'll plan to prepare a patch based on the model
> number for now, still hoping we can come up with something better
> eventually.

Hi ,
Some samsung SSD for OEM also have the identical PCI-ids and are affected by this quirk.
But they have different subsystem-ids.

For example,

model number: MZQLB1T9HAJR-00V3 for lenovo
vendor: 144d
device: a808
subvendor: 1d49
subdevice: 403b

model number: MZVLB256HBHQ-00000
vendor: 144d
device: a808
subvendor: 144d
subdevice: a801

Addtition of subsystem-ids of X5 to pci_device_id(as below) may solve this problem.

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 17aeb7d5c48522..92fd3b1d88fc95 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -3475,7 +3475,7 @@ static const struct pci_device_id nvme_id_table[] = {
 				NVME_QUIRK_128_BYTES_SQES |
 				NVME_QUIRK_SHARED_TAGS |
 				NVME_QUIRK_SKIP_CID_GEN },
-	{ PCI_DEVICE(0x144d, 0xa808),   /* Samsung X5 */
+	{ PCI_DEVICE_SUB(0x144d, 0xa808, {X5 subvendor?}, {X5 subdevice?}),   /* Samsung X5 */
 		.driver_data =  NVME_QUIRK_DELAY_BEFORE_CHK_RDY|
 				NVME_QUIRK_NO_DEEPEST_PS |
 				NVME_QUIRK_IGNORE_DEV_SUBNQN, },


But I don't know X5's subsystem ids, Can someone provide subsystem-ids of X5?

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-15 11:31                   ` Christoph Hellwig
  2022-06-20  3:36                     ` onenowy
@ 2022-06-20  4:34                     ` SungHwan Jung
  1 sibling, 0 replies; 31+ messages in thread
From: SungHwan Jung @ 2022-06-20  4:34 UTC (permalink / raw)
  To: hch
  Cc: Jason, abhijeet.rao, alan.adamson, axboe, kbusch, linux-kernel,
	linux-nvme, monish.kumar.r, pankydev8, sagi, yi.zhang

Hi again,
I'm sorry to send mail again.

I think it would be to confirm that subsystem-ids for all X5 model and these ids are not used for other ssd by contact with samsung.

But I have no idea how to contant with them, please fowards this to them or if you already have these information, please tests the patch in previous mail.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests]
  2022-06-20  3:36                     ` onenowy
@ 2022-06-20  7:09                       ` Christoph Hellwig
  0 siblings, 0 replies; 31+ messages in thread
From: Christoph Hellwig @ 2022-06-20  7:09 UTC (permalink / raw)
  To: onenowy
  Cc: hch, Jason, abhijeet.rao, alan.adamson, axboe, kbusch,
	linux-kernel, linux-nvme, monish.kumar.r, pankydev8, sagi,
	yi.zhang

On Mon, Jun 20, 2022 at 12:36:27PM +0900, onenowy wrote:
> Some samsung SSD for OEM also have the identical PCI-ids and are affected by this quirk.
> But they have different subsystem-ids.

> Addtition of subsystem-ids of X5 to pci_device_id(as below) may solve this problem.

Monish, can you look into that?

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2022-06-20  7:09 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-03 14:28 [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests Yi Zhang
2022-04-04 16:39 ` Alan Adamson
2022-04-04 17:02   ` Keith Busch
2022-04-04 19:34     ` Jonathan Derrick
2022-04-04 20:30       ` Keith Busch
2022-04-05  0:43         ` Jonathan Derrick
2022-04-06 17:34           ` Keith Busch
2022-04-05  6:14         ` Christoph Hellwig
2022-04-05 18:21           ` Jonathan Derrick
2022-04-05 20:51             ` Alan Adamson
2022-04-05 21:56               ` Jonathan Derrick
2022-04-05 19:09           ` Alan Adamson
2022-04-05 20:00 ` Jason A. Donenfeld
2022-04-05 20:48   ` Alan Adamson
2022-06-09  8:20   ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] Jason A. Donenfeld
2022-06-09  8:34     ` Jason A. Donenfeld
2022-06-09  8:40       ` [PATCH] Revert "nvme-pci: add quirks for Samsung X5 SSDs" Jason A. Donenfeld
2022-06-09  9:32       ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] R, Monish Kumar
2022-06-09  9:38         ` Jason A. Donenfeld
2022-06-10  6:14           ` Christoph Hellwig
2022-06-10  9:19             ` Jason A. Donenfeld
2022-06-13  6:36               ` R, Monish Kumar
2022-06-13 12:20                 ` Jason A. Donenfeld
2022-06-13 13:02                   ` R, Monish Kumar
2022-06-13 13:55               ` Christoph Hellwig
2022-06-15 10:27                 ` Pankaj Raghav
2022-06-15 11:31                   ` Christoph Hellwig
2022-06-20  3:36                     ` onenowy
2022-06-20  7:09                       ` Christoph Hellwig
2022-06-20  4:34                     ` SungHwan Jung
2022-06-10 12:05             ` Pankaj Raghav

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.