* [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests @ 2022-04-03 14:28 Yi Zhang 2022-04-04 16:39 ` Alan Adamson 2022-04-05 20:00 ` Jason A. Donenfeld 0 siblings, 2 replies; 31+ messages in thread From: Yi Zhang @ 2022-04-03 14:28 UTC (permalink / raw) To: open list:NVM EXPRESS DRIVER; +Cc: Sagi Grimberg, alan.adamson Hello I found this error log during blktests[1] with the latest linux-block/for-next, seems it was introduced after commit[2], is that expected? [1] [ 96.931911] run blktests nvme/004 at 2022-04-02 20:53:16 [ 97.024693] loop: module loaded [ 97.029319] loop0: detected capacity change from 0 to 2097152 [ 97.039641] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 [ 97.068847] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0035-4b10-8044-b9c04f463333. [ 97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR [ 97.090108] nvme nvme0: creating 32 I/O queues. [ 97.098283] nvme nvme0: new ctrl: "blktests-subsystem-1" [ 98.136900] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" [2] commit bd83fe6f2cd2133beaac7c423fd36c3515048fc8 Author: Alan Adamson <alan.adamson@oracle.com> Date: Thu Feb 3 00:11:53 2022 -0800 nvme: add verbose error logging -- Best Regards, Yi Zhang ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests 2022-04-03 14:28 [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests Yi Zhang @ 2022-04-04 16:39 ` Alan Adamson 2022-04-04 17:02 ` Keith Busch 2022-04-05 20:00 ` Jason A. Donenfeld 1 sibling, 1 reply; 31+ messages in thread From: Alan Adamson @ 2022-04-04 16:39 UTC (permalink / raw) To: Yi Zhang; +Cc: open list:NVM EXPRESS DRIVER, Sagi Grimberg [ 97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR An error from the device (status=2/Invalid Field) when an Identify (0x6) command was issued. Prior to the patch, the nvme driver didn’t display the error. Alan > On Apr 3, 2022, at 7:28 AM, Yi Zhang <yi.zhang@redhat.com> wrote: > > Hello > > I found this error log during blktests[1] with the latest > linux-block/for-next, seems it was introduced after commit[2], is that > expected? > > [1] > [ 96.931911] run blktests nvme/004 at 2022-04-02 20:53:16 > [ 97.024693] loop: module loaded > [ 97.029319] loop0: detected capacity change from 0 to 2097152 > [ 97.039641] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 > [ 97.068847] nvmet: creating nvm controller 1 for subsystem > blktests-subsystem-1 for NQN > nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0035-4b10-8044-b9c04f463333. > [ 97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR > [ 97.090108] nvme nvme0: creating 32 I/O queues. > [ 97.098283] nvme nvme0: new ctrl: "blktests-subsystem-1" > [ 98.136900] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" > > [2] > commit bd83fe6f2cd2133beaac7c423fd36c3515048fc8 > Author: Alan Adamson <alan.adamson@oracle.com> > Date: Thu Feb 3 00:11:53 2022 -0800 > > nvme: add verbose error logging > > -- > Best Regards, > Yi Zhang > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests 2022-04-04 16:39 ` Alan Adamson @ 2022-04-04 17:02 ` Keith Busch 2022-04-04 19:34 ` Jonathan Derrick 0 siblings, 1 reply; 31+ messages in thread From: Keith Busch @ 2022-04-04 17:02 UTC (permalink / raw) To: Alan Adamson; +Cc: Yi Zhang, open list:NVM EXPRESS DRIVER, Sagi Grimberg On Mon, Apr 04, 2022 at 04:39:06PM +0000, Alan Adamson wrote: > > [ 97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR > > An error from the device (status=2/Invalid Field) when an Identify (0x6) command was issued. Prior to the patch, > the nvme driver didn’t display the error. And it's harmless. The driver is querying an optional identification and it's okay if the device doesn't support it, but the driver doesn't know if the device supports until it tries it. > > On Apr 3, 2022, at 7:28 AM, Yi Zhang <yi.zhang@redhat.com> wrote: > > > > Hello > > > > I found this error log during blktests[1] with the latest > > linux-block/for-next, seems it was introduced after commit[2], is that > > expected? > > > > [1] > > [ 96.931911] run blktests nvme/004 at 2022-04-02 20:53:16 > > [ 97.024693] loop: module loaded > > [ 97.029319] loop0: detected capacity change from 0 to 2097152 > > [ 97.039641] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 > > [ 97.068847] nvmet: creating nvm controller 1 for subsystem > > blktests-subsystem-1 for NQN > > nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0035-4b10-8044-b9c04f463333. > > [ 97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR > > [ 97.090108] nvme nvme0: creating 32 I/O queues. > > [ 97.098283] nvme nvme0: new ctrl: "blktests-subsystem-1" > > [ 98.136900] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" > > > > [2] > > commit bd83fe6f2cd2133beaac7c423fd36c3515048fc8 > > Author: Alan Adamson <alan.adamson@oracle.com> > > Date: Thu Feb 3 00:11:53 2022 -0800 > > > > nvme: add verbose error logging > > > > -- > > Best Regards, > > Yi Zhang > > > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests 2022-04-04 17:02 ` Keith Busch @ 2022-04-04 19:34 ` Jonathan Derrick 2022-04-04 20:30 ` Keith Busch 0 siblings, 1 reply; 31+ messages in thread From: Jonathan Derrick @ 2022-04-04 19:34 UTC (permalink / raw) To: linux-nvme On 4/4/2022 11:02 AM, Keith Busch wrote: > On Mon, Apr 04, 2022 at 04:39:06PM +0000, Alan Adamson wrote: >> >> [ 97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR >> >> An error from the device (status=2/Invalid Field) when an Identify (0x6) command was issued. Prior to the patch, >> the nvme driver didn’t display the error. > > And it's harmless. The driver is querying an optional identification and it's > okay if the device doesn't support it, but the driver doesn't know if the > device supports until it tries it. I've had to field bug reports like this before, where it says error in dmesg but is actually harmless. Maybe we could suppress the error and add a harmless print? Eg, nvme0: blah blah command set not supported > >>> On Apr 3, 2022, at 7:28 AM, Yi Zhang <yi.zhang@redhat.com> wrote: >>> >>> Hello >>> >>> I found this error log during blktests[1] with the latest >>> linux-block/for-next, seems it was introduced after commit[2], is that >>> expected? >>> >>> [1] >>> [ 96.931911] run blktests nvme/004 at 2022-04-02 20:53:16 >>> [ 97.024693] loop: module loaded >>> [ 97.029319] loop0: detected capacity change from 0 to 2097152 >>> [ 97.039641] nvmet: adding nsid 1 to subsystem blktests-subsystem-1 >>> [ 97.068847] nvmet: creating nvm controller 1 for subsystem >>> blktests-subsystem-1 for NQN >>> nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0035-4b10-8044-b9c04f463333. >>> [ 97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR >>> [ 97.090108] nvme nvme0: creating 32 I/O queues. >>> [ 97.098283] nvme nvme0: new ctrl: "blktests-subsystem-1" >>> [ 98.136900] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1" >>> >>> [2] >>> commit bd83fe6f2cd2133beaac7c423fd36c3515048fc8 >>> Author: Alan Adamson <alan.adamson@oracle.com> >>> Date: Thu Feb 3 00:11:53 2022 -0800 >>> >>> nvme: add verbose error logging >>> >>> -- >>> Best Regards, >>> Yi Zhang >>> >> > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests 2022-04-04 19:34 ` Jonathan Derrick @ 2022-04-04 20:30 ` Keith Busch 2022-04-05 0:43 ` Jonathan Derrick 2022-04-05 6:14 ` Christoph Hellwig 0 siblings, 2 replies; 31+ messages in thread From: Keith Busch @ 2022-04-04 20:30 UTC (permalink / raw) To: Jonathan Derrick; +Cc: linux-nvme On Mon, Apr 04, 2022 at 01:34:38PM -0600, Jonathan Derrick wrote: > > > On 4/4/2022 11:02 AM, Keith Busch wrote: > > On Mon, Apr 04, 2022 at 04:39:06PM +0000, Alan Adamson wrote: > > > > > > [ 97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR > > > > > > An error from the device (status=2/Invalid Field) when an Identify (0x6) command was issued. Prior to the patch, > > > the nvme driver didn’t display the error. > > And it's harmless. The driver is querying an optional identification and it's > > okay if the device doesn't support it, but the driver doesn't know if the > > device supports until it tries it. > I've had to field bug reports like this before, where it says error in dmesg > but is actually harmless. Even before the kernel message was added, some implementations trigger error log events, which alerts smartd. :( > Maybe we could suppress the error and add a harmless print? > Eg, nvme0: blah blah command set not supported The new print in the completion handler is pretty generic. I don't think it can readily tell the difference from a harmless error. Maybe pr_err is too high? Or maybe since enough people have been concerned about *this* specific identify, maybe it should be restricted to 2.0 devices where it's mandatory. I was reluctant to do that at first since the initial device I tested was 1.4, but it was a prototype and we should be fine without the non-mdts limits anyway. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests 2022-04-04 20:30 ` Keith Busch @ 2022-04-05 0:43 ` Jonathan Derrick 2022-04-06 17:34 ` Keith Busch 2022-04-05 6:14 ` Christoph Hellwig 1 sibling, 1 reply; 31+ messages in thread From: Jonathan Derrick @ 2022-04-05 0:43 UTC (permalink / raw) To: Keith Busch; +Cc: linux-nvme On 4/4/2022 2:30 PM, Keith Busch wrote: > On Mon, Apr 04, 2022 at 01:34:38PM -0600, Jonathan Derrick wrote: >> >> >> On 4/4/2022 11:02 AM, Keith Busch wrote: >>> On Mon, Apr 04, 2022 at 04:39:06PM +0000, Alan Adamson wrote: >>>> >>>> [ 97.083215] nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR >>>> >>>> An error from the device (status=2/Invalid Field) when an Identify (0x6) command was issued. Prior to the patch, >>>> the nvme driver didn’t display the error. >>> And it's harmless. The driver is querying an optional identification and it's >>> okay if the device doesn't support it, but the driver doesn't know if the >>> device supports until it tries it. >> I've had to field bug reports like this before, where it says error in dmesg >> but is actually harmless. > > Even before the kernel message was added, some implementations trigger error > log events, which alerts smartd. :( I haven't had too many smartd reports come my way, but 'error' in dmesg is a typical one. Usually due a language barrier issue or uneducated AE. But I suspect a lay-user without an AE might resort to kernel bz due to things like this. > >> Maybe we could suppress the error and add a harmless print? >> Eg, nvme0: blah blah command set not supported > > The new print in the completion handler is pretty generic. I don't think it can > readily tell the difference from a harmless error. Maybe pr_err is too high? Could we pack RQF_FAILED in the req flags and reduce the severity in nvme_log_error? > > Or maybe since enough people have been concerned about *this* specific > identify, maybe it should be restricted to 2.0 devices where it's mandatory. I > was reluctant to do that at first since the initial device I tested was 1.4, > but it was a prototype and we should be fine without the non-mdts limits > anyway. Well I'm not sure about that. I'm not honestly sure what this specific identify is, but (I know you know) NVMe being forward compatible allows eg, 1.4 compliant devices/targets to support 2.0 features. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests 2022-04-05 0:43 ` Jonathan Derrick @ 2022-04-06 17:34 ` Keith Busch 0 siblings, 0 replies; 31+ messages in thread From: Keith Busch @ 2022-04-06 17:34 UTC (permalink / raw) To: Jonathan Derrick; +Cc: linux-nvme On Mon, Apr 04, 2022 at 06:43:01PM -0600, Jonathan Derrick wrote: > > Or maybe since enough people have been concerned about *this* specific > > identify, maybe it should be restricted to 2.0 devices where it's mandatory. I > > was reluctant to do that at first since the initial device I tested was 1.4, > > but it was a prototype and we should be fine without the non-mdts limits > > anyway. > Well I'm not sure about that. I'm not honestly sure what this specific > identify is, but (I know you know) NVMe being forward compatible allows eg, > 1.4 compliant devices/targets to support 2.0 features. The command triggering the error is the NVM CSI Identify Controller (CNS=6), which provides the recommended limits for commands that don't transfer block data (write zeroes, verify, dsm, etc..). The driver sets it up in nvme_init_non_mdts_limits(). Here are two other user reports of this triggering smartd warnings: http://lists.infradead.org/pipermail/linux-nvme/2021-July/028009.html https://bugzilla.kernel.org/show_bug.cgi?id=215763 The identification only provides recommended limits, so it's not all that bad if we skip checking it for <2.0 versions that don't require it. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests 2022-04-04 20:30 ` Keith Busch 2022-04-05 0:43 ` Jonathan Derrick @ 2022-04-05 6:14 ` Christoph Hellwig 2022-04-05 18:21 ` Jonathan Derrick 2022-04-05 19:09 ` Alan Adamson 1 sibling, 2 replies; 31+ messages in thread From: Christoph Hellwig @ 2022-04-05 6:14 UTC (permalink / raw) To: Keith Busch; +Cc: Jonathan Derrick, linux-nvme On Mon, Apr 04, 2022 at 02:30:12PM -0600, Keith Busch wrote: > > Eg, nvme0: blah blah command set not supported > > The new print in the completion handler is pretty generic. I don't think it can > readily tell the difference from a harmless error. Maybe pr_err is too high? > > Or maybe since enough people have been concerned about *this* specific > identify, maybe it should be restricted to 2.0 devices where it's mandatory. I > was reluctant to do that at first since the initial device I tested was 1.4, > but it was a prototype and we should be fine without the non-mdts limits > anyway. What SCSI does is to add RQF_QUIET to all internal passthrough commands, and then skips printing the SCSI specific error messages in addition if that flag is set. This would be the nvme version of that: diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 7e07dd69262a7..9346cd4cf5820 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -366,7 +366,8 @@ static inline void nvme_end_req(struct request *req) { blk_status_t status = nvme_error_status(nvme_req(req)->status); - if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS)) + if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS && + !(req->rq_flags & RQF_QUIET))) nvme_log_error(req); nvme_end_req_zoned(req); nvme_trace_bio_complete(req); @@ -648,6 +649,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd) cmd->common.flags &= ~NVME_CMD_SGL_ALL; req->cmd_flags |= REQ_FAILFAST_DRIVER; + req->rq_flags |= RQF_QUIET; if (req->mq_hctx->type == HCTX_TYPE_POLL) req->cmd_flags |= REQ_POLLED; nvme_clear_nvme_request(req); ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests 2022-04-05 6:14 ` Christoph Hellwig @ 2022-04-05 18:21 ` Jonathan Derrick 2022-04-05 20:51 ` Alan Adamson 2022-04-05 19:09 ` Alan Adamson 1 sibling, 1 reply; 31+ messages in thread From: Jonathan Derrick @ 2022-04-05 18:21 UTC (permalink / raw) To: Christoph Hellwig, Keith Busch; +Cc: linux-nvme On 4/5/2022 12:14 AM, Christoph Hellwig wrote: > On Mon, Apr 04, 2022 at 02:30:12PM -0600, Keith Busch wrote: >>> Eg, nvme0: blah blah command set not supported >> >> The new print in the completion handler is pretty generic. I don't think it can >> readily tell the difference from a harmless error. Maybe pr_err is too high? >> >> Or maybe since enough people have been concerned about *this* specific >> identify, maybe it should be restricted to 2.0 devices where it's mandatory. I >> was reluctant to do that at first since the initial device I tested was 1.4, >> but it was a prototype and we should be fine without the non-mdts limits >> anyway. > > What SCSI does is to add RQF_QUIET to all internal passthrough commands, > and then skips printing the SCSI specific error messages in addition > if that flag is set. > > This would be the nvme version of that: > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 7e07dd69262a7..9346cd4cf5820 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -366,7 +366,8 @@ static inline void nvme_end_req(struct request *req) > { > blk_status_t status = nvme_error_status(nvme_req(req)->status); > > - if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS)) > + if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS && > + !(req->rq_flags & RQF_QUIET))) > nvme_log_error(req); > nvme_end_req_zoned(req); > nvme_trace_bio_complete(req); > @@ -648,6 +649,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd) > cmd->common.flags &= ~NVME_CMD_SGL_ALL; > > req->cmd_flags |= REQ_FAILFAST_DRIVER; > + req->rq_flags |= RQF_QUIET; > if (req->mq_hctx->type == HCTX_TYPE_POLL) > req->cmd_flags |= REQ_POLLED; > nvme_clear_nvme_request(req); That's good too. How about this so it's limited to debug loglevels: diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index f204c6f78b5b..871ad2421284 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -303,9 +303,10 @@ static void nvme_log_error(struct request *req) { struct nvme_ns *ns = req->q->queuedata; struct nvme_request *nr = nvme_req(req); + int level = req->rq_flags & RQF_QUIET ? KERN_DEBUG : KERN_ERR; if (ns) { - pr_err_ratelimited("%s: %s(0x%x) @ LBA %llu, %llu blocks, %s (sct 0x%x / sc 0x%x) %s%s\n", + printk_ratelimited(level "%s: %s(0x%x) @ LBA %llu, %llu blocks, %s (sct 0x%x / sc 0x%x) %s%s\n", ns->disk ? ns->disk->disk_name : "?", nvme_get_opcode_str(nr->cmd->common.opcode), nr->cmd->common.opcode, @@ -319,7 +320,7 @@ static void nvme_log_error(struct request *req) return; } - pr_err_ratelimited("%s: %s(0x%x), %s (sct 0x%x / sc 0x%x) %s%s\n", + printk_ratelimited(level "%s: %s(0x%x), %s (sct 0x%x / sc 0x%x) %s%s\n", dev_name(nr->ctrl->device), nvme_get_admin_opcode_str(nr->cmd->common.opcode), nr->cmd->common.opcode, @@ -651,6 +652,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd) cmd->common.flags &= ~NVME_CMD_SGL_ALL; req->cmd_flags |= REQ_FAILFAST_DRIVER; + req->rq_flags |= RQF_QUIET; if (req->mq_hctx->type == HCTX_TYPE_POLL) req->cmd_flags |= REQ_POLLED; nvme_clear_nvme_request(req); ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests 2022-04-05 18:21 ` Jonathan Derrick @ 2022-04-05 20:51 ` Alan Adamson 2022-04-05 21:56 ` Jonathan Derrick 0 siblings, 1 reply; 31+ messages in thread From: Alan Adamson @ 2022-04-05 20:51 UTC (permalink / raw) To: Jonathan Derrick; +Cc: Christoph Hellwig, Keith Busch, linux-nvme > On Apr 5, 2022, at 11:21 AM, Jonathan Derrick <jonathan.derrick@linux.dev> wrote: > > > > On 4/5/2022 12:14 AM, Christoph Hellwig wrote: >> On Mon, Apr 04, 2022 at 02:30:12PM -0600, Keith Busch wrote: >>>> Eg, nvme0: blah blah command set not supported >>> >>> The new print in the completion handler is pretty generic. I don't think it can >>> readily tell the difference from a harmless error. Maybe pr_err is too high? >>> >>> Or maybe since enough people have been concerned about *this* specific >>> identify, maybe it should be restricted to 2.0 devices where it's mandatory. I >>> was reluctant to do that at first since the initial device I tested was 1.4, >>> but it was a prototype and we should be fine without the non-mdts limits >>> anyway. >> >> What SCSI does is to add RQF_QUIET to all internal passthrough commands, >> and then skips printing the SCSI specific error messages in addition >> if that flag is set. >> >> This would be the nvme version of that: >> >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >> index 7e07dd69262a7..9346cd4cf5820 100644 >> --- a/drivers/nvme/host/core.c >> +++ b/drivers/nvme/host/core.c >> @@ -366,7 +366,8 @@ static inline void nvme_end_req(struct request *req) >> { >> blk_status_t status = nvme_error_status(nvme_req(req)->status); >> >> - if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS)) >> + if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS && >> + !(req->rq_flags & RQF_QUIET))) >> nvme_log_error(req); >> nvme_end_req_zoned(req); >> nvme_trace_bio_complete(req); >> @@ -648,6 +649,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd) >> cmd->common.flags &= ~NVME_CMD_SGL_ALL; >> >> req->cmd_flags |= REQ_FAILFAST_DRIVER; >> + req->rq_flags |= RQF_QUIET; >> if (req->mq_hctx->type == HCTX_TYPE_POLL) >> req->cmd_flags |= REQ_POLLED; >> nvme_clear_nvme_request(req); > > > That's good too. > How about this so it's limited to debug loglevels: I don’t think we want to limit it to debug loglevels. The main purpose of the patch was to allow for debugging issues of live customer systems. Alan > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index f204c6f78b5b..871ad2421284 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -303,9 +303,10 @@ static void nvme_log_error(struct request *req) > { > struct nvme_ns *ns = req->q->queuedata; > struct nvme_request *nr = nvme_req(req); > + int level = req->rq_flags & RQF_QUIET ? KERN_DEBUG : KERN_ERR; > > if (ns) { > - pr_err_ratelimited("%s: %s(0x%x) @ LBA %llu, %llu blocks, %s (sct 0x%x / sc 0x%x) %s%s\n", > + printk_ratelimited(level "%s: %s(0x%x) @ LBA %llu, %llu blocks, %s (sct 0x%x / sc 0x%x) %s%s\n", > ns->disk ? ns->disk->disk_name : "?", > nvme_get_opcode_str(nr->cmd->common.opcode), > nr->cmd->common.opcode, > @@ -319,7 +320,7 @@ static void nvme_log_error(struct request *req) > return; > } > > - pr_err_ratelimited("%s: %s(0x%x), %s (sct 0x%x / sc 0x%x) %s%s\n", > + printk_ratelimited(level "%s: %s(0x%x), %s (sct 0x%x / sc 0x%x) %s%s\n", > dev_name(nr->ctrl->device), > nvme_get_admin_opcode_str(nr->cmd->common.opcode), > nr->cmd->common.opcode, > @@ -651,6 +652,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd) > cmd->common.flags &= ~NVME_CMD_SGL_ALL; > > req->cmd_flags |= REQ_FAILFAST_DRIVER; > + req->rq_flags |= RQF_QUIET; > if (req->mq_hctx->type == HCTX_TYPE_POLL) > req->cmd_flags |= REQ_POLLED; > nvme_clear_nvme_request(req); > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests 2022-04-05 20:51 ` Alan Adamson @ 2022-04-05 21:56 ` Jonathan Derrick 0 siblings, 0 replies; 31+ messages in thread From: Jonathan Derrick @ 2022-04-05 21:56 UTC (permalink / raw) To: Alan Adamson; +Cc: Christoph Hellwig, Keith Busch, linux-nvme On 4/5/2022 2:51 PM, Alan Adamson wrote: > > >> On Apr 5, 2022, at 11:21 AM, Jonathan Derrick <jonathan.derrick@linux.dev> wrote: >> >> >> >> On 4/5/2022 12:14 AM, Christoph Hellwig wrote: >>> On Mon, Apr 04, 2022 at 02:30:12PM -0600, Keith Busch wrote: >>>>> Eg, nvme0: blah blah command set not supported >>>> >>>> The new print in the completion handler is pretty generic. I don't think it can >>>> readily tell the difference from a harmless error. Maybe pr_err is too high? >>>> >>>> Or maybe since enough people have been concerned about *this* specific >>>> identify, maybe it should be restricted to 2.0 devices where it's mandatory. I >>>> was reluctant to do that at first since the initial device I tested was 1.4, >>>> but it was a prototype and we should be fine without the non-mdts limits >>>> anyway. >>> >>> What SCSI does is to add RQF_QUIET to all internal passthrough commands, >>> and then skips printing the SCSI specific error messages in addition >>> if that flag is set. >>> >>> This would be the nvme version of that: >>> >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >>> index 7e07dd69262a7..9346cd4cf5820 100644 >>> --- a/drivers/nvme/host/core.c >>> +++ b/drivers/nvme/host/core.c >>> @@ -366,7 +366,8 @@ static inline void nvme_end_req(struct request *req) >>> { >>> blk_status_t status = nvme_error_status(nvme_req(req)->status); >>> >>> - if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS)) >>> + if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS && >>> + !(req->rq_flags & RQF_QUIET))) >>> nvme_log_error(req); >>> nvme_end_req_zoned(req); >>> nvme_trace_bio_complete(req); >>> @@ -648,6 +649,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd) >>> cmd->common.flags &= ~NVME_CMD_SGL_ALL; >>> >>> req->cmd_flags |= REQ_FAILFAST_DRIVER; >>> + req->rq_flags |= RQF_QUIET; >>> if (req->mq_hctx->type == HCTX_TYPE_POLL) >>> req->cmd_flags |= REQ_POLLED; >>> nvme_clear_nvme_request(req); >> >> >> That's good too. >> How about this so it's limited to debug loglevels: > > I don’t think we want to limit it to debug loglevels. The main purpose of the patch was to allow for debugging issues of live customer systems. > > Alan I guess it's your preference if you want that print to be accessible during debug or if you don't want that print at all. > > >> >> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c >> index f204c6f78b5b..871ad2421284 100644 >> --- a/drivers/nvme/host/core.c >> +++ b/drivers/nvme/host/core.c >> @@ -303,9 +303,10 @@ static void nvme_log_error(struct request *req) >> { >> struct nvme_ns *ns = req->q->queuedata; >> struct nvme_request *nr = nvme_req(req); >> + int level = req->rq_flags & RQF_QUIET ? KERN_DEBUG : KERN_ERR; >> >> if (ns) { >> - pr_err_ratelimited("%s: %s(0x%x) @ LBA %llu, %llu blocks, %s (sct 0x%x / sc 0x%x) %s%s\n", >> + printk_ratelimited(level "%s: %s(0x%x) @ LBA %llu, %llu blocks, %s (sct 0x%x / sc 0x%x) %s%s\n", >> ns->disk ? ns->disk->disk_name : "?", >> nvme_get_opcode_str(nr->cmd->common.opcode), >> nr->cmd->common.opcode, >> @@ -319,7 +320,7 @@ static void nvme_log_error(struct request *req) >> return; >> } >> >> - pr_err_ratelimited("%s: %s(0x%x), %s (sct 0x%x / sc 0x%x) %s%s\n", >> + printk_ratelimited(level "%s: %s(0x%x), %s (sct 0x%x / sc 0x%x) %s%s\n", >> dev_name(nr->ctrl->device), >> nvme_get_admin_opcode_str(nr->cmd->common.opcode), >> nr->cmd->common.opcode, >> @@ -651,6 +652,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd) >> cmd->common.flags &= ~NVME_CMD_SGL_ALL; >> >> req->cmd_flags |= REQ_FAILFAST_DRIVER; >> + req->rq_flags |= RQF_QUIET; >> if (req->mq_hctx->type == HCTX_TYPE_POLL) >> req->cmd_flags |= REQ_POLLED; >> nvme_clear_nvme_request(req); >> > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests 2022-04-05 6:14 ` Christoph Hellwig 2022-04-05 18:21 ` Jonathan Derrick @ 2022-04-05 19:09 ` Alan Adamson 1 sibling, 0 replies; 31+ messages in thread From: Alan Adamson @ 2022-04-05 19:09 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Keith Busch, Jonathan Derrick, linux-nvme > On Apr 4, 2022, at 11:14 PM, Christoph Hellwig <hch@infradead.org> wrote: > > On Mon, Apr 04, 2022 at 02:30:12PM -0600, Keith Busch wrote: >>> Eg, nvme0: blah blah command set not supported >> >> The new print in the completion handler is pretty generic. I don't think it can >> readily tell the difference from a harmless error. Maybe pr_err is too high? >> >> Or maybe since enough people have been concerned about *this* specific >> identify, maybe it should be restricted to 2.0 devices where it's mandatory. I >> was reluctant to do that at first since the initial device I tested was 1.4, >> but it was a prototype and we should be fine without the non-mdts limits >> anyway. > > What SCSI does is to add RQF_QUIET to all internal passthrough commands, > and then skips printing the SCSI specific error messages in addition > if that flag is set. > > This would be the nvme version of that: > > diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c > index 7e07dd69262a7..9346cd4cf5820 100644 > --- a/drivers/nvme/host/core.c > +++ b/drivers/nvme/host/core.c > @@ -366,7 +366,8 @@ static inline void nvme_end_req(struct request *req) > { > blk_status_t status = nvme_error_status(nvme_req(req)->status); > > - if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS)) > + if (unlikely(nvme_req(req)->status != NVME_SC_SUCCESS && > + !(req->rq_flags & RQF_QUIET))) > nvme_log_error(req); > nvme_end_req_zoned(req); > nvme_trace_bio_complete(req); > @@ -648,6 +649,7 @@ void nvme_init_request(struct request *req, struct nvme_command *cmd) > cmd->common.flags &= ~NVME_CMD_SGL_ALL; > > req->cmd_flags |= REQ_FAILFAST_DRIVER; > + req->rq_flags |= RQF_QUIET; > if (req->mq_hctx->type == HCTX_TYPE_POLL) > req->cmd_flags |= REQ_POLLED; > nvme_clear_nvme_request(req); > I tested the change with an instrumented qemu emulated device. Without the change: # dmesg | grep -i nvme [ 3.610944] nvme nvme0: pci function 0000:00:04.0 [ 3.685467] nvme0: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) DNR [ 3.687850] nvme nvme0: 1/0/0 default/read/poll queues With the change: # dmesg | grep -i nvme [ 3.437014] nvme nvme0: pci function 0000:00:04.0 [ 3.506572] nvme nvme0: 1/0/0 default/read/poll queues Would you like me to submit the patch? Alan ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests 2022-04-03 14:28 [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests Yi Zhang 2022-04-04 16:39 ` Alan Adamson @ 2022-04-05 20:00 ` Jason A. Donenfeld 2022-04-05 20:48 ` Alan Adamson 2022-06-09 8:20 ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] Jason A. Donenfeld 1 sibling, 2 replies; 31+ messages in thread From: Jason A. Donenfeld @ 2022-04-05 20:00 UTC (permalink / raw) To: Yi Zhang, alan.adamson Cc: open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, linux-kernel Hi Alan, I too am seeing this. Tracking it down to the same commit, I decided to enable NVME_VERBOSE_ERRORS to get some more information. Now on boot and everytime I wake up from sleep, I see: [ 89.098578] nvme nvme0: Shutdown timeout set to 8 seconds [ 89.098683] nvme0: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE [ 89.119363] nvme nvme0: 16/0/0 default/read/poll queues With that middle line in red. Question is: is this actually an error? If not, maybe it shouldn't be printed as a KERN_ERR. And if it's printed as a KERN_INFO, maybe it should only do so when CONFIG_NVME_VERBOSE_ERRORS=y? Or do you think there is actually some other diagnostic value in having this print always? Using a Samsung SSD 970 EVO Plus 2TB, firmware version 2B2QEXM7, in case that's useful info. I also noticed a ~2 second boot delay on 5.18-rc1: [ 0.917631] pstore: Using crash dump compression: deflate [ 0.917807] Key type encrypted registered [ 0.951840] ACPI: battery: Slot [BAT0] (battery present) [ 3.146765] nvme nvme0: Shutdown timeout set to 8 seconds [ 3.146918] nvme0: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE [ 3.188852] nvme nvme0: 16/0/0 default/read/poll queues [ 3.198163] nvme0n1: p1 p2 [ 3.199554] Freeing unused kernel image (initmem) memory: 12952K I haven't looked into it much, but I assume it's also NVMe related? Or maybe the vconsole is just initializing faster so I see text where before I didn't. Not sure. Regards, Jason ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests 2022-04-05 20:00 ` Jason A. Donenfeld @ 2022-04-05 20:48 ` Alan Adamson 2022-06-09 8:20 ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] Jason A. Donenfeld 1 sibling, 0 replies; 31+ messages in thread From: Alan Adamson @ 2022-04-05 20:48 UTC (permalink / raw) To: Jason A. Donenfeld Cc: Yi Zhang, open list:NVM EXPRESS DRIVER, Sagi Grimberg, linux-kernel Hi Jason, It’s a harmless error, but we are looking at suppressing it. Alan > On Apr 5, 2022, at 1:00 PM, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > Hi Alan, > > I too am seeing this. Tracking it down to the same commit, I decided to > enable NVME_VERBOSE_ERRORS to get some more information. Now on boot and > everytime I wake up from sleep, I see: > > [ 89.098578] nvme nvme0: Shutdown timeout set to 8 seconds > [ 89.098683] nvme0: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE > [ 89.119363] nvme nvme0: 16/0/0 default/read/poll queues > > With that middle line in red. > > Question is: is this actually an error? If not, maybe it shouldn't be > printed as a KERN_ERR. And if it's printed as a KERN_INFO, maybe it > should only do so when CONFIG_NVME_VERBOSE_ERRORS=y? Or do you think > there is actually some other diagnostic value in having this print > always? > > Using a Samsung SSD 970 EVO Plus 2TB, firmware version 2B2QEXM7, in case > that's useful info. > > I also noticed a ~2 second boot delay on 5.18-rc1: > > [ 0.917631] pstore: Using crash dump compression: deflate > [ 0.917807] Key type encrypted registered > [ 0.951840] ACPI: battery: Slot [BAT0] (battery present) > [ 3.146765] nvme nvme0: Shutdown timeout set to 8 seconds > [ 3.146918] nvme0: Identify(0x6), Invalid Field in Command (sct 0x0 / sc 0x2) MORE > [ 3.188852] nvme nvme0: 16/0/0 default/read/poll queues > [ 3.198163] nvme0n1: p1 p2 > [ 3.199554] Freeing unused kernel image (initmem) memory: 12952K > > I haven't looked into it much, but I assume it's also NVMe related? Or > maybe the vconsole is just initializing faster so I see text where > before I didn't. Not sure. > > Regards, > Jason ^ permalink raw reply [flat|nested] 31+ messages in thread
* 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-04-05 20:00 ` Jason A. Donenfeld 2022-04-05 20:48 ` Alan Adamson @ 2022-06-09 8:20 ` Jason A. Donenfeld 2022-06-09 8:34 ` Jason A. Donenfeld 1 sibling, 1 reply; 31+ messages in thread From: Jason A. Donenfeld @ 2022-06-09 8:20 UTC (permalink / raw) To: Yi Zhang, alan.adamson; +Cc: open list:NVM EXPRESS DRIVER, Sagi Grimberg, LKML Hi folks, On Tue, Apr 5, 2022 at 10:00 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote: > Using a Samsung SSD 970 EVO Plus 2TB, firmware version 2B2QEXM7, in case > that's useful info. > > I also noticed a ~2 second boot delay on 5.18-rc1: Just FYI, I am still seeing this delay in 5.19-rc1. Boot lines from 5.17: [ 0.882680] nvme nvme1: missing or invalid SUBNQN field. [ 0.882719] nvme nvme1: Shutdown timeout set to 10 seconds [ 0.885227] nvme nvme1: 8/0/0 default/read/poll queues [ 0.887910] nvme1n1: p1 p2 p3 [ 0.888317] nvme nvme0: missing or invalid SUBNQN field. [ 0.888361] nvme nvme0: Shutdown timeout set to 8 seconds [ 0.906301] nvme nvme0: 16/0/0 default/read/poll queues [ 0.910087] nvme0n1: p1 p2 Boot lines from 5.18 & 5.19: [ 0.846827] nvme nvme1: missing or invalid SUBNQN field. [ 0.846857] nvme nvme1: Shutdown timeout set to 10 seconds [ 0.849043] nvme nvme1: 8/0/0 default/read/poll queues [ 0.851595] nvme1n1: p1 p2 p3 [ 3.226962] nvme nvme0: Shutdown timeout set to 8 seconds [ 3.253890] nvme nvme0: 16/0/0 default/read/poll queues [ 3.263778] nvme0n1: p1 p2 The Samsung 970 EVO Plus has a ~2 second delay that wasn't there in 5.17. Any idea what's going on? Thanks, Jason ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-09 8:20 ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] Jason A. Donenfeld @ 2022-06-09 8:34 ` Jason A. Donenfeld 2022-06-09 8:40 ` [PATCH] Revert "nvme-pci: add quirks for Samsung X5 SSDs" Jason A. Donenfeld 2022-06-09 9:32 ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] R, Monish Kumar 0 siblings, 2 replies; 31+ messages in thread From: Jason A. Donenfeld @ 2022-06-09 8:34 UTC (permalink / raw) To: monish.kumar.r Cc: open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Christoph Hellwig, abhijeet.rao Hey again, Figured it out. 2.3 seconds to be exact... It looks like this is caused by: bc360b0b1611 ("nvme-pci: add quirks for Samsung X5 SSDs") https://lore.kernel.org/all/20220316075449.18906-1-monish.kumar.r@intel.com/ This commit doesn't have any justification and got applied without much discussion. Perhaps Monish could supply some more info about why this is needed here? FTR, I have no issues on my system when reverting that. Perhaps it should be reverted. (I can send a revert commit for that if necessary.) Looking further, however, the PCIe ID is said to be for a "Samsung X5", which Google says is a portable thunderbolt drive. Is the PCIe ID correct? On my system, this is the PCIe ID of a Samsung 970 EVO Plus. Is it possible that Monish copied and pasted the wrong PCIe ID? Or has Samsung *reused* the same PCIe ID on both devices? In which case, we'd need some additional data for that quirk to avoid the delay. Also note that this (potentially errant) commit has been backported to stable. Jason ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH] Revert "nvme-pci: add quirks for Samsung X5 SSDs" 2022-06-09 8:34 ` Jason A. Donenfeld @ 2022-06-09 8:40 ` Jason A. Donenfeld 2022-06-09 9:32 ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] R, Monish Kumar 1 sibling, 0 replies; 31+ messages in thread From: Jason A. Donenfeld @ 2022-06-09 8:40 UTC (permalink / raw) To: open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Christoph Hellwig, abhijeet.rao, monish.kumar.r Cc: Jason A. Donenfeld, stable This reverts commit bc360b0b1611566e1bd47384daf49af6a1c51837. This matches the hardware identifier of the Samsung 970 EVO Plus, a very popular internal laptop NVMe drive, which is not the Samsung X5, an external thunderbolt drive. In particular, this causes: 1) a 2.3 second boot time delay; and 2) disabling of deep power saving states. So just revert this until whatever funny business can be worked out regarding Samsung's PCI IDs. Fixes: bc360b0b1611 ("nvme-pci: add quirks for Samsung X5 SSDs") Cc: stable@vger.kernel.org Cc: Monish Kumar R <monish.kumar.r@intel.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> --- drivers/nvme/host/pci.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 48f4f6eb877b..47b9e3e0ea5a 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -3483,10 +3483,7 @@ static const struct pci_device_id nvme_id_table[] = { NVME_QUIRK_128_BYTES_SQES | NVME_QUIRK_SHARED_TAGS | NVME_QUIRK_SKIP_CID_GEN }, - { PCI_DEVICE(0x144d, 0xa808), /* Samsung X5 */ - .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY| - NVME_QUIRK_NO_DEEPEST_PS | - NVME_QUIRK_IGNORE_DEV_SUBNQN, }, + { PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xffffff) }, { 0, } }; -- 2.35.1 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* RE: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-09 8:34 ` Jason A. Donenfeld 2022-06-09 8:40 ` [PATCH] Revert "nvme-pci: add quirks for Samsung X5 SSDs" Jason A. Donenfeld @ 2022-06-09 9:32 ` R, Monish Kumar 2022-06-09 9:38 ` Jason A. Donenfeld 1 sibling, 1 reply; 31+ messages in thread From: R, Monish Kumar @ 2022-06-09 9:32 UTC (permalink / raw) To: Jason A. Donenfeld Cc: open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Christoph Hellwig, Rao, Abhijeet Hi Jason, I would like to provide justification for this Samsung X5 SSD fix added. We were facing SSD enumeration issue after cold / warm reboot with device connected ends up with probe failures. When I debug on this issue, I could find that this device was not enumerating once the system got booted. Moreover, we were facing this enumeration issue specific to this device. Based on analysis, due to deep power state of the device fails to enumerate. So, added the following quirks as a workaround fixe and it helps to enumerate the device after cold/warm reboot. If new Samsung X5 SSD's are working fine as expected, we can remove those fix. Regarding the PCI-Id's, I have confirmed from the logs and it shows as vendor ID : 0x144d device ID : 0xa808. I am not sure about why Samsung 970 EVO Plus have the same PCI-Ids. Logs for reference : After connecting Samsung X5 SSD. lspci 04:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983 dmesg Line 1478: <6>[ 112.838998] pci 0000:04:00.0: [144d:a808] type 00 class 0x010802 Line 1479: <6>[ 112.845765] pci 0000:04:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit] Line 1480: <6>[ 112.853715] pci 0000:04:00.0: 8.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x4 link at 0000:00:07.0 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link) Line 1481: <6>[ 112.870536] pci 0000:04:00.0: Adding to iommu group 22 Line 1498: <6>[ 113.019698] pci 0000:04:00.0: BAR 0: assigned [mem 0x83000000-0x83003fff 64bit] Regards, Monish Kumar R -----Original Message----- From: Jason A. Donenfeld <Jason@zx2c4.com> Sent: 09 June 2022 14:04 To: R, Monish Kumar <monish.kumar.r@intel.com> Cc: open list:NVM EXPRESS DRIVER <linux-nvme@lists.infradead.org>; Sagi Grimberg <sagi@grimberg.me>; alan.adamson@oracle.com; LKML <linux-kernel@vger.kernel.org>; Yi Zhang <yi.zhang@redhat.com>; Keith Busch <kbusch@kernel.org>; axboe@fb.com; Christoph Hellwig <hch@lst.de>; Rao, Abhijeet <abhijeet.rao@intel.com> Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] Hey again, Figured it out. 2.3 seconds to be exact... It looks like this is caused by: bc360b0b1611 ("nvme-pci: add quirks for Samsung X5 SSDs") https://lore.kernel.org/all/20220316075449.18906-1-monish.kumar.r@intel.com/ This commit doesn't have any justification and got applied without much discussion. Perhaps Monish could supply some more info about why this is needed here? FTR, I have no issues on my system when reverting that. Perhaps it should be reverted. (I can send a revert commit for that if necessary.) Looking further, however, the PCIe ID is said to be for a "Samsung X5", which Google says is a portable thunderbolt drive. Is the PCIe ID correct? On my system, this is the PCIe ID of a Samsung 970 EVO Plus. Is it possible that Monish copied and pasted the wrong PCIe ID? Or has Samsung *reused* the same PCIe ID on both devices? In which case, we'd need some additional data for that quirk to avoid the delay. Also note that this (potentially errant) commit has been backported to stable. Jason ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-09 9:32 ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] R, Monish Kumar @ 2022-06-09 9:38 ` Jason A. Donenfeld 2022-06-10 6:14 ` Christoph Hellwig 0 siblings, 1 reply; 31+ messages in thread From: Jason A. Donenfeld @ 2022-06-09 9:38 UTC (permalink / raw) To: R, Monish Kumar Cc: open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Christoph Hellwig, Rao, Abhijeet Hi Monish, On Thu, Jun 09, 2022 at 09:32:02AM +0000, R, Monish Kumar wrote: > Hi Jason, > > I would like to provide justification for this Samsung X5 SSD fix added. > We were facing SSD enumeration issue after cold / warm reboot with device > connected ends up with probe failures. > > When I debug on this issue, I could find that this device was not enumerating > once the system got booted. Moreover, we were facing this enumeration issue > specific to this device. > > Based on analysis, due to deep power state of the device fails to enumerate. > So, added the following quirks as a workaround fixe and it helps to enumerate the device after cold/warm reboot. If new Samsung X5 SSD's are working fine as expected, we can remove those > fix. FWIW, all of that should have been in the commit message. Also, "based on analysis" - what analysis exactly? I have no way of thinking more about the issue at hand other than, "Monish said things are like this in a lab". In any case, I believe the 970 ID predates that of the X5, and destroying battery on those laptops and introducing boot time delays isn't really okay. So let's just revert this until somebody can work out better how to differentiate drives that need a quirk from drives that don't need a quirk. I sent this in: https://lore.kernel.org/lkml/20220609084051.4445-1-Jason@zx2c4.com/ Jason ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-09 9:38 ` Jason A. Donenfeld @ 2022-06-10 6:14 ` Christoph Hellwig 2022-06-10 9:19 ` Jason A. Donenfeld 2022-06-10 12:05 ` Pankaj Raghav 0 siblings, 2 replies; 31+ messages in thread From: Christoph Hellwig @ 2022-06-10 6:14 UTC (permalink / raw) To: Jason A. Donenfeld Cc: R, Monish Kumar, open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Christoph Hellwig, Rao, Abhijeet On Thu, Jun 09, 2022 at 11:38:47AM +0200, Jason A. Donenfeld wrote: > FWIW, all of that should have been in the commit message. Also, "based > on analysis" - what analysis exactly? I have no way of thinking more > about the issue at hand other than, "Monish said things are like this in > a lab". Please calm down a bit. His report is at least as good as your new report here.. > In any case, I believe the 970 ID predates that of the X5, and Huh? The 970 seems to actually be very slightly newer than the X5. What I suspect is that they actually are the same m.2 SSD or at least a very similar one and Samsung decided to ship it in the thunderbolt attached versions first. Maybe one of the Samsung folks here can confirm. That leaves us with two plausible theories: - the problems could be due to an earlier firmware version or ASIC stepping - the problems are due to the thunderbolt attachment Monish and Jason, can you please send me the output of nvme id-ctrl /dev/nvmeX (where /dev/nvmeX is the actual device number)? Monish, can you check if you are using the latest available firmware and if not update it and check if you still need the quirks. > destroying battery on those laptops and introducing boot time delays > isn't really okay. So let's just revert this until somebody can work out > better how to differentiate drives that need a quirk from drives that > don't need a quirk. While I'd really like to fix those issue, they are less severe than not being able to use a device at all. And just as a reminder: if you want to get anything please be nice to people and try work with them productively. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-10 6:14 ` Christoph Hellwig @ 2022-06-10 9:19 ` Jason A. Donenfeld 2022-06-13 6:36 ` R, Monish Kumar 2022-06-13 13:55 ` Christoph Hellwig 2022-06-10 12:05 ` Pankaj Raghav 1 sibling, 2 replies; 31+ messages in thread From: Jason A. Donenfeld @ 2022-06-10 9:19 UTC (permalink / raw) To: Christoph Hellwig Cc: R, Monish Kumar, open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Rao, Abhijeet Hi Christoph, On Fri, Jun 10, 2022 at 08:14:49AM +0200, Christoph Hellwig wrote: > That leaves us with two plausible theories: > > - the problems could be due to an earlier firmware version or > ASIC stepping > - the problems are due to the thunderbolt attachment Right, that seems like the set of variance we're dealing with. If it's a firmware version issue, then we revert because people can update? Or can we quirk firmware version numbers too? If it's ASIC stepping, I guess we need to quirk that. And likewise thunderbolt, but that seems more awkward to quirk around, because afaik, it all just appears as PCIe? > Monish and Jason, can you please send me the output of nvme id-ctrl > /dev/nvmeX (where /dev/nvmeX is the actual device number)? NVME Identify Controller: vid : 0x144d ssvid : 0x144d sn : <redacted> mn : Samsung SSD 970 EVO Plus 2TB fr : 2B2QEXM7 rab : 2 ieee : 002538 cmic : 0 mdts : 9 cntlid : 0x4 ver : 0x10300 rtd3r : 0x30d40 rtd3e : 0x7a1200 oaes : 0 ctratt : 0 rrls : 0 cntrltype : 0 fguid : crdt1 : 0 crdt2 : 0 crdt3 : 0 nvmsr : 0 vwci : 0 mec : 0 oacs : 0x17 acl : 7 aerl : 3 frmw : 0x16 lpa : 0x3 elpe : 63 npss : 4 avscc : 0x1 apsta : 0x1 wctemp : 358 cctemp : 358 mtfa : 0 hmpre : 0 hmmin : 0 tnvmcap : 2000398934016 unvmcap : 0 rpmbs : 0 edstt : 35 dsto : 0 fwug : 0 kas : 0 hctma : 0x1 mntmt : 356 mxtmt : 358 sanicap : 0 hmminds : 0 hmmaxd : 0 nsetidmax : 0 endgidmax : 0 anatt : 0 anacap : 0 anagrpmax : 0 nanagrpid : 0 pels : 0 domainid : 0 megcap : 0 sqes : 0x66 cqes : 0x44 maxcmd : 0 nn : 1 oncs : 0x5f fuses : 0 fna : 0x5 vwc : 0x1 awun : 1023 awupf : 0 icsvscc : 1 nwpc : 0 acwu : 0 ocfs : 0 sgls : 0 mnan : 0 maxdna : 0 maxcna : 0 subnqn : ioccsz : 0 iorcsz : 0 icdoff : 0 fcatt : 0 msdbd : 0 ofcs : 0 ps 0 : mp:7.50W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- ps 1 : mp:5.90W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:- ps 2 : mp:3.60W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- ps 3 : mp:0.0700W non-operational enlat:210 exlat:1200 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- ps 4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- Jason ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-10 9:19 ` Jason A. Donenfeld @ 2022-06-13 6:36 ` R, Monish Kumar 2022-06-13 12:20 ` Jason A. Donenfeld 2022-06-13 13:55 ` Christoph Hellwig 1 sibling, 1 reply; 31+ messages in thread From: R, Monish Kumar @ 2022-06-13 6:36 UTC (permalink / raw) To: Christoph Hellwig Cc: open list:NVM EXPRESS DRIVER, Jason A. Donenfeld, Sagi Grimberg, alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Rao, Abhijeet Hi Christoph, Please see the below nvme id-ctrl response of Samsung X5 SSD. NVME Identify Controller: vid : 0x144d ssvid : 0x144d sn : <redacted> mn : Samsung Portable SSD X5 fr : 1P3QEXE7 rab : 2 ieee : 002538 cmic : 0 mdts : 9 cntlid : 4 ver : 10300 rtd3r : 30d40 rtd3e : 7a1200 oaes : 0 ctratt : 0 rrls : 0 oacs : 0x7 acl : 7 aerl : 3 frmw : 0x16 lpa : 0x3 elpe : 63 npss : 4 avscc : 0x1 apsta : 0x1 wctemp : 329 cctemp : 330 mtfa : 0 hmpre : 0 hmmin : 0 tnvmcap : 500107862016 unvmcap : 0 rpmbs : 0 edstt : 0 dsto : 0 fwug : 0 kas : 0 hctma : 0 mntmt : 0 mxtmt : 0 sanicap : 0 hmminds : 0 hmmaxd : 0 nsetidmax : 0 sqes : 0x66 cqes : 0x44 maxcmd : 0 nn : 1 oncs : 0x1f fuses : 0 fna : 0x5 vwc : 0x1 awun : 127 awupf : 0 nvscc : 1 acwu : 0 sgls : 0 subnqn : ioccsz : 0 iorcsz : 0 icdoff : 0 ctrattr : 0 msdbd : 0 ps 0 : mp:6.20W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- ps 1 : mp:4.30W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:- ps 2 : mp:2.10W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- ps 3 : mp:0.0400W non-operational enlat:210 exlat:1200 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- ps 4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- Regards, Monish Kumar R -----Original Message----- From: Jason A. Donenfeld <Jason@zx2c4.com> Sent: 10 June 2022 14:50 To: Christoph Hellwig <hch@lst.de> Cc: R, Monish Kumar <monish.kumar.r@intel.com>; open list:NVM EXPRESS DRIVER <linux-nvme@lists.infradead.org>; Sagi Grimberg <sagi@grimberg.me>; alan.adamson@oracle.com; LKML <linux-kernel@vger.kernel.org>; Yi Zhang <yi.zhang@redhat.com>; Keith Busch <kbusch@kernel.org>; axboe@fb.com; Rao, Abhijeet <abhijeet.rao@intel.com> Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] Hi Christoph, On Fri, Jun 10, 2022 at 08:14:49AM +0200, Christoph Hellwig wrote: > That leaves us with two plausible theories: > > - the problems could be due to an earlier firmware version or > ASIC stepping > - the problems are due to the thunderbolt attachment Right, that seems like the set of variance we're dealing with. If it's a firmware version issue, then we revert because people can update? Or can we quirk firmware version numbers too? If it's ASIC stepping, I guess we need to quirk that. And likewise thunderbolt, but that seems more awkward to quirk around, because afaik, it all just appears as PCIe? > Monish and Jason, can you please send me the output of nvme id-ctrl > /dev/nvmeX (where /dev/nvmeX is the actual device number)? NVME Identify Controller: vid : 0x144d ssvid : 0x144d sn : <redacted> mn : Samsung SSD 970 EVO Plus 2TB fr : 2B2QEXM7 rab : 2 ieee : 002538 cmic : 0 mdts : 9 cntlid : 0x4 ver : 0x10300 rtd3r : 0x30d40 rtd3e : 0x7a1200 oaes : 0 ctratt : 0 rrls : 0 cntrltype : 0 fguid : crdt1 : 0 crdt2 : 0 crdt3 : 0 nvmsr : 0 vwci : 0 mec : 0 oacs : 0x17 acl : 7 aerl : 3 frmw : 0x16 lpa : 0x3 elpe : 63 npss : 4 avscc : 0x1 apsta : 0x1 wctemp : 358 cctemp : 358 mtfa : 0 hmpre : 0 hmmin : 0 tnvmcap : 2000398934016 unvmcap : 0 rpmbs : 0 edstt : 35 dsto : 0 fwug : 0 kas : 0 hctma : 0x1 mntmt : 356 mxtmt : 358 sanicap : 0 hmminds : 0 hmmaxd : 0 nsetidmax : 0 endgidmax : 0 anatt : 0 anacap : 0 anagrpmax : 0 nanagrpid : 0 pels : 0 domainid : 0 megcap : 0 sqes : 0x66 cqes : 0x44 maxcmd : 0 nn : 1 oncs : 0x5f fuses : 0 fna : 0x5 vwc : 0x1 awun : 1023 awupf : 0 icsvscc : 1 nwpc : 0 acwu : 0 ocfs : 0 sgls : 0 mnan : 0 maxdna : 0 maxcna : 0 subnqn : ioccsz : 0 iorcsz : 0 icdoff : 0 fcatt : 0 msdbd : 0 ofcs : 0 ps 0 : mp:7.50W operational enlat:0 exlat:0 rrt:0 rrl:0 rwt:0 rwl:0 idle_power:- active_power:- ps 1 : mp:5.90W operational enlat:0 exlat:0 rrt:1 rrl:1 rwt:1 rwl:1 idle_power:- active_power:- ps 2 : mp:3.60W operational enlat:0 exlat:0 rrt:2 rrl:2 rwt:2 rwl:2 idle_power:- active_power:- ps 3 : mp:0.0700W non-operational enlat:210 exlat:1200 rrt:3 rrl:3 rwt:3 rwl:3 idle_power:- active_power:- ps 4 : mp:0.0050W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4 rwt:4 rwl:4 idle_power:- active_power:- Jason ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-13 6:36 ` R, Monish Kumar @ 2022-06-13 12:20 ` Jason A. Donenfeld 2022-06-13 13:02 ` R, Monish Kumar 0 siblings, 1 reply; 31+ messages in thread From: Jason A. Donenfeld @ 2022-06-13 12:20 UTC (permalink / raw) To: R, Monish Kumar Cc: Christoph Hellwig, open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Rao, Abhijeet On Mon, Jun 13, 2022 at 06:36:44AM +0000, R, Monish Kumar wrote: > mn : Samsung Portable SSD X5 > fr : 1P3QEXE7 Isn't the latest firmware for the QEXE7 line 2B2QEXE7? Jason ^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-13 12:20 ` Jason A. Donenfeld @ 2022-06-13 13:02 ` R, Monish Kumar 0 siblings, 0 replies; 31+ messages in thread From: R, Monish Kumar @ 2022-06-13 13:02 UTC (permalink / raw) To: Jason A. Donenfeld, Pankaj Raghav Cc: Christoph Hellwig, open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Rao, Abhijeet, p.raghav Jason, I am not sure, whether this running firmware (fr: 1P3QEXE7) was the latest one. Moreover, this device is a newer one and we bought couple of months back. When I am tried to update SSD firmware using Samsung Portable SSD Software, it failed to communicate with their Samsung servers. Let @Pankaj Raghav <pankydev8@gmail.com> can get back on this and confirm about this firmware version, as he reported this issue internally to Samsung. Regards, Monish Kumar R -----Original Message----- From: Jason A. Donenfeld <Jason@zx2c4.com> Sent: 13 June 2022 17:51 To: R, Monish Kumar <monish.kumar.r@intel.com> Cc: Christoph Hellwig <hch@lst.de>; open list:NVM EXPRESS DRIVER <linux-nvme@lists.infradead.org>; Sagi Grimberg <sagi@grimberg.me>; alan.adamson@oracle.com; LKML <linux-kernel@vger.kernel.org>; Yi Zhang <yi.zhang@redhat.com>; Keith Busch <kbusch@kernel.org>; axboe@fb.com; Rao, Abhijeet <abhijeet.rao@intel.com> Subject: Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] On Mon, Jun 13, 2022 at 06:36:44AM +0000, R, Monish Kumar wrote: > mn : Samsung Portable SSD X5 > fr : 1P3QEXE7 Isn't the latest firmware for the QEXE7 line 2B2QEXE7? Jason ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-10 9:19 ` Jason A. Donenfeld 2022-06-13 6:36 ` R, Monish Kumar @ 2022-06-13 13:55 ` Christoph Hellwig 2022-06-15 10:27 ` Pankaj Raghav 1 sibling, 1 reply; 31+ messages in thread From: Christoph Hellwig @ 2022-06-13 13:55 UTC (permalink / raw) To: Jason A. Donenfeld Cc: Christoph Hellwig, R, Monish Kumar, open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Rao, Abhijeet On Fri, Jun 10, 2022 at 11:19:31AM +0200, Jason A. Donenfeld wrote: > Right, that seems like the set of variance we're dealing with. If it's a > firmware version issue, then we revert because people can update? Or can > we quirk firmware version numbers too? We can quirk on firmware version and model number as well. Those quirks need to go into the core nvme module and not just the PCI driver, though. > If it's ASIC stepping, I guess we > need to quirk that. And likewise thunderbolt, but that seems more > awkward to quirk around, because afaik, it all just appears as PCIe? It all appears as PCIe, but the pci_dev has an is_thunderbolt flag. Thanks to both of you for the information. I'd like to wait until the end of the week or so if we can hear something from Samsung, and if we don't we'll have to quirk based on the model number. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-13 13:55 ` Christoph Hellwig @ 2022-06-15 10:27 ` Pankaj Raghav 2022-06-15 11:31 ` Christoph Hellwig 0 siblings, 1 reply; 31+ messages in thread From: Pankaj Raghav @ 2022-06-15 10:27 UTC (permalink / raw) To: Christoph Hellwig Cc: Jason A. Donenfeld, R, Monish Kumar, open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Rao, Abhijeet Hi Christoph, On Mon, Jun 13, 2022 at 03:55:49PM +0200, Christoph Hellwig wrote: > It all appears as PCIe, but the pci_dev has an is_thunderbolt flag. > > Thanks to both of you for the information. I'd like to wait until the > end of the week or so if we can hear something from Samsung, and if we > don't we'll have to quirk based on the model number. Our FW team has started looking into the issue. They said they will try to come up with a solution before 5.20. If not, we can add this quirk based on the FW ver. and a proper solution can be added by 5.21. > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-15 10:27 ` Pankaj Raghav @ 2022-06-15 11:31 ` Christoph Hellwig 2022-06-20 3:36 ` onenowy 2022-06-20 4:34 ` SungHwan Jung 0 siblings, 2 replies; 31+ messages in thread From: Christoph Hellwig @ 2022-06-15 11:31 UTC (permalink / raw) To: Pankaj Raghav Cc: Christoph Hellwig, Jason A. Donenfeld, R, Monish Kumar, open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Rao, Abhijeet On Wed, Jun 15, 2022 at 12:27:57PM +0200, Pankaj Raghav wrote: > Hi Christoph, > On Mon, Jun 13, 2022 at 03:55:49PM +0200, Christoph Hellwig wrote: > > It all appears as PCIe, but the pci_dev has an is_thunderbolt flag. > > > > Thanks to both of you for the information. I'd like to wait until the > > end of the week or so if we can hear something from Samsung, and if we > > don't we'll have to quirk based on the model number. > Our FW team has started looking into the issue. They said they will try > to come up with a solution before 5.20. If not, we can add this quirk > based on the FW ver. and a proper solution can be added by 5.21. I don't think we can wait for 5.20 - the "offending" commit is in 5.19-rc and -stable. So I'll plan to prepare a patch based on the model number for now, still hoping we can come up with something better eventually. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-15 11:31 ` Christoph Hellwig @ 2022-06-20 3:36 ` onenowy 2022-06-20 7:09 ` Christoph Hellwig 2022-06-20 4:34 ` SungHwan Jung 1 sibling, 1 reply; 31+ messages in thread From: onenowy @ 2022-06-20 3:36 UTC (permalink / raw) To: hch Cc: Jason, abhijeet.rao, alan.adamson, axboe, kbusch, linux-kernel, linux-nvme, monish.kumar.r, pankydev8, sagi, yi.zhang > I don't think we can wait for 5.20 - the "offending" commit is in > 5.19-rc and -stable. So I'll plan to prepare a patch based on the model > number for now, still hoping we can come up with something better > eventually. Hi , Some samsung SSD for OEM also have the identical PCI-ids and are affected by this quirk. But they have different subsystem-ids. For example, model number: MZQLB1T9HAJR-00V3 for lenovo vendor: 144d device: a808 subvendor: 1d49 subdevice: 403b model number: MZVLB256HBHQ-00000 vendor: 144d device: a808 subvendor: 144d subdevice: a801 Addtition of subsystem-ids of X5 to pci_device_id(as below) may solve this problem. diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index 17aeb7d5c48522..92fd3b1d88fc95 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -3475,7 +3475,7 @@ static const struct pci_device_id nvme_id_table[] = { NVME_QUIRK_128_BYTES_SQES | NVME_QUIRK_SHARED_TAGS | NVME_QUIRK_SKIP_CID_GEN }, - { PCI_DEVICE(0x144d, 0xa808), /* Samsung X5 */ + { PCI_DEVICE_SUB(0x144d, 0xa808, {X5 subvendor?}, {X5 subdevice?}), /* Samsung X5 */ .driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY| NVME_QUIRK_NO_DEEPEST_PS | NVME_QUIRK_IGNORE_DEV_SUBNQN, }, But I don't know X5's subsystem ids, Can someone provide subsystem-ids of X5? ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-20 3:36 ` onenowy @ 2022-06-20 7:09 ` Christoph Hellwig 0 siblings, 0 replies; 31+ messages in thread From: Christoph Hellwig @ 2022-06-20 7:09 UTC (permalink / raw) To: onenowy Cc: hch, Jason, abhijeet.rao, alan.adamson, axboe, kbusch, linux-kernel, linux-nvme, monish.kumar.r, pankydev8, sagi, yi.zhang On Mon, Jun 20, 2022 at 12:36:27PM +0900, onenowy wrote: > Some samsung SSD for OEM also have the identical PCI-ids and are affected by this quirk. > But they have different subsystem-ids. > Addtition of subsystem-ids of X5 to pci_device_id(as below) may solve this problem. Monish, can you look into that? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-15 11:31 ` Christoph Hellwig 2022-06-20 3:36 ` onenowy @ 2022-06-20 4:34 ` SungHwan Jung 1 sibling, 0 replies; 31+ messages in thread From: SungHwan Jung @ 2022-06-20 4:34 UTC (permalink / raw) To: hch Cc: Jason, abhijeet.rao, alan.adamson, axboe, kbusch, linux-kernel, linux-nvme, monish.kumar.r, pankydev8, sagi, yi.zhang Hi again, I'm sorry to send mail again. I think it would be to confirm that subsystem-ids for all X5 model and these ids are not used for other ssd by contact with samsung. But I have no idea how to contant with them, please fowards this to them or if you already have these information, please tests the patch in previous mail. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] 2022-06-10 6:14 ` Christoph Hellwig 2022-06-10 9:19 ` Jason A. Donenfeld @ 2022-06-10 12:05 ` Pankaj Raghav 1 sibling, 0 replies; 31+ messages in thread From: Pankaj Raghav @ 2022-06-10 12:05 UTC (permalink / raw) To: Christoph Hellwig Cc: Jason A. Donenfeld, R, Monish Kumar, open list:NVM EXPRESS DRIVER, Sagi Grimberg, alan.adamson, LKML, Yi Zhang, Keith Busch, axboe, Rao, Abhijeet, Pankaj Raghav, Javier González On Fri, Jun 10, 2022 at 08:14:49AM +0200, Christoph Hellwig wrote: > The 970 seems to actually be very slightly newer than the X5. What > I suspect is that they actually are the same m.2 SSD or at least a > very similar one and Samsung decided to ship it in the thunderbolt > attached versions first. Maybe one of the Samsung folks here can > confirm. > > That leaves us with two plausible theories: > > - the problems could be due to an earlier firmware version or > ASIC stepping > - the problems are due to the thunderbolt attachment > I have forwarded this report internally within Samsung and I will post an update once I have more information about this issue. Cheers, Pankaj ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2022-06-20 7:09 UTC | newest] Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2022-04-03 14:28 [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests Yi Zhang 2022-04-04 16:39 ` Alan Adamson 2022-04-04 17:02 ` Keith Busch 2022-04-04 19:34 ` Jonathan Derrick 2022-04-04 20:30 ` Keith Busch 2022-04-05 0:43 ` Jonathan Derrick 2022-04-06 17:34 ` Keith Busch 2022-04-05 6:14 ` Christoph Hellwig 2022-04-05 18:21 ` Jonathan Derrick 2022-04-05 20:51 ` Alan Adamson 2022-04-05 21:56 ` Jonathan Derrick 2022-04-05 19:09 ` Alan Adamson 2022-04-05 20:00 ` Jason A. Donenfeld 2022-04-05 20:48 ` Alan Adamson 2022-06-09 8:20 ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] Jason A. Donenfeld 2022-06-09 8:34 ` Jason A. Donenfeld 2022-06-09 8:40 ` [PATCH] Revert "nvme-pci: add quirks for Samsung X5 SSDs" Jason A. Donenfeld 2022-06-09 9:32 ` 2 second nvme initialization delay regression in 5.18 [Was: Re: [bug report]nvme0: Admin Cmd(0x6), I/O Error (sct 0x0 / sc 0x2) MORE DNR observed during blktests] R, Monish Kumar 2022-06-09 9:38 ` Jason A. Donenfeld 2022-06-10 6:14 ` Christoph Hellwig 2022-06-10 9:19 ` Jason A. Donenfeld 2022-06-13 6:36 ` R, Monish Kumar 2022-06-13 12:20 ` Jason A. Donenfeld 2022-06-13 13:02 ` R, Monish Kumar 2022-06-13 13:55 ` Christoph Hellwig 2022-06-15 10:27 ` Pankaj Raghav 2022-06-15 11:31 ` Christoph Hellwig 2022-06-20 3:36 ` onenowy 2022-06-20 7:09 ` Christoph Hellwig 2022-06-20 4:34 ` SungHwan Jung 2022-06-10 12:05 ` Pankaj Raghav
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.