* smartctl "kills" specific drives on kernel 5.13 but works fine on 5.15 - why?
@ 2022-09-14 19:44 Nick Neumann
  2022-09-15 17:50 ` Nick Neumann
  0 siblings, 1 reply; 5+ messages in thread
From: Nick Neumann @ 2022-09-14 19:44 UTC (permalink / raw)
  To: linux-nvme

I'm running Ubuntu 20.04 LTS with HWE, which reports kernel
5.13.0-51-generic. Both a Crucial P5 1TB and a Crucial P5 2TB behave
rather poorly. With one drive installed, running

sudo smartctl -x /dev/nvme0

will output some info, then hang for a while, and then print
"NVME_IOCTL_ADMIN_CMD: Interrupted system call"

From that point on, the drives are gone from the system until I cut
and restore power (reboot is not enough).

Running smartctl against the drives works fine in Windows and in
Ubuntu 22.04 LTS, which reports kernel 5.15.0-43.

I thought for sure I'd find that a quirk for the drives had been added
between kernels 5.13 and 5.15, but alas, I don't see one. The PCI
Vendor/Device ID is 1344:5405 for the 1TB model, and while the Crucial
P2 has a quirk in drivers/nvme/host/pci.c, it has a different vendor
ID altogether (c0a9).

Any thoughts on where I can look or what I might compare to try to
figure out what changed to get the Crucial P5 drives behaving? I was
hoping there was some setting I could tweak to get them going without
having to move to 22.04 LTS. (I've tried
"nvme_core.default_ps_max_latency_us=0" and various values for
"pcie_aspm" with no luck.)

Thanks,
Nick


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: smartctl "kills" specific drives on kernel 5.13 but works fine on 5.15 - why?
  2022-09-14 19:44 smartctl "kills" specific drives on kernel 5.13 but works fine on 5.15 - why? Nick Neumann
@ 2022-09-15 17:50 ` Nick Neumann
  2022-09-15 20:03   ` Keith Busch
  0 siblings, 1 reply; 5+ messages in thread
From: Nick Neumann @ 2022-09-15 17:50 UTC (permalink / raw)
  To: linux-nvme

On Wed, Sep 14, 2022 at 2:44 PM Nick Neumann <nick@pcpartpicker.com> wrote:
>
> I'm running ubuntu 20.04 LTS with HWE, which reports kernel 5.13.0-51
> generic . Both a crucial P5 1TB and Crucial P5 2TB behave rather
> poorly. With one drive installed, running
>
> sudo smartctl -x /dev/nvme0
>
> will output some info, then hang for a while, and then print
> "NVME_IOCTL_ADMIN_CMD: Interrupted system call"
>
> From that point on, the drives are gone from the system until I cut
> and restore power (reboot is not enough).
>
> Running smartctl against the drives works fine in windows and in
> Ubuntu 22.04 LTS, which reports kernel 5.15.0-43
>
> I thought for sure I'd find that a quirk for the drives had been added
> between kernels 5.13 and 5.15, but alas, I don't see one. The PCI
> Vendor/Device ID is 1344:5405 for the 1TB model, and while the crucial
> P2 has a quirk in drivers/nvme/host/pci.c, it has a different vendor
> ID altogether (c0a9).
>
> Any thoughts on where I can look or what I might compare to try to
> figure out what changed to get the Crucial P5 drives behaving? I was
> hoping there was some setting I could tweak to get them going without
> having to move to 22.04 LTS. (I've tried
> "nvme_core.default_ps_max_latency_us=0" and various values for
> "pci_aspm" with no luck.)

Figured this out. It isn't a Linux kernel change, but rather a
smartctl change. (In hindsight I should have started digging there
first.)

The issue was https://www.smartmontools.org/ticket/1404, fixed by the
7.2 release (and Ubuntu 20.04 LTS is on 7.1). The fix in smartmontools
was to change to reading logs 4KB at a time, just like nvme-cli did in
https://github.com/linux-nvme/nvme-cli/commit/465a4d. (The device
advertises an MDTS of 9, so, as far as I understand,
reading in 4KB chunks should not be necessary; the smartmontools
author was not certain where the blame for the issue really belonged,
but changing to work like nvme-cli avoids it.)

For now I'll avoid reading the error log via smartctl on problematic
drives until I can move to a later smartmontools version.

Thanks,
Nick



* Re: smartctl "kills" specific drives on kernel 5.13 but works fine on 5.15 - why?
  2022-09-15 17:50 ` Nick Neumann
@ 2022-09-15 20:03   ` Keith Busch
  2022-09-15 20:48     ` Nick Neumann
  0 siblings, 1 reply; 5+ messages in thread
From: Keith Busch @ 2022-09-15 20:03 UTC (permalink / raw)
  To: Nick Neumann, y; +Cc: linux-nvme

On Thu, Sep 15, 2022 at 12:50:54PM -0500, Nick Neumann wrote:
> On Wed, Sep 14, 2022 at 2:44 PM Nick Neumann <nick@pcpartpicker.com> wrote:
> >
> > I'm running ubuntu 20.04 LTS with HWE, which reports kernel 5.13.0-51
> > generic . Both a crucial P5 1TB and Crucial P5 2TB behave rather
> > poorly. With one drive installed, running
> >
> > sudo smartctl -x /dev/nvme0
> >
> > will output some info, then hang for a while, and then print
> > "NVME_IOCTL_ADMIN_CMD: Interrupted system call"
> >
> > From that point on, the drives are gone from the system until I cut
> > and restore power (reboot is not enough).
> >
> > Running smartctl against the drives works fine in windows and in
> > Ubuntu 22.04 LTS, which reports kernel 5.15.0-43
> >
> > I thought for sure I'd find that a quirk for the drives had been added
> > between kernels 5.13 and 5.15, but alas, I don't see one. The PCI
> > Vendor/Device ID is 1344:5405 for the 1TB model, and while the crucial
> > P2 has a quirk in drivers/nvme/host/pci.c, it has a different vendor
> > ID altogether (c0a9).
> >
> > Any thoughts on where I can look or what I might compare to try to
> > figure out what changed to get the Crucial P5 drives behaving? I was
> > hoping there was some setting I could tweak to get them going without
> > having to move to 22.04 LTS. (I've tried
> > "nvme_core.default_ps_max_latency_us=0" and various values for
> > "pci_aspm" with no luck.)
> 
> Figured this out. It isn't a linux kernel change, but rather a
> smartctl change. (In hindsight I should have started digging there
> first.)
> 
> The issue was https://www.smartmontools.org/ticket/1404, fixed by the
> 7.2 release (and Ubuntu 20.04LTS is on 7.1). The fix in smartmontools
> was to change to reading logs 4KB at a time, just like nvme did in
> https://github.com/linux-nvme/nvme-cli/commit/465a4d. (The device
> advertises that it has an MDTS of 9 so, as far as I understand,
> reading in 4KB chunks should not be necessary; the smartmontools
> author was not certain where the blame for the issue really belonged,
> but changing to work like nvme-cli avoids it.)
> 
> For now I'll avoid reading the error log via smartctl on problematic
> drives until I can move to a later smartmontools version.

Not sure what MDTS has to do with this. The error log was originally defined to
be a max 4k size, which is below the smallest possible MDTS.

My guess is smartctl tricked the driver into allocating a PRP List, but the
controller instead accessed it as a PRP entry, which could corrupt memory or
fail the transaction if data direction is enforced by the memory controller.
Why that causes the nvme controller to fail as you've described is weird,
though.



* Re: smartctl "kills" specific drives on kernel 5.13 but works fine on 5.15 - why?
  2022-09-15 20:03   ` Keith Busch
@ 2022-09-15 20:48     ` Nick Neumann
  2022-09-15 22:24       ` Keith Busch
  0 siblings, 1 reply; 5+ messages in thread
From: Nick Neumann @ 2022-09-15 20:48 UTC (permalink / raw)
  To: Keith Busch; +Cc: y, linux-nvme

On Thu, Sep 15, 2022 at 3:03 PM Keith Busch <kbusch@kernel.org> wrote:

> Not sure what MDTS has to do with this. The error log was originally defined to
> be a max 4k size which is below the smallest possible MDTS.
>
> My guess is smartctl tricked the driver into allocating a PRP List, but the
> controller instead accessed it as a PRP entry, which could corrupt memory or
> fail the transaction if data direction is enforced by the memory controller.
> Why that causes the nvme controller to fail as you've described is weird,
> though.

I definitely don't know this stuff very well - the smartctl bug
commentary was referencing the nvme-cli commit where log pages are
transferred in 4k chunks to avoid having to worry about exceeding the
MDTS value. The problematic drives have error logs larger than 4K.

I believe the logic in the smartctl commentary was along the lines of
"well, the MDTS is large enough that we should be able to transfer
more than 4k at a time, but we're currently crashing. And nvme-cli
always does it 4k at a time, and if we change to match it, the crash
goes away, so let's do that."

As to the allocation, smartctl calls into nvme with
nvme_admin_get_log_page and passes a buffer (that smartctl allocates)
of size n * sizeof(nvme_error_log_page), where n is the number of
error log entries it is trying to read. The fix in smartmontools moved
from trying to read all of the error log entries at once via a single
call to nvme_admin_get_log_page, to doing 4K bytes at a time.

Not sure how helpful any of that is; it's where my current understanding is at.

Thanks,
Nick



* Re: smartctl "kills" specific drives on kernel 5.13 but works fine on 5.15 - why?
  2022-09-15 20:48     ` Nick Neumann
@ 2022-09-15 22:24       ` Keith Busch
  0 siblings, 0 replies; 5+ messages in thread
From: Keith Busch @ 2022-09-15 22:24 UTC (permalink / raw)
  To: Nick Neumann; +Cc: y, linux-nvme

On Thu, Sep 15, 2022 at 03:48:01PM -0500, Nick Neumann wrote:
> On Thu, Sep 15, 2022 at 3:03 PM Keith Busch <kbusch@kernel.org> wrote:
> > Not sure what MDTS has to do with this. The error log was originally defined to
> > be a max 4k size which is below the smallest possible MDTS.
> >
> > My guess is smartctl tricked the driver into allocating a PRP List, but the
> > controller instead accessed it as a PRP entry, which could corrupt memory or
> > fail the transaction if data direction is enforced by the memory controller.
> > Why that causes the nvme controller to fail as you've described is weird,
> > though.
> 
> I definitely don't know this stuff very well - the smartctl bug
> commentary was referencing the nvme-cli commit where log pages are
> transferred in 4k chunks to avoid having to worry about exceeding the
> MDTS value. The problematic drives have error logs larger than 4K.
> 
> I believe the logic in the smartctl commentary was along the lines of
> "well, the MDTS is large enough that we should be able to transfer
> more than 4k at a time, but we're currently crashing. And nvme-cli
> does it 4k at a time always, and if we change to that, the crash goes
> away, so let's do that."
> 
> As to the allocation, smartctl calls into nvme with
> nvme_admin_get_log_page and passes a buffer (that smartctl allocates)
> of size n * sizeof(nvme_error_log_page), where n is the number of
> error log entries it is trying to read. The fix in smartmontools moved
> from trying to read all of the error log entries at once via a single
> call to nvme_admin_get_log_page, to doing 4K bytes at a time.
> 
> Not sure how helpful any of that is; it's where my current understanding is at.

If 'n' is > 64, then that would tell the driver to allocate a PRP list, and I
am pretty sure based on the observations that the drive believes the address is
a PRP entry. That doesn't readily explain why your SSD became unresponsive
immediately after dispatching the command, but the drive definitely sounds
broken.



end of thread, other threads:[~2022-09-15 22:24 UTC | newest]
