* My Western Digital SN850 appears to suffer from deep power state issues - considering submitting quirk patch.
@ 2022-05-15 16:00 Marcos Scriven
  2022-05-15 19:44 ` Keith Busch
From: Marcos Scriven @ 2022-05-15 16:00 UTC (permalink / raw)
  To: linux-nvme

Hi all

I've been experiencing issues with my system freezing, and tracked it down to the NVMe controller resetting:

[268690.209099] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[268690.289109] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[268690.289234] nvme nvme0: Removing after probe failure status: -19
[268690.313116] nvme0n1: detected capacity change from 1953525168 to 0
[268690.313116] blk_update_request: I/O error, dev nvme0n1, sector 119170336 op 0x1:(WRITE) flags 0x800 phys_seg 14 prio class 0
[268690.313117] blk_update_request: I/O error, dev nvme0n1, sector 293367304 op 0x1:(WRITE) flags 0x8800 phys_seg 5 prio class 0
[268690.313118] blk_update_request: I/O error, dev nvme0n1, sector 1886015680 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0

Only a reboot resolves this.

The vendor/product id:

01:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Black SN850 [15b7:5011] (rev 01)

This is installed in a desktop machine (the details of which I can give if relevant). I mention this only because a desktop's power profile is much less frugal than a laptop's.

There are several bug reports about this on various distros; to give just a couple:

https://bugzilla.kernel.org/show_bug.cgi?id=195039
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1705748

It's also clearly a known quirk. Here are a couple of examples:

https://github.com/torvalds/linux/commit/dc22c1c058b5c4fe967a20589e36f029ee42a706
https://github.com/torvalds/linux/commit/538e4a8c571efdf131834431e0c14808bcfb1004

On my particular system there seem to be two sources of power state information.

One from nvme id-ctrl /dev/nvme0:

ps    0 : mp:9.00W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:9.00W
ps    1 : mp:4.10W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:4.10W
ps    2 : mp:3.50W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:0.6300W active_power:3.50W
ps    3 : mp:0.0250W non-operational enlat:5000 exlat:10000 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:0.0250W active_power:-
ps    4 : mp:0.0050W non-operational enlat:5000 exlat:45000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:0.0050W active_power:-

The other from nvme get-feature -f 0x0c -H /dev/nvme0:

get-feature:0xc (Autonomous Power State Transition), Current value:0x000001
	Autonomous Power State Transition Enable (APSTE): Enabled
	Auto PST Entries	.................
	Entry[ 0]
	.................
	Idle Time Prior to Transition (ITPT): 750 ms
	Idle Transition Power State   (ITPS): 3
	.................
	Entry[ 1]
	.................
	Idle Time Prior to Transition (ITPT): 750 ms
	Idle Transition Power State   (ITPS): 3
	.................
	Entry[ 2]
	.................
	Idle Time Prior to Transition (ITPT): 750 ms
	Idle Transition Power State   (ITPS): 3
	.................
	Entry[ 3]
	.................
	Idle Time Prior to Transition (ITPT): 2500 ms
	Idle Transition Power State   (ITPS): 4
	.................

I don't really know how those two interact or relate, if at all; the timings don't seem to match up. My reading (which may well be wrong) is that id-ctrl describes the power states themselves, including their entry/exit latencies, while feature 0x0c is the autonomous transition table: how long the drive must sit idle (ITPT) before it drops to a given state (ITPS). If so, the two sets of numbers measure different things.

Anyway, with all that background, I'm happy to try NVME_QUIRK_NO_DEEPEST_PS for 15b7:5011 locally, and submit here if it works.
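
For concreteness, this is roughly the change I'd test, modeled on the two
commits above (exact placement within nvme_id_table is a guess until I check
against the current tree):

--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ static const struct pci_device_id nvme_id_table[] = {
+	{ PCI_DEVICE(0x15b7, 0x5011),	/* WD Black SN850 */
+		.driver_data = NVME_QUIRK_NO_DEEPEST_PS, },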

However, the main problem is how to reproduce this issue reliably/deterministically, in order to be confident in the patch. At the moment it can happen within minutes, or not for days.

So, my questions:

1) How can I reproduce the issue deterministically?
2) Are there any other causes of this I'd need to rule out? E.g. BIOS, PSU, or a broken drive rather than a power state quirk.

I also have a couple of more fundamental questions, the answers to which are probably way beyond my understanding:

3) Why do so many drives need this quirk in Linux? Could it be that Windows simply avoids these power states?
4) I looked at the code around that message, and it seems this is an attempt to reset the controller rather than just accepting that an operation timed out. Is that correct? And if so, could there be a problem with the way resetting works, or is it again a quirk of these NVMe drives?
5) On that note, some Googling turned up this patch, which I believe was rejected: https://patchwork.kernel.org/project/linux-block/patch/20180516040313.13596-12-ming.lei@redhat.com/. I'm unclear on the details, but it felt like it might be relevant.

Thanks to the maintainers of the NVMe drivers.

Marcos




* Re: My Western Digital SN850 appears to suffer from deep power state issues - considering submitting quirk patch.
  2022-05-15 16:00 My Western Digital SN850 appears to suffer from deep power state issues - considering submitting quirk patch Marcos Scriven
@ 2022-05-15 19:44 ` Keith Busch
  2022-05-15 20:13   ` Keith Busch
  2022-05-16 17:58   ` Christoph Hellwig
From: Keith Busch @ 2022-05-15 19:44 UTC (permalink / raw)
  To: Marcos Scriven; +Cc: linux-nvme

On Sun, May 15, 2022 at 05:00:44PM +0100, Marcos Scriven wrote:
> Hi all
> 
> I've been experiencing issues with my system freezing, and tracked it down to the NVMe controller resetting:
> 
> [268690.209099] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> [268690.289109] nvme 0000:01:00.0: enabling device (0000 -> 0002)
> [268690.289234] nvme nvme0: Removing after probe failure status: -19
> [268690.313116] nvme0n1: detected capacity change from 1953525168 to 0
> [268690.313116] blk_update_request: I/O error, dev nvme0n1, sector 119170336 op 0x1:(WRITE) flags 0x800 phys_seg 14 prio class 0
> [268690.313117] blk_update_request: I/O error, dev nvme0n1, sector 293367304 op 0x1:(WRITE) flags 0x8800 phys_seg 5 prio class 0
> [268690.313118] blk_update_request: I/O error, dev nvme0n1, sector 1886015680 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> 
> Only a reboot resolves this.
> 
> The vendor/product id:
> 
> 01:00.0 Non-Volatile memory controller [0108]: Sandisk Corp WD Black SN850 [15b7:5011] (rev 01)
> 
> This is installed in a desktop machine (the details of which I can give if relevant). I mention this only because a desktop's power profile is much less frugal than a laptop's.

Some of the behavior you're describing has been isolated to specific
drive+platform combinations in the past, but let's hear your results from the
follow-up experiments before considering whether we need to introduce a
DMI-type quirk.
 
> Anyway, with all that background, I'm happy to try NVME_QUIRK_NO_DEEPEST_PS for 15b7:5011 locally, and submit here if it works.

I think that's worth trying. Alternatively, you could just adjust the module
parameter 'nvme_core.default_ps_max_latency_us' and see whether only the
deepest states are problematic, or any low power state at all.
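
Roughly speaking, nvme_configure_apst() skips any non-operational state whose
exit latency exceeds that value, so for the latencies your drive reports
(arithmetic worth double-checking):

  nvme_core.default_ps_max_latency_us=15000   # PS3 exlat=10000us allowed, PS4 exlat=45000us excluded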
 
> However, the main problem is how to reproduce this issue reliably/deterministically, in order to be confident in the patch. At the moment it can happen within minutes, or not for days.
>
> So, my questions:
> 
> 1) How can I reproduce the issue deterministically?

Unfortunately I really don't know. I have no hands-on experience with these
kinds of systems.

> 2) Are there any other causes of this I'd need to rule out? E.g. BIOS, PSU, or a broken drive rather than a power state quirk.

PCIe ASPM has occasionally been a problem, so you could try disabling that too
(pcie_aspm=off).
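
To confirm the setting took effect, the link's ASPM capability and current
state are visible in lspci, e.g.:

  lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkCtl'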
 
> I also have a couple of more fundamental questions, the answers to which are probably way beyond my understanding:
> 
> 3) Why do so many drives need this quirk in Linux? Could it be that Windows simply avoids these power states?

Many client vendors don't prioritize Linux for their IOP testing, so we tend to
be the last to find out about issues.

> 4) I looked at the code around that message, and it seems this is an attempt to reset the controller rather than just accepting that an operation timed out. Is that correct? And if so, could there be a problem with the way resetting works, or is it again a quirk of these NVMe drives?

We used to have a health check thread periodically query the link status and
preemptively initiate a reset if it detects a problem outside any IO context.
That query defeated desired low power settings so we removed it. Now we only
check the link status if an IO times out, which is why the resetting message
appears in that context.

I don't think there's any particular issue with the way the driver reacts to
the condition. When you see an all f's response, that really indicates the link
is inaccessible. There's nothing we can do at the nvme driver level to
communicate with the device downstream that link, so no operations will ever
succeed. Once we're in this state, the nvme reset operation is almost certainly
doomed to fail since we can't communicate with the end device.

There might be additional things we could do at the PCIe level, like a slot
reset on the downstream port, but I haven't seen evidence that type of
escalation improves anything so far. It might be worth a shot, though.

> 5) On that note, some Googling turned up this patch, which I believe was rejected:
> https://patchwork.kernel.org/project/linux-block/patch/20180516040313.13596-12-ming.lei@redhat.com/.
> I'm unclear on the details, but it felt like it might be relevant.

That just changes the context where actual reset happens, but still uses the
same trigger to initiate the reset. I don't think that would help in your
situation since the link was down before an IO timed out.



* Re: My Western Digital SN850 appears to suffer from deep power state issues - considering submitting quirk patch.
  2022-05-15 19:44 ` Keith Busch
@ 2022-05-15 20:13   ` Keith Busch
  2022-05-16 17:58   ` Christoph Hellwig
From: Keith Busch @ 2022-05-15 20:13 UTC (permalink / raw)
  To: Marcos Scriven; +Cc: linux-nvme

On Sun, May 15, 2022 at 01:44:32PM -0600, Keith Busch wrote:
> There might be additional things we could do at the PCIe level, like a slot
> reset on the downstream port, but I haven't seen evidence that type of
> escalation improves anything so far. It might be worth a shot, though.

Something like this:

---
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1392,6 +1392,15 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	if (nvme_should_reset(dev, csts)) {
 		nvme_warn_reset(dev, csts);
 		nvme_dev_disable(dev, false);
+
+		if (csts == 0xffffffff) {
+			struct pci_dev *parent = to_pci_dev(dev->dev)->bus->self;
+
+			dev_warn(dev->ctrl.device, "link is inaccessible, attempt reset bus:%x\n",
+				parent->subordinate->number);
+			pci_reset_bus(parent);
+		}
+
 		nvme_reset_ctrl(&dev->ctrl);
 		return BLK_EH_DONE;
 	}
--



* Re: My Western Digital SN850 appears to suffer from deep power state issues - considering submitting quirk patch.
  2022-05-15 19:44 ` Keith Busch
  2022-05-15 20:13   ` Keith Busch
@ 2022-05-16 17:58   ` Christoph Hellwig
  2022-05-19 10:14     ` Marcos Scriven
From: Christoph Hellwig @ 2022-05-16 17:58 UTC (permalink / raw)
  To: Keith Busch; +Cc: Marcos Scriven, linux-nvme

On Sun, May 15, 2022 at 01:44:32PM -0600, Keith Busch wrote:
> Some of the behavior you're describing has been isolated to specific
> drive+platform combinations in the past, but let's hear your results from the
> follow-up experiments before considering whether we need to introduce a
> DMI-type quirk.

Also, are we even sure this is related to power states?

>  
> > Anyway, with all that background, I'm happy to try NVME_QUIRK_NO_DEEPEST_PS for 15b7:5011 locally, and submit here if it works.
> 
> I think that's worth trying. Alternatively, you could just adjust the module
> parameter 'nvme_core.default_ps_max_latency_us' and see whether only the
> deepest states are problematic, or any low power state at all.

Maybe even just nvme_core.default_ps_max_latency_us=0 to verify it
really is power state related as a start.



* Re: My Western Digital SN850 appears to suffer from deep power state issues - considering submitting quirk patch.
  2022-05-16 17:58   ` Christoph Hellwig
@ 2022-05-19 10:14     ` Marcos Scriven
  2022-05-19 20:02       ` Keith Busch
From: Marcos Scriven @ 2022-05-19 10:14 UTC (permalink / raw)
  To: linux-nvme

Thank you for your help, Keith and Christoph. I've been doing some more investigations.

On Mon, 16 May 2022, at 18:58, Christoph Hellwig wrote:
> On Sun, May 15, 2022 at 01:44:32PM -0600, Keith Busch wrote:
> > Some of the behavior you're describing has been isolated to specific
> > drive+platform combinations in the past, but let's hear your results from the
> > follow-up experiments before considering whether we need to introduce a
> > DMI-type quirk.
> 
> Also, are we even sure this is related to power states?
> 

That's a good question; I think I assumed it because every post I found mentioning this error was about changing latency settings.

> >  
> > > Anyway, with all that background, I'm happy to try NVME_QUIRK_NO_DEEPEST_PS for 15b7:5011 locally, and submit here if it works.
> > 
> > I think that's worth trying. Alternatively, you could just adjust the module
> > parameter 'nvme_core.default_ps_max_latency_us' and see whether only the
> > deepest states are problematic, or any low power state at all.
> 
> Maybe even just nvme_core.default_ps_max_latency_us=0 to verify it
> really is power state related as a start.
> 

I have now tried both of these options (separately and together), and the issue still occurs. I confirmed the settings on the kernel command line:

cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.13.19-6-pve root=/dev/mapper/pve-root ro quiet video=efifb:off acpi_enforce_resources=lax nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Does this mean it's definitively not a power state issue?

The slightly positive news is that I now have a fairly dependable way of reproducing the issue. I've described it in the Proxmox forums (https://forum.proxmox.com/threads/what-processes-resources-are-used-while-doing-a-vm-backup-in-stop-mode.109779/), but in short: backing up a container (from the affected drive to an unaffected one) has about a 30% chance of taking the drive offline. I suppose the fact that it happens right while the disk is being read is another indicator it's not about dropping into low power states (autonomous or otherwise).

I tried running strace on the process to see whether it does anything obviously different while failing versus when it completes without error. I couldn't see anything obvious.

Another thing I tried was a raw dd read from the affected disk to /dev/null, to see if intensive reading alone triggers this, and it did not. During that the controller temp maxes out at 63C. nvme smart-log doesn't show any critical warnings.
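
For reference, this was roughly the command:

  dd if=/dev/nvme0n1 of=/dev/null bs=1M status=progress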

I'm wondering if there's any low-level debugging or BIOS setting that would help me identify whether it's a hardware issue (PSU, SSD, motherboard).

As a complete aside on mailing list protocol: my understanding is plain text only, with 'inline' reply style. Responses come both from the list and directly from the people replying; should I "reply all" or reply just to the list? For now, I've gone for the latter.



* Re: My Western Digital SN850 appears to suffer from deep power state issues - considering submitting quirk patch.
  2022-05-19 10:14     ` Marcos Scriven
@ 2022-05-19 20:02       ` Keith Busch
From: Keith Busch @ 2022-05-19 20:02 UTC (permalink / raw)
  To: Marcos Scriven; +Cc: linux-nvme

On Thu, May 19, 2022 at 11:14:41AM +0100, Marcos Scriven wrote:
> 
> I have now tried both of these options (separately and together), and the issue still occurs. I confirmed the settings on the kernel command line:
> 
> cat /proc/cmdline
> BOOT_IMAGE=/boot/vmlinuz-5.13.19-6-pve root=/dev/mapper/pve-root ro quiet video=efifb:off acpi_enforce_resources=lax nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
> 
> Does this mean it's definitively not a power state issue?

If you're still seeing this same all f's failure even with these settings, I
think it rules out the autonomous power settings provided by NVMe and PCIe.
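
As a quick sanity check, with the max latency forced to 0 the APST feature
should now read back as disabled:

  nvme get-feature -f 0x0c -H /dev/nvme0   # expect APSTE: Disabled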

It doesn't necessarily rule out potentially platform specific power
capabilities, but that'd be well outside my view of the nvme driver stack, and
I don't have any ideas off the top of my head on what to even check.
 
> The slightly positive news is that I now have a fairly dependable way of reproducing the issue. I've described it in the Proxmox forums (https://forum.proxmox.com/threads/what-processes-resources-are-used-while-doing-a-vm-backup-in-stop-mode.109779/), but in short: backing up a container (from the affected drive to an unaffected one) has about a 30% chance of taking the drive offline. I suppose the fact that it happens right while the disk is being read is another indicator it's not about dropping into low power states (autonomous or otherwise).
> 
> I tried running strace on the process to see whether it does anything obviously different while failing versus when it completes without error. I couldn't see anything obvious.
> 
> Another thing I tried was a raw dd read from the affected disk to /dev/null, to see if intensive reading alone triggers this, and it did not. During that the controller temp maxes out at 63C. nvme smart-log doesn't show any critical warnings.

63C, not great, not terrible.

'dd' is a nice tool, but you may be able to push the drive harder with 'fio',
assuming an intense read workload is what triggers the failure. Just a quick
example:

  # fio --name=global --filename=/dev/nvme0n1 --bs=64k --direct=1 --ioengine=libaio --rw=randread --iodepth=32 --numjobs=8 --name=test
 
> I'm wondering if there's any low-level debugging or BIOS setting that would help me identify whether it's a hardware issue (PSU, SSD, motherboard).

It does sound hardware-related, but I'm not aware of any reasonable tools or
methods to debug it. Right now, I can only recommend verifying you've got the
latest vendor-provided firmware installed.
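
The active firmware revision shows up in the firmware log page, e.g.:

  nvme fw-log /dev/nvme0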

> As a complete aside on mailing list protocol: my understanding is plain text only, with 'inline' reply style. Responses come both from the list and directly from the people replying; should I "reply all" or reply just to the list? For now, I've gone for the latter.

The mailing list only accepts plain text. Top posting is generally frowned
upon. A reply-all is fine. Wrapping columns at 80 characters helps readability.


