linux-kernel.vger.kernel.org archive mirror
* [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
@ 2019-05-08 18:59 Kai-Heng Feng
  2019-05-08 19:15 ` Chaitanya Kulkarni
  2019-05-08 19:16 ` Keith Busch
  0 siblings, 2 replies; 38+ messages in thread
From: Kai-Heng Feng @ 2019-05-08 18:59 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi
  Cc: linux-nvme, linux-kernel, Kai-Heng Feng, Mario Limonciello

Several NVMe drives consume a lot of power during Suspend-to-Idle.

According to the NVMe vendors, APST should be used and there are two
things that should be prevented:
- Controller shutdown (CC.SHN)
- PCI D3

I think that's because the Windows Modern Standby design document [1]
states below as a requirement:
"
Akin to AHCI PCIe SSDs, NVMe SSDs need to provide the host with a
non-operational power state that is comparable to DEVSLP (<5mW draw,
<100ms exit latency) in order to allow the host to perform appropriate
transitions into Modern Standby. Should the NVMe SSD not expose such a
non-operational power state, autonomous power state transitions (APST)
is the only other option to enter Modern Standby successfully.
"

D3 wasn't mentioned at all. While for some vendors D3 still puts the
device into a low power state, others disable APST after the transition
to D3.

So instead of disabling the NVMe controller in the suspend callback, do
the following:
- Make sure all IRQs are quiesced
- Stop all queues
- Prevent the device entering D3
- Use memory barrier to make sure DMA operations are flushed on HMB devices

This patch has been verified on several different NVMes:
- Intel
- Hynix
- LiteOn
- Micron
- WDC
- Samsung
- Toshiba

With the new suspend routine, the laptop I tested uses roughly 0.8W
under Suspend-to-Idle, and resume is faster than with the original
D3 scheme.

The only exception so far is the Toshiba XG5, which works with the old
suspend routine but not the new one, so introduce a new quirk for the XG5.
We are working with Toshiba to root-cause the issue.

[1] https://docs.microsoft.com/en-us/windows-hardware/design/device-experiences/part-selection#ssd-storage

Tested-by: Lucien Kao <Lucien_kao@compal.com>
Tested-by: Mice Lin <mice_lin@wistron.com>
Tested-by: Jason Tan <LiangChuan.Tan@wdc.com>
Tested-by: Cody Liu (codyliu) <codyliu@micron.com>
Tested-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Mario Limonciello <mario.limonciello@dell.com>
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/nvme/host/core.c |   8 +++
 drivers/nvme/host/nvme.h |   8 +++
 drivers/nvme/host/pci.c  | 102 +++++++++++++++++++++++++++++++++++++--
 3 files changed, 115 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a6644a2c3ef7..929db749da98 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3903,6 +3903,14 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_directive_cmd) != 64);
 }
 
+void nvme_enter_deepest(struct nvme_ctrl *ctrl)
+{
+	int ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, ctrl->npss,
+				    NULL, 0, NULL);
+	if (ret)
+		dev_warn(ctrl->device, "failed to set deepest non-operational state (%d)\n", ret);
+}
+EXPORT_SYMBOL_GPL(nvme_enter_deepest);
 
 static int __init nvme_core_init(void)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 5ee75b5ff83f..ea40a83314ae 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -92,6 +92,11 @@ enum nvme_quirks {
 	 * Broken Write Zeroes.
 	 */
 	NVME_QUIRK_DISABLE_WRITE_ZEROES		= (1 << 9),
+
+	/*
+	 * Use D3 on S2I regardless of NPSS.
+	 */
+	NVME_QUIRK_USE_D3_ON_S2I		= (1 << 10),
 };
 
 /*
@@ -229,6 +234,7 @@ struct nvme_ctrl {
 	/* Power saving configuration */
 	u64 ps_max_latency_us;
 	bool apst_enabled;
+	bool suspend_to_idle;
 
 	/* PCIe only: */
 	u32 hmpre;
@@ -467,6 +473,8 @@ int nvme_delete_ctrl(struct nvme_ctrl *ctrl);
 int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp,
 		void *log, size_t size, u64 offset);
 
+void nvme_enter_deepest(struct nvme_ctrl *ctrl);
+
 extern const struct attribute_group *nvme_ns_id_attr_groups[];
 extern const struct block_device_operations nvme_ns_head_ops;
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3e4fb891a95a..dea42b41f9a8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -23,6 +23,7 @@
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/sed-opal.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/suspend.h>
 
 #include "trace.h"
 #include "nvme.h"
@@ -2828,12 +2829,98 @@ static void nvme_remove(struct pci_dev *pdev)
 }
 
 #ifdef CONFIG_PM_SLEEP
+static void nvme_do_suspend_to_idle(struct pci_dev *pdev)
+{
+	struct nvme_dev *ndev = pci_get_drvdata(pdev);
+
+	pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
+	ndev->ctrl.suspend_to_idle = true;
+
+	nvme_start_freeze(&ndev->ctrl);
+	nvme_wait_freeze_timeout(&ndev->ctrl, NVME_IO_TIMEOUT);
+	nvme_stop_queues(&ndev->ctrl);
+
+	nvme_enter_deepest(&ndev->ctrl);
+
+	if (ndev->ctrl.queue_count > 0) {
+		nvme_disable_io_queues(ndev);
+		nvme_poll_irqdisable(&ndev->queues[0], -1);
+	}
+
+	nvme_suspend_io_queues(ndev);
+	nvme_suspend_queue(&ndev->queues[0]);
+	pci_free_irq_vectors(pdev);
+
+	blk_mq_tagset_busy_iter(&ndev->tagset, nvme_cancel_request, &ndev->ctrl);
+	blk_mq_tagset_busy_iter(&ndev->admin_tagset, nvme_cancel_request, &ndev->ctrl);
+
+	/* Make sure all HMB DMA operations are done */
+	mb();
+}
+
+static int nvme_do_resume_from_idle(struct pci_dev *pdev)
+{
+	struct nvme_dev *ndev = pci_get_drvdata(pdev);
+	int result;
+
+	pdev->dev_flags &= ~PCI_DEV_FLAGS_NO_D3;
+	ndev->ctrl.suspend_to_idle = false;
+
+	result = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
+	if (result < 0)
+		goto out;
+
+	result = nvme_pci_configure_admin_queue(ndev);
+	if (result)
+		goto out;
+
+	result = nvme_alloc_admin_tags(ndev);
+	if (result)
+		goto out;
+
+	ndev->ctrl.max_hw_sectors = NVME_MAX_KB_SZ << 1;
+	ndev->ctrl.max_segments = NVME_MAX_SEGS;
+	mutex_unlock(&ndev->shutdown_lock);
+
+	result = nvme_init_identify(&ndev->ctrl);
+	if (result)
+		goto out;
+
+	result = nvme_setup_io_queues(ndev);
+	if (result)
+		goto out;
+
+	if (ndev->online_queues < 2) {
+		dev_warn(ndev->ctrl.device, "IO queues not created\n");
+		nvme_kill_queues(&ndev->ctrl);
+		nvme_remove_namespaces(&ndev->ctrl);
+	} else {
+		nvme_start_queues(&ndev->ctrl);
+		nvme_wait_freeze(&ndev->ctrl);
+		nvme_dev_add(ndev);
+		nvme_unfreeze(&ndev->ctrl);
+	}
+
+	nvme_start_ctrl(&ndev->ctrl);
+
+	return 0;
+
+ out:
+	nvme_remove_dead_ctrl(ndev, result);
+	return result;
+}
+
 static int nvme_suspend(struct device *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
-	nvme_dev_disable(ndev, true);
+	if (IS_ENABLED(CONFIG_ACPI) && pm_suspend_via_s2idle() &&
+	    ndev->ctrl.npss && !(ndev->ctrl.quirks & NVME_QUIRK_USE_D3_ON_S2I))
+		nvme_do_suspend_to_idle(pdev);
+	else
+		nvme_dev_disable(ndev, true);
+
 	return 0;
 }
 
@@ -2841,9 +2928,16 @@ static int nvme_resume(struct device *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
+	int result = 0;
 
-	nvme_reset_ctrl(&ndev->ctrl);
-	return 0;
+	if (ndev->ctrl.suspend_to_idle) {
+		result = nvme_do_resume_from_idle(pdev);
+		if (result)
+			dev_warn(dev, "failed to resume from idle state (%d)\n", result);
+	} else
+		nvme_reset_ctrl(&ndev->ctrl);
+
+	return result;
 }
 #endif
 
@@ -2921,6 +3015,8 @@ static const struct pci_device_id nvme_id_table[] = {
 	{ PCI_VDEVICE(INTEL, 0x5845),	/* Qemu emulated controller */
 		.driver_data = NVME_QUIRK_IDENTIFY_CNS |
 				NVME_QUIRK_DISABLE_WRITE_ZEROES, },
+	{ PCI_DEVICE(0x1179, 0x0116),	/* Toshiba XG5 */
+		.driver_data = NVME_QUIRK_USE_D3_ON_S2I, },
 	{ PCI_DEVICE(0x1bb1, 0x0100),   /* Seagate Nytro Flash Storage */
 		.driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
 	{ PCI_DEVICE(0x1c58, 0x0003),	/* HGST adapter */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-08 18:59 [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle Kai-Heng Feng
@ 2019-05-08 19:15 ` Chaitanya Kulkarni
  2019-05-08 19:16 ` Keith Busch
  1 sibling, 0 replies; 38+ messages in thread
From: Chaitanya Kulkarni @ 2019-05-08 19:15 UTC (permalink / raw)
  To: Kai-Heng Feng, keith.busch, axboe, hch, sagi
  Cc: linux-kernel, linux-nvme, Mario Limonciello

Did you get a chance to run this patch through checkpatch.pl ?

On 05/08/2019 12:00 PM, Kai-Heng Feng wrote:
> [...]



* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-08 18:59 [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle Kai-Heng Feng
  2019-05-08 19:15 ` Chaitanya Kulkarni
@ 2019-05-08 19:16 ` Keith Busch
  2019-05-08 19:30   ` Kai-Heng Feng
  1 sibling, 1 reply; 38+ messages in thread
From: Keith Busch @ 2019-05-08 19:16 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: keith.busch, axboe, hch, sagi, linux-nvme, linux-kernel,
	Mario Limonciello

On Thu, May 09, 2019 at 02:59:55AM +0800, Kai-Heng Feng wrote:
> +static int nvme_do_resume_from_idle(struct pci_dev *pdev)
> +{
> +	struct nvme_dev *ndev = pci_get_drvdata(pdev);
> +	int result;
> +
> +	pdev->dev_flags &= ~PCI_DEV_FLAGS_NO_D3;
> +	ndev->ctrl.suspend_to_idle = false;
> +
> +	result = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
> +	if (result < 0)
> +		goto out;
> +
> +	result = nvme_pci_configure_admin_queue(ndev);
> +	if (result)
> +		goto out;
> +
> +	result = nvme_alloc_admin_tags(ndev);
> +	if (result)
> +		goto out;
> +
> +	ndev->ctrl.max_hw_sectors = NVME_MAX_KB_SZ << 1;
> +	ndev->ctrl.max_segments = NVME_MAX_SEGS;
> +	mutex_unlock(&ndev->shutdown_lock);

This lock was never locked.

But I think these special suspend/resume routines are too similar to the
existing ones, they should just incorporate this feature if we need to
do this.


* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-08 19:16 ` Keith Busch
@ 2019-05-08 19:30   ` Kai-Heng Feng
  2019-05-08 19:38     ` Mario.Limonciello
  0 siblings, 1 reply; 38+ messages in thread
From: Kai-Heng Feng @ 2019-05-08 19:30 UTC (permalink / raw)
  To: Keith Busch
  Cc: Keith Busch, axboe, hch, sagi, linux-nvme, linux-kernel,
	Mario Limonciello

at 03:16, Keith Busch <kbusch@kernel.org> wrote:

> On Thu, May 09, 2019 at 02:59:55AM +0800, Kai-Heng Feng wrote:
>> +static int nvme_do_resume_from_idle(struct pci_dev *pdev)
>> +{
>> +	struct nvme_dev *ndev = pci_get_drvdata(pdev);
>> +	int result;
>> +
>> +	pdev->dev_flags &= ~PCI_DEV_FLAGS_NO_D3;
>> +	ndev->ctrl.suspend_to_idle = false;
>> +
>> +	result = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
>> +	if (result < 0)
>> +		goto out;
>> +
>> +	result = nvme_pci_configure_admin_queue(ndev);
>> +	if (result)
>> +		goto out;
>> +
>> +	result = nvme_alloc_admin_tags(ndev);
>> +	if (result)
>> +		goto out;
>> +
>> +	ndev->ctrl.max_hw_sectors = NVME_MAX_KB_SZ << 1;
>> +	ndev->ctrl.max_segments = NVME_MAX_SEGS;
>> +	mutex_unlock(&ndev->shutdown_lock);
>
> This lock was never locked.

Yea, will fix this.

>
> But I think these special suspend/resume routines are too similar to the
> existing ones, they should just incorporate this feature if we need to
> do this.

That’s my original direction, but this requires a new bool to  
nvme_dev_disable(), so it becomes
static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool  
suspend_to_idle)

And it starts to get messy.

Kai-Heng


* RE: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-08 19:30   ` Kai-Heng Feng
@ 2019-05-08 19:38     ` Mario.Limonciello
  2019-05-08 19:51       ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Mario.Limonciello @ 2019-05-08 19:38 UTC (permalink / raw)
  To: kai.heng.feng, kbusch
  Cc: keith.busch, axboe, hch, sagi, linux-nvme, linux-kernel

> On Wednesday, May 8, 2019 2:30 PM, Kai-Heng Feng <kai.heng.feng@canonical.com> wrote:
> 
> at 03:16, Keith Busch <kbusch@kernel.org> wrote:
> 
> > On Thu, May 09, 2019 at 02:59:55AM +0800, Kai-Heng Feng wrote:
> >> +static int nvme_do_resume_from_idle(struct pci_dev *pdev)
> >> +{
> >> +	struct nvme_dev *ndev = pci_get_drvdata(pdev);
> >> +	int result;
> >> +
> >> +	pdev->dev_flags &= ~PCI_DEV_FLAGS_NO_D3;
> >> +	ndev->ctrl.suspend_to_idle = false;
> >> +
> >> +	result = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
> >> +	if (result < 0)
> >> +		goto out;
> >> +
> >> +	result = nvme_pci_configure_admin_queue(ndev);
> >> +	if (result)
> >> +		goto out;
> >> +
> >> +	result = nvme_alloc_admin_tags(ndev);
> >> +	if (result)
> >> +		goto out;
> >> +
> >> +	ndev->ctrl.max_hw_sectors = NVME_MAX_KB_SZ << 1;
> >> +	ndev->ctrl.max_segments = NVME_MAX_SEGS;
> >> +	mutex_unlock(&ndev->shutdown_lock);
> >
> > This lock was never locked.
> 
> Yea, will fix this.
> 
> >
> > But I think these special suspend/resume routines are too similar to the
> > existing ones, they should just incorporate this feature if we need to
> > do this.
> 
> That’s my original direction, but this requires a new bool to
> nvme_dev_disable(), so it becomes
> static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown, bool
> suspend_to_idle)
> 
> And it starts to get messy.
> 

The existing routines have an implied assumption that firmware will come swinging
with a hammer to control the rails the SSD sits on.
With S2I everything needs to come from the driver side and it really is a
different paradigm.


* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-08 19:38     ` Mario.Limonciello
@ 2019-05-08 19:51       ` Christoph Hellwig
  2019-05-08 20:28         ` Mario.Limonciello
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2019-05-08 19:51 UTC (permalink / raw)
  To: Mario.Limonciello
  Cc: kai.heng.feng, kbusch, keith.busch, axboe, hch, sagi, linux-nvme,
	linux-kernel

On Wed, May 08, 2019 at 07:38:50PM +0000, Mario.Limonciello@dell.com wrote:
> The existing routines have an implied assumption that firmware will come swinging
> with a hammer to control the rails the SSD sits on.
> With S2I everything needs to come from the driver side and it really is a
> different paradigm.

And that is why this patch is fundamentally broken.

When using the simple pm ops suspend the pm core expects the device
to be powered off.  If fancy suspend doesn't want that we need to
communicate what to do to the device in another way, as the whole
thing is a platform decision.  There probably are one (or five) methods
in dev_pm_ops that do the right thing, but please coordinate this
with the PM maintainers to make sure it does the right thing and
doesn't for example break either hibernate where we really don't
expect just a lower power state, or enterprise class NVMe devices
that don't do APST and don't really do different power states at
all in many cases.


* RE: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-08 19:51       ` Christoph Hellwig
@ 2019-05-08 20:28         ` Mario.Limonciello
  2019-05-09  6:12           ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Mario.Limonciello @ 2019-05-08 20:28 UTC (permalink / raw)
  To: hch
  Cc: kai.heng.feng, kbusch, keith.busch, axboe, sagi, linux-nvme,
	linux-kernel

> On Wednesday, May 8, 2019 2:52 PM, Christoph Hellwig <hch@lst.de> wrote:
> 
> On Wed, May 08, 2019 at 07:38:50PM +0000, Mario.Limonciello@dell.com wrote:
> > The existing routines have an implied assumption that firmware will come
> swinging
> > with a hammer to control the rails the SSD sits on.
> > With S2I everything needs to come from the driver side and it really is a
> > different paradigm.
> 
> And that is why is this patch is fundamentally broken.
> 
> When using the simple pm ops suspend the pm core expects the device
> to be powered off.  If fancy suspend doesn't want that we need to
> communicate what to do to the device in another way, as the whole
> thing is a platform decision.  There probabl is one (or five) methods
> in dev_pm_ops that do the right thing, but please coordinate this
> with the PM maintainers to make sure it does the right thing and
> doesn't for example break either hibernate where we really don't
> expect just a lower power state, or 

You might think this would be adding runtime_suspend/runtime_resume
callbacks, but those also get called actually at runtime which is not
the goal here.  At runtime, these types of disks should rely on APST which
should calculate the appropriate latencies around the different power states.

This code path is only applicable in the suspend to idle state, which /does/
call suspend/resume functions associated with dev_pm_ops.  There isn't
a dedicated function in there for use only in suspend to idle, which is
why pm_suspend_via_s2idle() needs to get called.

SIMPLE_DEV_PM_OPS normally sets the same function for suspend and
freeze (hibernate), so to avoid any changes to the hibernate case it seems
to me that there needs to be a new nvme_freeze() that calls into the existing
nvme_dev_disable for the freeze pm op and nvme_thaw() that calls into the
existing nvme_reset_ctrl for the thaw pm op.

> enterprise class NVMe devices
> that don't do APST and don't really do different power states at
> all in many cases.

Enterprise class NVMe devices that don't do APST - do they typically
have a non-zero value for ndev->ctrl.npss?

If not, they wouldn't enter this new codepath even if the server entered into S2I.


* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-08 20:28         ` Mario.Limonciello
@ 2019-05-09  6:12           ` Christoph Hellwig
  2019-05-09  6:48             ` Kai-Heng Feng
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2019-05-09  6:12 UTC (permalink / raw)
  To: Mario.Limonciello
  Cc: hch, kai.heng.feng, kbusch, keith.busch, axboe, sagi, linux-nvme,
	linux-kernel

On Wed, May 08, 2019 at 08:28:30PM +0000, Mario.Limonciello@dell.com wrote:
> You might think this would be adding runtime_suspend/runtime_resume
> callbacks, but those also get called actually at runtime which is not
> the goal here.  At runtime, these types of disks should rely on APST which
> should calculate the appropriate latencies around the different power states.
> 
> This code path is only applicable in the suspend to idle state, which /does/
> call suspend/resume functions associated with dev_pm_ops.  There isn't
> a dedicated function in there for use only in suspend to idle, which is
> why pm_suspend_via_s2idle() needs to get called.

The problem is that it also gets called for others paths:

#ifdef CONFIG_PM_SLEEP
#define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
        .suspend = suspend_fn, \
	.resume = resume_fn, \
	.freeze = suspend_fn, \
	.thaw = resume_fn, \
	.poweroff = suspend_fn, \
	.restore = resume_fn,
#else
#define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn)
#endif

#define SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \
const struct dev_pm_ops name = { \
	SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
}

And at least for poweroff this new code seems completely wrong, even
for freeze it looks rather borderline.

And more to the points - if these "modern MS standby" systems are
becoming common, which it looks they are, we need support in the PM core
for those instead of working around the decisions in low-level drivers.

> SIMPLE_DEV_PM_OPS normally sets the same function for suspend and
> freeze (hibernate), so to avoid any changes to the hibernate case it seems
> to me that there needs to be a new nvme_freeze() that calls into the existing
> nvme_dev_disable for the freeze pm op and nvme_thaw() that calls into the
> existing nvme_reset_ctrl for the thaw pm op.

At least, yes.

> 
> > enterprise class NVMe devices
> > that don't do APST and don't really do different power states at
> > all in many cases.
> 
> Enterprise class NVMe devices that don't do APST - do they typically
> have a non-zero value for ndev->ctrl.npss?
> 
> If not, they wouldn't enter this new codepath even if the server entered into S2I.

No, devices that do set NPSS will have at least some power states
per definition, although they might not be too useful.  I suspect checking
APSTA might be safer, but if we don't want to rely on APST we should
check for a power state meeting the condition that the MS document
quoted in the original patch requires.


* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09  6:12           ` Christoph Hellwig
@ 2019-05-09  6:48             ` Kai-Heng Feng
  2019-05-09  6:52               ` Christoph Hellwig
  2019-05-09  9:07               ` Rafael J. Wysocki
  0 siblings, 2 replies; 38+ messages in thread
From: Kai-Heng Feng @ 2019-05-09  6:48 UTC (permalink / raw)
  To: Christoph Hellwig, rafael.j.wysocki
  Cc: Mario.Limonciello, Keith Busch, Keith Busch, Jens Axboe,
	Sagi Grimberg, linux-nvme, Linux PM, LKML

Cc Rafael and linux-pm

at 14:12, Christoph Hellwig <hch@lst.de> wrote:

> On Wed, May 08, 2019 at 08:28:30PM +0000, Mario.Limonciello@dell.com wrote:
>> You might think this would be adding runtime_suspend/runtime_resume
>> callbacks, but those also get called actually at runtime which is not
>> the goal here.  At runtime, these types of disks should rely on APST which
>> should calculate the appropriate latencies around the different power  
>> states.
>>
>> This code path is only applicable in the suspend to idle state, which  
>> /does/
>> call suspend/resume functions associated with dev_pm_ops.  There isn't
>> a dedicated function in there for use only in suspend to idle, which is
>> why pm_suspend_via_s2idle() needs to get called.
>
> The problem is that it also gets called for others paths:
>
> #ifdef CONFIG_PM_SLEEP
> #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
>         .suspend = suspend_fn, \
> 	.resume = resume_fn, \
> 	.freeze = suspend_fn, \
> 	.thaw = resume_fn, \
> 	.poweroff = suspend_fn, \
> 	.restore = resume_fn,
> #else
> else
> #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn)
> #endif
>
> #define SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \
> const struct dev_pm_ops name = { \
> 	SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
> }
>
> And at least for poweroff this new code seems completely wrong, even
> for freeze it looks rather borderline.

Not really; for hibernation pm_suspend_via_s2idle() evaluates to false,
so the old code path will be taken.

>
> And more to the point - if these "modern MS standby" systems are
> becoming common, which it looks they are, we need support in the PM core
> for those instead of working around the decisions in low-level drivers.

Rafael, what do you think about this?
Including this patch, there are five drivers that use  
pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3.
So I think maybe it’s time to introduce a new suspend callback for S2I?

>
>> SIMPLE_DEV_PM_OPS normally sets the same function for suspend and
>> freeze (hibernate), so to avoid any changes to the hibernate case it seems
>> to me that there needs to be a new nvme_freeze() that calls into the  
>> existing
>> nvme_dev_disable for the freeze pm op and nvme_thaw() that calls into the
>> existing nvme_reset_ctrl for the thaw pm op.
>
> At least, yes.

Hibernation should remain the same as stated above.

>
>>> enterprise class NVMe devices
>>> that don't do APST and don't really do different power states at
>>> all in many cases.
>>
>> Enterprise class NVMe devices that don't do APST - do they typically
>> have a non-zero value for ndev->ctrl.npss?
>>
>> If not, they wouldn't enter this new codepath even if the server entered  
>> into S2I.
>
> No, devices that do set NPSS will have at least some power states
> per definition, although they might not be too useful.  I suspect checking
> APSTA might be safer, but if we don't want to rely on APST we should
> check for a power state supporting the condition that the MS document
> quoted in the original document supports.

If Modern Standby or Connected Standby is not supported by servers, I don’t
think the design documents mean much here.
We should probably check whether the platform firmware really supports S2I
instead.

Kai-Heng


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09  6:48             ` Kai-Heng Feng
@ 2019-05-09  6:52               ` Christoph Hellwig
  2019-05-09  9:19                 ` Rafael J. Wysocki
  2019-05-09  9:07               ` Rafael J. Wysocki
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2019-05-09  6:52 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Christoph Hellwig, rafael.j.wysocki, Mario.Limonciello,
	Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme,
	Linux PM, LKML

On Thu, May 09, 2019 at 02:48:59PM +0800, Kai-Heng Feng wrote:
> Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so 
> the old code path will be taken.
>
>>
>> And more to the point - if these "modern MS standby" systems are
>> becoming common, which it looks they are, we need support in the PM core
>> for those instead of working around the decisions in low-level drivers.
>
> Rafael, what do you think about this?
> Including this patch, there are five drivers that use 
> pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3.
> So I think maybe it’s time to introduce a new suspend callback for S2I?

We also really need something like that to avoid the PCI_DEV_FLAGS_NO_D3
abuse - that flag is a quirk statically set on a device at probe time
to prevent any entering of D3 state.

>> per definition, although they might not be too useful.  I suspect checking
>> APSTA might be safer, but if we don't want to rely on APST we should
>> check for a power state supporting the condition that the MS document
>> quoted in the original document supports.
>
> If Modern Standby or Connected Standby is not supported by servers, I 
> don’t think the design documents mean much here.
> We probably should check if the platform firmware really supports S2I 
> instead.

That too.  As said this really is a platform decision, and needs to
be managed by the platform code through the PM core.  Individual drivers
like nvme can just implement the behavior, but are the absolute wrong
place to make decisions on what kinds of suspend to enter.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09  6:48             ` Kai-Heng Feng
  2019-05-09  6:52               ` Christoph Hellwig
@ 2019-05-09  9:07               ` Rafael J. Wysocki
  2019-05-09  9:42                 ` Kai-Heng Feng
  1 sibling, 1 reply; 38+ messages in thread
From: Rafael J. Wysocki @ 2019-05-09  9:07 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Christoph Hellwig, Rafael Wysocki, Mario Limonciello,
	Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme,
	Linux PM, LKML

On Thu, May 9, 2019 at 8:49 AM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
>
> Cc Rafael and linux-pm

It would have been much more useful to CC the patch to linux-pm at
least from the outset.

> at 14:12, Christoph Hellwig <hch@lst.de> wrote:
>
> > On Wed, May 08, 2019 at 08:28:30PM +0000, Mario.Limonciello@dell.com wrote:
> >> You might think this would be adding runtime_suspend/runtime_resume
> >> callbacks, but those also get called actually at runtime which is not
> >> the goal here.  At runtime, these types of disks should rely on APST which
> >> should calculate the appropriate latencies around the different power
> >> states.
> >>
> >> This code path is only applicable in the suspend to idle state, which
> >> /does/
> >> call suspend/resume functions associated with dev_pm_ops.  There isn't
> >> a dedicated function in there for use only in suspend to idle, which is
> >> why pm_suspend_via_s2idle() needs to get called.
> >
> > The problem is that it also gets called for other paths:
> >
> > #ifdef CONFIG_PM_SLEEP
> > #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
> >         .suspend = suspend_fn, \
> >       .resume = resume_fn, \
> >       .freeze = suspend_fn, \
> >       .thaw = resume_fn, \
> >       .poweroff = suspend_fn, \
> >       .restore = resume_fn,
> > #else
> > #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn)
> > #endif
> >
> > #define SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \
> > const struct dev_pm_ops name = { \
> >       SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
> > }
> >
> > And at least for poweroff this new code seems completely wrong, even
> > for freeze it looks rather borderline.
>
> Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so
> the old code path will be taken.
>
> >
> > And more to the point - if these "modern MS standby" systems are
> > becoming common, which it looks they are, we need support in the PM core
> > for those instead of working around the decisions in low-level drivers.
>
> Rafael, what do you think about this?

The difference between suspend-to-idle and suspend-to-RAM (S3) boils
down to the fact that at the end of S3 suspend all control of the
system is passed to the platform firmware.  Then, the firmware can
take care of some things that may not be taken care of by drivers (it
sometimes even assumes that drivers will not take care of those things,
which is generally problematic).

For suspend-to-idle the final physical state of the system should (in
theory) be the same as the deepest possible physical idle state of it
achievable through working-state PM (combination of PM-runtime and
cpuidle, roughly speaking).  However, in practice the working-state PM
may not even be able to get there, as it generally requires many
things to happen exactly at the right time in a specific order and the
probability of that in the working-state PM situation is practically
0.  Suspend-to-idle helps here by quiescing everything in an ordered
fashion which makes all of the requisite conditions more likely to be
met together.

So yes, from an individual driver perspective, the device handling for
s2idle should be more like for PM-runtime than for S3 (s2R), but this
really shouldn't matter (and it doesn't matter for the vast majority
of drivers).

Unfortunately, the "modern MS standby" concept makes it matter,
because "modern MS standby" causes system-wide transitions to be
"special" and it appears to expect drivers to take care of the "extra
bit" that would have been taken care of by the platform firmware in
the S3 case.  [Note that in the Windows world the "modern MS standby"
systems don't support S3 ("modern MS standby" and S3 support are
mutually exclusive in Windows AFAICS) while Linux needs to support S3
and is expected to achieve the minimum power state through s2idle
(generally, even on the same platform) at the same time.]

> Including this patch, there are five drivers that use
> pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3.

Well, that is not a large number relative to the total number of
drivers in Linux.

> So I think maybe it’s time to introduce a new suspend callback for S2I?

That would be a set of 6 new suspend and resume callbacks, mind you,
and there's quite a few of them already.  And the majority of drivers
would not need to use them anyway.

Also, please note that, possibly apart from the device power state
setting, the S2I and S2R handling really aren't that different at all.
You basically need to carry out the same preparations during suspend
and reverse them during resume in both cases.

That said I admit that there are cases in which device drivers need to
know that the system-wide transition under way is into s2idle and so
they should do extra stuff.  If pm_suspend_via_firmware() is not
sufficient for that, then I'm open to other suggestions, but
introducing a new set of callbacks for that alone would be rather
excessive IMO.

> >> SIMPLE_DEV_PM_OPS normally sets the same function for suspend and
> >> freeze (hibernate), so to avoid any changes to the hibernate case it seems
> >> to me that there needs to be a new nvme_freeze() that calls into the
> >> existing
> >> nvme_dev_disable for the freeze pm op and nvme_thaw() that calls into the
> >> existing nvme_reset_ctrl for the thaw pm op.
> >
> > At least, yes.
>
> Hibernation should remain the same as stated above.

Depending on what check is used in that code path.
pm_suspend_via_s2idle() will return "true" in the hibernation path
too, for one.

> >>> enterprise class NVMe devices
> >>> that don't do APST and don't really do different power states at
> >>> all in many cases.
> >>
> >> Enterprise class NVMe devices that don't do APST - do they typically
> >> have a non-zero value for ndev->ctrl.npss?
> >>
> >> If not, they wouldn't enter this new codepath even if the server entered
> >> into S2I.
> >
> > No, devices that do set NPSS will have at least some power states
> > per definition, although they might not be too useful.  I suspect checking
> > APSTA might be safer, but if we don't want to rely on APST we should
> > check for a power state supporting the condition that the MS document
> > quoted in the original document supports.
>
> If Modern Standby or Connected Standby is not supported by servers, I don’t
> think the design documents mean much here.
> We probably should check if the platform firmware really supports S2I
> instead.

S2I is expected to work regardless of the platform firmware and there
is nothing like "platform firmware support for S2I".  IOW, that check
would always return "false".

What you really need to know is if the given particular transition is S2I.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09  6:52               ` Christoph Hellwig
@ 2019-05-09  9:19                 ` Rafael J. Wysocki
  2019-05-09  9:25                   ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Rafael J. Wysocki @ 2019-05-09  9:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kai-Heng Feng, Rafael Wysocki, Mario Limonciello, Keith Busch,
	Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme, Linux PM,
	LKML

On Thu, May 9, 2019 at 8:52 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Thu, May 09, 2019 at 02:48:59PM +0800, Kai-Heng Feng wrote:
> > Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so
> > the old code path will be taken.
> >
> >>
> >> And more to the point - if these "modern MS standby" systems are
> >> becoming common, which it looks they are, we need support in the PM core
> >> for those instead of working around the decisions in low-level drivers.
> >
> > Rafael, what do you think about this?
> > Including this patch, there are five drivers that use
> > pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3.
> > So I think maybe it’s time to introduce a new suspend callback for S2I?
>
> We also really need something like that to avoid the PCI_DEV_FLAGS_NO_D3
> abuse - that flag is a quirk statically set on a device at probe time
> to prevent any entering of D3 state.

I agree that PCI_DEV_FLAGS_NO_D3 has to be avoided.

However, IMO introducing a new set of suspend (and resume) callbacks
for S2I would not be practical, because

(a) the only difference between S2I and S2R from a driver perspective
is that it may be expected to do something "special" about setting the
device power state in the S2I case (the rest of what needs to be done
during system-wide suspend/resume remains the same in both cases),

(b) the new callbacks would only be really useful for a handful of drivers.

> >> per definition, although they might not be too useful.  I suspect checking
> >> APSTA might be safer, but if we don't want to rely on APST we should
> >> check for a power state supporting the condition that the MS document
> >> quoted in the original document supports.
> >
> > If Modern Standby or Connected Standby is not supported by servers, I
> > don’t think the design documents mean much here.
> > We probably should check if the platform firmware really supports S2I
> > instead.
>
> That too.  As said this really is a platform decision, and needs to
> be managed by the platform code through the PM core.

I'm not sure what you mean by "platform decision" here.

>  Individual drivers like nvme can just implement the behavior, but are the absolute wrong
> place to make decisions on what kinds of suspend to enter.

Right, the choice of the target system state has already been made
when their callbacks get invoked (and it has been made by user space,
not by the platform).

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09  9:19                 ` Rafael J. Wysocki
@ 2019-05-09  9:25                   ` Christoph Hellwig
  2019-05-09 20:48                     ` Rafael J. Wysocki
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2019-05-09  9:25 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christoph Hellwig, Kai-Heng Feng, Rafael Wysocki,
	Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe,
	Sagi Grimberg, linux-nvme, Linux PM, LKML

On Thu, May 09, 2019 at 11:19:37AM +0200, Rafael J. Wysocki wrote:
> Right, the choice of the target system state has already been made
> when their callbacks get invoked (and it has been made by user space,
> not by the platform).

From a previous discussion I remember that the main problem here is that
a lot of consumer NVMe devices use more power when put into D3hot than
when the device is left to manage the power state transitions itself.
Based on this patch there might also be other devices that want an
explicit power state transition from the host, but still must not be
put into D3hot.

The avoid-D3hot-at-all-cost thing seems to be based on the Windows
broken^H^H^H^H^H^Hmodern standby principles.  So for platforms that
follow the modern standby model we need to avoid putting NVMe devices
that support power management into D3hot somehow.  This patch does a
few more things, but at least for the device where I was involved in
the earlier discussion those are not needed, and from the Linux
point of view many of them seem wrong too.

How do you think we best make that distinction?  Are the pm_ops
enough if we don't use the simple version?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09  9:07               ` Rafael J. Wysocki
@ 2019-05-09  9:42                 ` Kai-Heng Feng
  2019-05-09  9:56                   ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Kai-Heng Feng @ 2019-05-09  9:42 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christoph Hellwig, Rafael Wysocki, Mario Limonciello,
	Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme,
	Linux PM, LKML

at 17:07, Rafael J. Wysocki <rafael@kernel.org> wrote:

> On Thu, May 9, 2019 at 8:49 AM Kai-Heng Feng
> <kai.heng.feng@canonical.com> wrote:
>> Cc Rafael and linux-pm
>
> It would have been much more useful to CC the patch to linux-pm at
> least from the outset.
>
>> at 14:12, Christoph Hellwig <hch@lst.de> wrote:
>>
>>> On Wed, May 08, 2019 at 08:28:30PM +0000, Mario.Limonciello@dell.com  
>>> wrote:
>>>> You might think this would be adding runtime_suspend/runtime_resume
>>>> callbacks, but those also get called actually at runtime which is not
>>>> the goal here.  At runtime, these types of disks should rely on APST  
>>>> which
>>>> should calculate the appropriate latencies around the different power
>>>> states.
>>>>
>>>> This code path is only applicable in the suspend to idle state, which
>>>> /does/
>>>> call suspend/resume functions associated with dev_pm_ops.  There isn't
>>>> a dedicated function in there for use only in suspend to idle, which is
>>>> why pm_suspend_via_s2idle() needs to get called.
>>>
>>> The problem is that it also gets called for other paths:
>>>
>>> #ifdef CONFIG_PM_SLEEP
>>> #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
>>>         .suspend = suspend_fn, \
>>>       .resume = resume_fn, \
>>>       .freeze = suspend_fn, \
>>>       .thaw = resume_fn, \
>>>       .poweroff = suspend_fn, \
>>>       .restore = resume_fn,
>>> #else
>>> #define SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn)
>>> #endif
>>>
>>> #define SIMPLE_DEV_PM_OPS(name, suspend_fn, resume_fn) \
>>> const struct dev_pm_ops name = { \
>>>       SET_SYSTEM_SLEEP_PM_OPS(suspend_fn, resume_fn) \
>>> }
>>>
>>> And at least for poweroff this new code seems completely wrong, even
>>> for freeze it looks rather borderline.
>>
>> Not really, for hibernation pm_suspend_via_s2idle() evaluates to false so
>> the old code path will be taken.
>>
>>> And more to the point - if these "modern MS standby" systems are
>>> becoming common, which it looks they are, we need support in the PM core
>>> for those instead of working around the decisions in low-level drivers.
>>
>> Rafael, what do you think about this?
>
> The difference between suspend-to-idle and suspend-to-RAM (S3) boils
> down to the fact that at the end of S3 suspend all control of the
> system is passed to the platform firmware.  Then, the firmware can
> take care of some things that may not be taken care of by drivers (it
> sometimes even assumes that drivers will not take care of those things,
> which is generally problematic).
>
> For suspend-to-idle the final physical state of the system should (in
> theory) be the same as the deepest possible physical idle state of it
> achievable through working-state PM (combination of PM-runtime and
> cpuidle, roughly speaking).  However, in practice the working-state PM
> may not even be able to get there, as it generally requires many
> things to happen exactly at the right time in a specific order and the
> probability of that in the working-state PM situation is practically
> 0.  Suspend-to-idle helps here by quiescing everything in an ordered
> fashion which makes all of the requisite conditions more likely to be
> met together.
>
> So yes, from an individual driver perspective, the device handling for
> s2idle should be more like for PM-runtime than for S3 (s2R), but this
> really shouldn't matter (and it doesn't matter for the vast majority
> of drivers).
>
> Unfortunately, the "modern MS standby" concept makes it matter,
> because "modern MS standby" causes system-wide transitions to be
> "special" and it appears to expect drivers to take care of the "extra
> bit" that would have been taken care of by the platform firmware in
> the S3 case.  [Note that in the Windows world the "modern MS standby"
> systems don't support S3 ("modern MS standby" and S3 support are
> mutually exclusive in Windows AFAICS) while Linux needs to support S3
> and is expected to achieve the minimum power state through s2idle
> (generally, even on the same platform) at the same time.]
>
>> Including this patch, there are five drivers that use
>> pm_suspend_via_{firmware,s2idle}() to differentiate between S2I and S3.
>
> Well, that is not a large number relative to the total number of
> drivers in Linux.

That’s right, but I think we are going to see more similar cases.

>
>> So I think maybe it’s time to introduce a new suspend callback for S2I?
>
> That would be a set of 6 new suspend and resume callbacks, mind you,
> and there's quite a few of them already.  And the majority of drivers
> would not need to use them anyway.

I think suspend_to_idle() and resume_from_idle() should be enough?
What are the other 4 callbacks?

>
> Also, please note that, possibly apart from the device power state
> setting, the S2I and S2R handling really aren't that different at all.
> You basically need to carry out the same preparations during suspend
> and reverse them during resume in both cases.

But for this case, it’s quite different from the original suspend and
resume callbacks.

>
> That said I admit that there are cases in which device drivers need to
> know that the system-wide transition under way is into s2idle and so
> they should do extra stuff.  If pm_suspend_via_firmware() is not
> sufficient for that, then I'm open to other suggestions, but
> introducing a new set of callbacks for that alone would be rather
> excessive IMO.

From the drivers’ perspective nothing changes, as the PM core can
prioritize suspend_to_idle() over suspend() when the transition is
actually S2I.

>
>>>> SIMPLE_DEV_PM_OPS normally sets the same function for suspend and
>>>> freeze (hibernate), so to avoid any changes to the hibernate case it  
>>>> seems
>>>> to me that there needs to be a new nvme_freeze() that calls into the
>>>> existing
>>>> nvme_dev_disable for the freeze pm op and nvme_thaw() that calls into  
>>>> the
>>>> existing nvme_reset_ctrl for the thaw pm op.
>>>
>>> At least, yes.
>>
>> Hibernation should remain the same as stated above.
>
> Depending on what check is used in that code path.
> pm_suspend_via_s2idle() will return "true" in the hibernation path
> too, for one.

You are right, I should use !pm_suspend_via_firmware() instead.

>
>>>>> enterprise class NVMe devices
>>>>> that don't do APST and don't really do different power states at
>>>>> all in many cases.
>>>>
>>>> Enterprise class NVMe devices that don't do APST - do they typically
>>>> have a non-zero value for ndev->ctrl.npss?
>>>>
>>>> If not, they wouldn't enter this new codepath even if the server entered
>>>> into S2I.
>>>
>>> No, devices that do set NPSS will have at least some power states
>>> per definition, although they might not be too useful.  I suspect  
>>> checking
>>> APSTA might be safer, but if we don't want to rely on APST we should
>>> check for a power state supporting the condition that the MS document
>>> quoted in the original document supports.
>>
>> If Modern Standby or Connected Standby is not supported by servers, I  
>> don’t
>> think the design documents mean much here.
>> We probably should check if the platform firmware really supports S2I
>> instead.
>
> S2I is expected to work regardless of the platform firmware and there
> is nothing like "platform firmware support for S2I".  IOW, that check
> would always return "false".
>
> What you really need to know is if the given particular transition is S2I.

Maybe a helper based on the FADT flag and _DSM could do this?

Kai-Heng



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09  9:42                 ` Kai-Heng Feng
@ 2019-05-09  9:56                   ` Christoph Hellwig
  2019-05-09 10:28                     ` Kai-Heng Feng
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2019-05-09  9:56 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Rafael J. Wysocki, Christoph Hellwig, Rafael Wysocki,
	Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe,
	Sagi Grimberg, linux-nvme, Linux PM, LKML

On Thu, May 09, 2019 at 05:42:30PM +0800, Kai-Heng Feng wrote:
>> That would be a set of 6 new suspend and resume callbacks, mind you,
>> and there's quite a few of them already.  And the majority of drivers
>> would not need to use them anyway.
>
> I think suspend_to_idle() and resume_from_idle() should be enough?
> What are other 4 callbacks?
>
>>
>> Also, please note that, possibly apart from the device power state
>> setting, the S2I and S2R handling really aren't that different at all.
>> You basically need to carry out the same preparations during suspend
>> and reverse them during resume in both cases.
>
> But for this case, it’s quite different to the original suspend and 
> resume callbacks.

Let's think about what cases we need.

The "classic" suspend in the nvme driver basically shuts down the
device entirely.  This is useful for:

 a) devices that have no power management
 b) system power states that eventually power off the entire PCIe bus.
    I think that would be:

     - suspend to disk (hibernate)
     - classic suspend to ram

Then we have the sequence in your patch.  This seems to be related to
some of the MS wording, but I'm not sure what, for example, tearing down
the queues buys us.  Can you explain a bit more where those bits
make a difference?

Otherwise I think we should use a "no-op" suspend, just leaving the
power management to the device, or simply set the device to the
deepest power state, for everything else, where everything else is
suspend or suspend to idle.

And of course then we have Windows modern standby actually mandating
runtime D3 in some cases, and vague handwaving mentions of this being
forced on the platforms, which I'm not entirely sure how to fit
into the above picture.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09  9:56                   ` Christoph Hellwig
@ 2019-05-09 10:28                     ` Kai-Heng Feng
  2019-05-09 10:31                       ` Christoph Hellwig
  2019-05-09 16:20                       ` Keith Busch
  0 siblings, 2 replies; 38+ messages in thread
From: Kai-Heng Feng @ 2019-05-09 10:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Rafael J. Wysocki, Rafael Wysocki, Mario Limonciello,
	Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme,
	Linux PM, LKML

at 17:56, Christoph Hellwig <hch@lst.de> wrote:

> On Thu, May 09, 2019 at 05:42:30PM +0800, Kai-Heng Feng wrote:
>>> That would be a set of 6 new suspend and resume callbacks, mind you,
>>> and there's quite a few of them already.  And the majority of drivers
>>> would not need to use them anyway.
>>
>> I think suspend_to_idle() and resume_from_idle() should be enough?
>> What are other 4 callbacks?
>>
>>> Also, please note that, possibly apart from the device power state
>>> setting, the S2I and S2R handling really aren't that different at all.
>>> You basically need to carry out the same preparations during suspend
>>> and reverse them during resume in both cases.
>>
>> But for this case, it’s quite different to the original suspend and
>> resume callbacks.
>
> Let's think of what cases we needed.
>
> The "classic" suspend in the nvme driver basically shuts down the
> device entirely.  This is useful for:
>
>  a) device that have no power management
>  b) System power states that eventually power off the entire PCIe bus.
>     I think that would:
>
>      - suspend to disk (hibernate)
>      - classic suspend to ram
>
> The we have the sequence in your patch.  This seems to be related to
> some of the MS wording, but I'm not sure what for example tearing down
> the queues buys us.  Can you explain a bit more where those bits
> make a difference?

Based on my testing, if the queues (IRQs) are not disabled, the NVMe
controller won’t be quiesced.
Symptoms can be high power drain or a system freeze.

I can check with the vendors whether this is also necessary under Windows.

>
> Otherwise I think we should use a "no-op" suspend, just leaving the
> power management to the device, or a simple setting the device to the
> deepest power state for everything else, where everything else is
> suspend, or suspend to idle.

I am not sure I get your idea. Does this “no-op” suspend happen in NVMe  
driver or PM core?

>
> And of course then we have Windows modern standby actually mandating
> runtime D3 in some cases, and vague handwaving mentions of this being
> forced on the platforms, which I'm not entirely sure how they fit
> into the above picture.

I was told that Windows doesn’t use runtime D3; APST is used exclusively.

Kai-Heng



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09 10:28                     ` Kai-Heng Feng
@ 2019-05-09 10:31                       ` Christoph Hellwig
  2019-05-09 11:59                         ` Kai-Heng Feng
  2019-05-09 16:20                       ` Keith Busch
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2019-05-09 10:31 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Christoph Hellwig, Rafael J. Wysocki, Rafael Wysocki,
	Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe,
	Sagi Grimberg, linux-nvme, Linux PM, LKML

On Thu, May 09, 2019 at 06:28:32PM +0800, Kai-Heng Feng wrote:
> Based on my testing, if the queues (IRQs) are not disabled, the NVMe
> controller won’t be quiesced.
> Symptoms can be high power drain or a system freeze.
>
> I can check with the vendors whether this is also necessary under Windows.

System freeze sounds odd.  And we had a patch from a person on the
Cc list here that was handed to me through a few indirections that
just skips the suspend entirely for some cases, which seemed to
work fine with the controllers in question.

>> Otherwise I think we should use a "no-op" suspend, just leaving the
>> power management to the device, or a simple setting the device to the
>> deepest power state for everything else, where everything else is
>> suspend, or suspend to idle.
>
> I am not sure I get your idea. Does this “no-op” suspend happen in NVMe 
> driver or PM core?

no-op means we don't want to do anything in nvme.  Whether that happens
by not calling into nvme or by stubbing out the method for that
particular case does not matter.

>> And of course than we have windows modern standby actually mandating
>> runtime D3 in some case, and vague handwaving mentions of this being
>> forced on the platforms, which I'm not entirely sure how they fit
>> into the above picture.
>
> I was told that Windows doesn’t use runtime D3, APST is used exclusively.

As far as I know the default power management mode in the Microsoft
NVMe driver is explicit power management transitions, and in the Intel
RST driver that is commonly used it is APST.  But both could still
be combined with runtime D3 in theory, I'm just not sure if they are.

Microsoft has been pushing for aggressive runtime D3 for a while, but
I don't know if that includes NVMe devices.

>
> Kai-Heng
>
---end quoted text---

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09 10:31                       ` Christoph Hellwig
@ 2019-05-09 11:59                         ` Kai-Heng Feng
  2019-05-09 18:57                           ` Mario.Limonciello
  0 siblings, 1 reply; 38+ messages in thread
From: Kai-Heng Feng @ 2019-05-09 11:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Rafael J. Wysocki, Rafael Wysocki, Mario Limonciello,
	Keith Busch, Keith Busch, Jens Axboe, Sagi Grimberg, linux-nvme,
	Linux PM, LKML

at 18:31, Christoph Hellwig <hch@lst.de> wrote:

> On Thu, May 09, 2019 at 06:28:32PM +0800, Kai-Heng Feng wrote:
>> Based on my testing if queues (IRQ) are not disabled, NVMe controller
>> won’t be quiesced.
>> Symptoms can be high power drain or system freeze.
>>
>> I can check with vendors whether this is also necessary under Windows.
>
> System freeze sounds odd.  And we had a patch from a person on the
> Cc list here that was handed to me through a few indirections that
> just skips the suspend entirely for some cases, which seemed to
> work fine with the controllers in question.

That works fine for some devices, but for Toshiba NVMes that scenario
freezes the system, hence the new patch here.

And for all the NVMes I tested, this new suspend routine saves even more
power than simply skipping suspend.

>
>>> Otherwise I think we should use a "no-op" suspend, just leaving the
>>> power management to the device, or simply setting the device to the
>>> deepest power state for everything else, where everything else is
>>> suspend, or suspend to idle.
>>
>> I am not sure I get your idea. Does this “no-op” suspend happen in NVMe
>> driver or PM core?
>
> no-op means we don't want to do anything in nvme.  Whether that happens
> by not calling into nvme or by stubbing out the method for that particular
> case does not matter.

Ok, but we still need to figure out how to prevent the device from
transitioning to D3.

>
>>> And of course then we have Windows Modern Standby actually mandating
>>> runtime D3 in some cases, and vague handwaving mentions of this being
>>> forced on the platforms, which I'm not entirely sure how they fit
>>> into the above picture.
>>
>> I was told that Windows doesn’t use runtime D3, APST is used exclusively.
>
> As far as I know the default power management mode in the Microsoft
> NVMe driver is explicit power state transitions, and in the Intel
> RST driver that is commonly used it is APST.  But both could still
> be combined with runtime D3 in theory, I'm just not sure if they are.
>
> Microsoft has been pushing for aggressive runtime D3 for a while, but
> I don't know if that includes NVMe devices.

Ok, I’ll check with vendors about this.

Kai-Heng

>
>> Kai-Heng
> ---end quoted text---



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09 10:28                     ` Kai-Heng Feng
  2019-05-09 10:31                       ` Christoph Hellwig
@ 2019-05-09 16:20                       ` Keith Busch
  1 sibling, 0 replies; 38+ messages in thread
From: Keith Busch @ 2019-05-09 16:20 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Christoph Hellwig, Rafael J. Wysocki, Wysocki, Rafael J,
	Mario Limonciello, Busch, Keith, Jens Axboe, Sagi Grimberg,
	linux-nvme, Linux PM, LKML

On Thu, May 09, 2019 at 03:28:32AM -0700, Kai-Heng Feng wrote:
> at 17:56, Christoph Hellwig <hch@lst.de> wrote:
> > Then we have the sequence in your patch.  This seems to be related to
> > some of the MS wording, but I'm not sure what for example tearing down
> > the queues buys us.  Can you explain a bit more where those bits
> > make a difference?
> 
> Based on my testing if queues (IRQ) are not disabled, NVMe controller won’t  
> be quiesced.
> Symptoms can be high power drain or system freeze.
> 
> I can check with vendors whether this is also necessary under Windows.

Hm, that doesn't sound right based on the spec's description of how this
feature works. We should not need to tear down IO queues for entering
low power, nor reset the controller to exit it.

The sequence for entering non-operational low power should just be freeze
request queues, set the power feature, then unfreeze request queues.

We shouldn't have to do anything to exit the state as the spec requires
devices autonomously return to operational when an IO doorbell is written.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09 11:59                         ` Kai-Heng Feng
@ 2019-05-09 18:57                           ` Mario.Limonciello
  2019-05-09 19:28                             ` Keith Busch
  0 siblings, 1 reply; 38+ messages in thread
From: Mario.Limonciello @ 2019-05-09 18:57 UTC (permalink / raw)
  To: kai.heng.feng, hch
  Cc: rafael, rafael.j.wysocki, kbusch, keith.busch, axboe, sagi,
	linux-nvme, linux-pm, linux-kernel

> >
> >>> Otherwise I think we should use a "no-op" suspend, just leaving the
> >>> power management to the device, or simply setting the device to the
> >>> deepest power state for everything else, where everything else is
> >>> suspend, or suspend to idle.
> >>
> >> I am not sure I get your idea. Does this “no-op” suspend happen in NVMe
> >> driver or PM core?
> >
> > no-op means we don't want to do anything in nvme.  Whether that happens
> > by not calling into nvme or by stubbing out the method for that particular
> > case does not matter.
> 
> Ok, but we still need to figure out how to prevent the device from
> transitioning to D3.

This so-called no-op was something we experimented with while developing
this patch, but we found that it did not help power consumption on all drives.

That's why the current patch makes an explicit call to enter the deepest power state.

> 
> >
> >>> And of course then we have Windows Modern Standby actually mandating
> >>> runtime D3 in some cases, and vague handwaving mentions of this being
> >>> forced on the platforms, which I'm not entirely sure how they fit
> >>> into the above picture.
> >>
> >> I was told that Windows doesn’t use runtime D3, APST is used exclusively.
> >
> > As far as I know the default power management mode in the Microsoft
> > NVMe driver is explicit power state transitions, and in the Intel
> > RST driver that is commonly used it is APST.  But both could still
> > be combined with runtime D3 in theory, I'm just not sure if they are.
> >
> > Microsoft has been pushing for aggressive runtime D3 for a while, but
> > I don't know if that includes NVMe devices.
> 
> Ok, I’ll check with vendors about this.
> 

No, current Windows versions don't transition to D3 with the inbox NVMe driver.
You're correct, it uses explicit state transitions even if APST is enabled
(as this patch is currently doing as well).



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09 18:57                           ` Mario.Limonciello
@ 2019-05-09 19:28                             ` Keith Busch
  2019-05-09 20:54                               ` Rafael J. Wysocki
                                                 ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Keith Busch @ 2019-05-09 19:28 UTC (permalink / raw)
  To: Mario.Limonciello
  Cc: kai.heng.feng, hch, axboe, sagi, rafael, linux-pm,
	rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch

On Thu, May 09, 2019 at 06:57:34PM +0000, Mario.Limonciello@dell.com wrote:
> No, current Windows versions don't transition to D3 with inbox NVME driver.
> You're correct, it's explicit state transitions even if APST was enabled
> (as this patch is currently doing as well).

The proposed patch does too much, and your resume latency will be worse
off for doing an unnecessary controller reset.

The following should be all that's needed if the device is spec
compliant. The resume part isn't necessary if npss is non-operational, but
we're not saving that info, and it shouldn't hurt to be explicit anyway.

I don't have any PS capable devices, so this is just compile tested.

---
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6265d9225ec8..ce8b9bc949b9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1132,6 +1132,22 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword
 	return ret;
 }
 
+int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss)
+{
+	int ret;
+
+	mutex_lock(&ctrl->scan_lock);
+	nvme_start_freeze(ctrl);
+	nvme_wait_freeze(ctrl);
+	ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0,
+				NULL);
+	nvme_unfreeze(ctrl);
+	mutex_unlock(&ctrl->scan_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(nvme_set_power);
+
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 {
 	u32 q_count = (*count - 1) | ((*count - 1) << 16);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 527d64545023..f2be6aad9804 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -459,6 +459,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 		unsigned timeout, int qid, int at_head,
 		blk_mq_req_flags_t flags, bool poll);
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
+int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss);
 void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
 int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
 int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index a90cf5d63aac..2c4154cb4e79 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -18,6 +18,7 @@
 #include <linux/mutex.h>
 #include <linux/once.h>
 #include <linux/pci.h>
+#include <linux/suspend.h>
 #include <linux/t10-pi.h>
 #include <linux/types.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
@@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev)
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
+	if (!pm_suspend_via_firmware())
+		return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);
 	nvme_dev_disable(ndev, true);
 	return 0;
 }
@@ -2860,6 +2863,8 @@ static int nvme_resume(struct device *dev)
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
+	if (!pm_suspend_via_firmware())
+		return nvme_set_power(&ndev->ctrl, 0);
 	nvme_reset_ctrl(&ndev->ctrl);
 	return 0;
 }
--

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09  9:25                   ` Christoph Hellwig
@ 2019-05-09 20:48                     ` Rafael J. Wysocki
  0 siblings, 0 replies; 38+ messages in thread
From: Rafael J. Wysocki @ 2019-05-09 20:48 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Rafael J. Wysocki, Kai-Heng Feng, Rafael Wysocki,
	Mario Limonciello, Keith Busch, Keith Busch, Jens Axboe,
	Sagi Grimberg, linux-nvme, Linux PM, LKML

On Thu, May 9, 2019 at 11:25 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Thu, May 09, 2019 at 11:19:37AM +0200, Rafael J. Wysocki wrote:
> > Right, the choice of the target system state has already been made
> > when their callbacks get invoked (and it has been made by user space,
> > not by the platform).
>
> From a previous discussion I remember the main problem here is that
> a lot of consumer NVMe use more power when put into D3hot than just
> letting the device itself manage the power state transitions themselves.
> Based on this patch there also might be some other device that want
> an explicit power state transition from the host, but still not be
> put into D3hot.
>
> The avoid D3hot at all cost thing seems to be based on the Windows
> broken^H^H^H^H^H^Hmodern standby principles.  So for platforms that
> follow the modern standby model we need to avoid putting NVMe devices
> that support power management into D3hot somehow.  This patch does a
> few more things, but at least for the device where I was involved in
> the earlier discussion those are not needed, and from the Linux
> point of view many of them seem wrong too.
>
> How do you think we best make that distinction?  Are the pm_ops
> enough if we don't use the simple version?

First, I think that it is instructive to look at what happens without
the patch: nvme_suspend() gets called by pci_pm_suspend() (which
basically causes the device to be "stopped" IIUC) and then
pci_pm_suspend_noirq() is expected to put the device into the right
power state through pci_prepare_to_sleep().  In theory, this should
work for both S2R and S2I as long as the standard PCIe PM plus
possibly ACPI PM is sufficient for the device.  [Of course, the
platform firmware invoked at the last stage of S2R can "fix up" things
to reduce power further, but that should not be necessary if all is
handled properly up to this point.]

The claim in the patch changelog is that one design choice in Windows
related to "Modern Standby" has caused our default PCI PM to not apply
to NVMe devices in general (or to apply to them, but without much
effect, which is practically equivalent IMO).  This is not about a
"different paradigm" (as Mario put it) or a different type of system
suspend, but about the default PCI PM being basically useless for
those devices at least in some configurations.

And BTW, the same problem would have affected PM-runtime, had it been
supported by the nvme driver, because Linux uses the combination of
the standard PCIe PM and ACPI PM for PM-runtime too, and the
"paradigm" in there is pretty much the same as for S2I, so let's not
confuse things, pretty please.

All of this means that the driver needs to override the default PCI PM
like in the patch that Keith has just posted.  Unfortunately, it looks
like the "suspend via firmware" check needs to be there, because the
platform firmware doing S3 on some platforms may get confused by the
custom PM in the driver.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09 19:28                             ` Keith Busch
@ 2019-05-09 20:54                               ` Rafael J. Wysocki
  2019-05-09 21:16                                 ` Keith Busch
  2019-05-09 21:37                               ` Mario.Limonciello
  2019-05-10  5:30                               ` Christoph Hellwig
  2 siblings, 1 reply; 38+ messages in thread
From: Rafael J. Wysocki @ 2019-05-09 20:54 UTC (permalink / raw)
  To: Keith Busch
  Cc: Mario Limonciello, Kai-Heng Feng, Christoph Hellwig, Jens Axboe,
	Sagi Grimberg, Rafael J. Wysocki, Linux PM, Rafael Wysocki,
	Linux Kernel Mailing List, linux-nvme, Keith Busch

On Thu, May 9, 2019 at 9:33 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Thu, May 09, 2019 at 06:57:34PM +0000, Mario.Limonciello@dell.com wrote:
> > No, current Windows versions don't transition to D3 with inbox NVME driver.
> > You're correct, it's explicit state transitions even if APST was enabled
> > (as this patch is currently doing as well).
>
> The proposed patch does too much, and your resume latency will be worse
> off for doing an unnecessary controller reset.
>
> The following should be all that's needed if the device is spec
> compliant. The resume part isn't necessary if npss is non-operational, but
> we're not saving that info, and it shouldn't hurt to be explicit anyway.
>
> I don't have any PS capable devices, so this is just compile tested.
>
> ---
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 6265d9225ec8..ce8b9bc949b9 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1132,6 +1132,22 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword
>         return ret;
>  }
>
> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss)
> +{
> +       int ret;
> +
> +       mutex_lock(&ctrl->scan_lock);
> +       nvme_start_freeze(ctrl);
> +       nvme_wait_freeze(ctrl);
> +       ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0,
> +                               NULL);
> +       nvme_unfreeze(ctrl);
> +       mutex_unlock(&ctrl->scan_lock);
> +
> +       return ret;
> +}
> +EXPORT_SYMBOL_GPL(nvme_set_power);
> +
>  int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
>  {
>         u32 q_count = (*count - 1) | ((*count - 1) << 16);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 527d64545023..f2be6aad9804 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -459,6 +459,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
>                 unsigned timeout, int qid, int at_head,
>                 blk_mq_req_flags_t flags, bool poll);
>  int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss);
>  void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
>  int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
>  int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl);
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index a90cf5d63aac..2c4154cb4e79 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -18,6 +18,7 @@
>  #include <linux/mutex.h>
>  #include <linux/once.h>
>  #include <linux/pci.h>
> +#include <linux/suspend.h>
>  #include <linux/t10-pi.h>
>  #include <linux/types.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
> @@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev)
>         struct pci_dev *pdev = to_pci_dev(dev);
>         struct nvme_dev *ndev = pci_get_drvdata(pdev);
>
> +       if (!pm_suspend_via_firmware())
> +               return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);

You probably want to call pci_save_state(pdev) in the branch above to
prevent pci_pm_suspend_noirq() from calling pci_prepare_to_sleep()
going forward, so I would write this routine as

if (pm_suspend_via_firmware()) {
        nvme_dev_disable(ndev, true);
        return 0;
}

pci_save_state(pdev);
return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);

>         nvme_dev_disable(ndev, true);
>         return 0;
>  }
> @@ -2860,6 +2863,8 @@ static int nvme_resume(struct device *dev)
>         struct pci_dev *pdev = to_pci_dev(dev);
>         struct nvme_dev *ndev = pci_get_drvdata(pdev);
>
> +       if (!pm_suspend_via_firmware())
> +               return nvme_set_power(&ndev->ctrl, 0);
>         nvme_reset_ctrl(&ndev->ctrl);
>         return 0;
>  }

The rest of the patch LGTM.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09 20:54                               ` Rafael J. Wysocki
@ 2019-05-09 21:16                                 ` Keith Busch
  2019-05-09 21:39                                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 38+ messages in thread
From: Keith Busch @ 2019-05-09 21:16 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Mario Limonciello, Kai-Heng Feng, Christoph Hellwig, Jens Axboe,
	Sagi Grimberg, Linux PM, Rafael Wysocki,
	Linux Kernel Mailing List, linux-nvme, Keith Busch

On Thu, May 09, 2019 at 10:54:04PM +0200, Rafael J. Wysocki wrote:
> On Thu, May 9, 2019 at 9:33 PM Keith Busch <kbusch@kernel.org> wrote:
> >  #include <linux/io-64-nonatomic-lo-hi.h>
> > @@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev)
> >         struct pci_dev *pdev = to_pci_dev(dev);
> >         struct nvme_dev *ndev = pci_get_drvdata(pdev);
> >
> > +       if (!pm_suspend_via_firmware())
> > +               return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);
> 
> You probably want to call pci_save_state(pdev) in the branch above to
> prevent pci_pm_suspend_noirq() from calling pci_prepare_to_sleep()
> going forward, so I would write this routine as
> 
> if (pm_suspend_via_firmware()) {
>         nvme_dev_disable(ndev, true);
>         return 0;
> }
> 
> pci_save_state(pdev)
> return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);

Ah, good point. I'll make sure that's added and will wait to hear if
there's any other feedback.

I am trying to test the paths by faking out PS capabilities, and have
a question on how to force each:

Running "rtcwake -m freeze ...", that takes the !pm_suspend_via_firmware()
path as I expected.

But trying to test the original path, I thought using "-m mem" would
have been a suspend via firmware, but that is still returning false.

Is that expected? I've only tried this on one platform so far, so might
just be this particular one is missing a firmware capability. 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09 19:28                             ` Keith Busch
  2019-05-09 20:54                               ` Rafael J. Wysocki
@ 2019-05-09 21:37                               ` Mario.Limonciello
  2019-05-09 21:54                                 ` Keith Busch
  2019-05-10  5:30                               ` Christoph Hellwig
  2 siblings, 1 reply; 38+ messages in thread
From: Mario.Limonciello @ 2019-05-09 21:37 UTC (permalink / raw)
  To: kbusch
  Cc: kai.heng.feng, hch, axboe, sagi, rafael, linux-pm,
	rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch

> On Thu, May 09, 2019 at 06:57:34PM +0000, Mario.Limonciello@dell.com wrote:
> > No, current Windows versions don't transition to D3 with inbox NVME driver.
> > You're correct, it's explicit state transitions even if APST was enabled
> > (as this patch is currently doing as well).
> 
> The proposed patch does too much, and your resume latency will be worse
> off for doing an unnecessary controller reset.
> 
> The following should be all that's needed if the device is spec
> compliant. The resume part isn't necessary if npss is non-operational, but
> we're not saving that info, and it shouldn't hurt to be explicit anyway.
> 
> I don't have any PS capable devices, so this is just compile tested.
> 
> ---
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 6265d9225ec8..ce8b9bc949b9 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1132,6 +1132,22 @@ static int nvme_set_features(struct nvme_ctrl *dev,
> unsigned fid, unsigned dword
>  	return ret;
>  }
> 
> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss)
> +{
> +	int ret;
> +
> +	mutex_lock(&ctrl->scan_lock);
> +	nvme_start_freeze(ctrl);
> +	nvme_wait_freeze(ctrl);
> +	ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0,
> +				NULL);
> +	nvme_unfreeze(ctrl);
> +	mutex_unlock(&ctrl->scan_lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(nvme_set_power);

I believe that without memory barriers at the end, disks with HMB will
still kernel panic (such as the Toshiba BG3).

This still allows D3, which we found at least fails to reach the deepest state and blocks
platform S0ix for the following SSDs (maybe others):
Hynix PC601
LiteOn CL1

> +
>  int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
>  {
>  	u32 q_count = (*count - 1) | ((*count - 1) << 16);
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 527d64545023..f2be6aad9804 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -459,6 +459,7 @@ int __nvme_submit_sync_cmd(struct request_queue *q,
> struct nvme_command *cmd,
>  		unsigned timeout, int qid, int at_head,
>  		blk_mq_req_flags_t flags, bool poll);
>  int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss);
>  void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
>  int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
>  int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl);
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index a90cf5d63aac..2c4154cb4e79 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -18,6 +18,7 @@
>  #include <linux/mutex.h>
>  #include <linux/once.h>
>  #include <linux/pci.h>
> +#include <linux/suspend.h>
>  #include <linux/t10-pi.h>
>  #include <linux/types.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
> @@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev)
>  	struct pci_dev *pdev = to_pci_dev(dev);
>  	struct nvme_dev *ndev = pci_get_drvdata(pdev);
> 
> +	if (!pm_suspend_via_firmware())
> +		return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);
>  	nvme_dev_disable(ndev, true);
>  	return 0;
>  }
> @@ -2860,6 +2863,8 @@ static int nvme_resume(struct device *dev)
>  	struct pci_dev *pdev = to_pci_dev(dev);
>  	struct nvme_dev *ndev = pci_get_drvdata(pdev);
> 
> +	if (!pm_suspend_via_firmware())
> +		return nvme_set_power(&ndev->ctrl, 0);
>  	nvme_reset_ctrl(&ndev->ctrl);
>  	return 0;
>  }
> --

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09 21:16                                 ` Keith Busch
@ 2019-05-09 21:39                                   ` Rafael J. Wysocki
  0 siblings, 0 replies; 38+ messages in thread
From: Rafael J. Wysocki @ 2019-05-09 21:39 UTC (permalink / raw)
  To: Keith Busch
  Cc: Rafael J. Wysocki, Mario Limonciello, Kai-Heng Feng,
	Christoph Hellwig, Jens Axboe, Sagi Grimberg, Linux PM,
	Rafael Wysocki, Linux Kernel Mailing List, linux-nvme,
	Keith Busch

On Thu, May 9, 2019 at 11:21 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Thu, May 09, 2019 at 10:54:04PM +0200, Rafael J. Wysocki wrote:
> > On Thu, May 9, 2019 at 9:33 PM Keith Busch <kbusch@kernel.org> wrote:
> > >  #include <linux/io-64-nonatomic-lo-hi.h>
> > > @@ -2851,6 +2852,8 @@ static int nvme_suspend(struct device *dev)
> > >         struct pci_dev *pdev = to_pci_dev(dev);
> > >         struct nvme_dev *ndev = pci_get_drvdata(pdev);
> > >
> > > +       if (!pm_suspend_via_firmware())
> > > +               return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);
> >
> > You probably want to call pci_save_state(pdev) in the branch above to
> > prevent pci_pm_suspend_noirq() from calling pci_prepare_to_sleep()
> > going forward, so I would write this routine as
> >
> > if (pm_suspend_via_firmware()) {
> >         nvme_dev_disable(ndev, true);
> >         return 0;
> > }
> >
> > pci_save_state(pdev)
> > return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);
>
> Ah, good point. I'll make sure that's added and will wait to hear if
> there's any other feedback.
>
> I am trying to test the paths by faking out PS capabilities, and have
> a question on how to force each:
>
> Running "rtcwake -m freeze ...", that takes the !pm_suspend_via_firmware()
> path as I expected.
>
> But trying to test the original path, I thought using "-m mem" would
> have been a suspend via firmware, but that is still returning false.
>
> Is that expected?

Yes, if s2idle is the default on that platform.  You should be able to
switch over to S3 by writing "deep" into /sys/power/mem_sleep as long
as it is supported on that platform at all.

> I've only tried this on one platform so far, so might
> just be this particular one is missing a firmware capability.

You can check that by looking into /sys/power/mem_sleep (if there is
only "[s2idle]" in there, S3 is not supported).

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09 21:37                               ` Mario.Limonciello
@ 2019-05-09 21:54                                 ` Keith Busch
  2019-05-09 22:19                                   ` Mario.Limonciello
  0 siblings, 1 reply; 38+ messages in thread
From: Keith Busch @ 2019-05-09 21:54 UTC (permalink / raw)
  To: Mario.Limonciello
  Cc: kai.heng.feng, hch, axboe, sagi, rafael, linux-pm,
	rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch

On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com wrote:
> > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss)
> > +{
> > +	int ret;
> > +
> > +	mutex_lock(&ctrl->scan_lock);
> > +	nvme_start_freeze(ctrl);
> > +	nvme_wait_freeze(ctrl);
> > +	ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0,
> > +				NULL);
> > +	nvme_unfreeze(ctrl);
> > +	mutex_unlock(&ctrl->scan_lock);
> > +
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(nvme_set_power);
> 
> > I believe that without memory barriers at the end, disks with HMB will
> > still kernel panic (such as the Toshiba BG3).

Well, the mutex has an implied memory barrier, but your HMB explanation
doesn't make much sense to me anyway. The "mb()" in this thread's original
patch is a CPU memory barrier, and the CPU had better not be accessing
HMB memory. Is there something else going on here?
 
> This still allows D3, which we found at least fails to reach the deepest state and blocks
> platform S0ix for the following SSDs (maybe others):
> Hynix PC601
> LiteOn CL1

We usually write features to spec first, then quirk non-compliant
devices after.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09 21:54                                 ` Keith Busch
@ 2019-05-09 22:19                                   ` Mario.Limonciello
  2019-05-10  6:05                                     ` Kai-Heng Feng
  0 siblings, 1 reply; 38+ messages in thread
From: Mario.Limonciello @ 2019-05-09 22:19 UTC (permalink / raw)
  To: kbusch
  Cc: kai.heng.feng, hch, axboe, sagi, rafael, linux-pm,
	rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch

> -----Original Message-----
> From: Keith Busch <kbusch@kernel.org>
> Sent: Thursday, May 9, 2019 4:54 PM
> To: Limonciello, Mario
> Cc: kai.heng.feng@canonical.com; hch@lst.de; axboe@fb.com;
> sagi@grimberg.me; rafael@kernel.org; linux-pm@vger.kernel.org;
> rafael.j.wysocki@intel.com; linux-kernel@vger.kernel.org; linux-
> nvme@lists.infradead.org; keith.busch@intel.com
> Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on
> Suspend-to-Idle
> 
> 
> [EXTERNAL EMAIL]
> 
> On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com wrote:
> > > +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss)
> > > +{
> > > +	int ret;
> > > +
> > > +	mutex_lock(&ctrl->scan_lock);
> > > +	nvme_start_freeze(ctrl);
> > > +	nvme_wait_freeze(ctrl);
> > > +	ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0,
> > > +				NULL);
> > > +	nvme_unfreeze(ctrl);
> > > +	mutex_unlock(&ctrl->scan_lock);
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL_GPL(nvme_set_power);
> >
> > I believe that without memory barriers at the end, disks with HMB will
> > still kernel panic (such as the Toshiba BG3).
> 
> Well, the mutex has an implied memory barrier, but your HMB explanation
> doesn't make much sense to me anyway. The "mb()" in this thread's original
> patch is a CPU memory barrier, and the CPU had better not be accessing
> HMB memory. Is there something else going on here?

Kai Heng will need to speak up a bit in his time zone as he has this disk on hand,
but what I recall from our discussion was that a DMA MemRd operation after
resume was the source of the hang.

> 
> > This still allows D3, which we found at least fails to reach the deepest
> > state and blocks platform S0ix for the following SSDs (maybe others):
> > Hynix PC601
> > LiteOn CL1
> 
> We usually write features to spec first, then quirk non-compliant
> devices after.

The NVMe spec doesn't describe any relationship between Set Features w/
NVME_FEAT_POWER_MGMT and D3 support, nor an order of events.

This is why we opened a dialog with storage vendors, including contrasting the behavior
of Microsoft Windows inbox NVME driver and Intel's Windows RST driver.

Those two come to mind immediately because they were the most recently
tested to fail.  In our discussions, storage vendors overwhelmingly requested
that we not use D3 under S2I because their current firmware architectures won't
support it.

For example, one vendor told us that with their current implementation, receiving
D3hot after NVMe shutdown prevents the device from entering L1.2: D3hot entry is
handled by an IRQ handler that isn't serviced in the NVMe shutdown state.

Another vendor told us that with their current implementation it's impossible to
transition to PS4 (at least via APST) while L1.2 D3hot is active.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09 19:28                             ` Keith Busch
  2019-05-09 20:54                               ` Rafael J. Wysocki
  2019-05-09 21:37                               ` Mario.Limonciello
@ 2019-05-10  5:30                               ` Christoph Hellwig
  2019-05-10 13:51                                 ` Keith Busch
  2 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2019-05-10  5:30 UTC (permalink / raw)
  To: Keith Busch
  Cc: Mario.Limonciello, kai.heng.feng, hch, axboe, sagi, rafael,
	linux-pm, rafael.j.wysocki, linux-kernel, linux-nvme,
	keith.busch

> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss)
> +{
> +	int ret;
> +
> +	mutex_lock(&ctrl->scan_lock);
> +	nvme_start_freeze(ctrl);
> +	nvme_wait_freeze(ctrl);
> +	ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0,
> +				NULL);
> +	nvme_unfreeze(ctrl);
> +	mutex_unlock(&ctrl->scan_lock);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(nvme_set_power);

I think we should have this in the PCIe driver, given that while in
theory power states are generic, in practice they are only applicable
to PCIe.  To be revisited if history proves me wrong.

Also I don't see any reason why we'd need to do the freeze game on
resume.  Even on suspend it looks a little odd to me, as in theory
the PM core should have already put the system into a quiescent state.
But maybe we actually need it there, in which case a comment would
be helpful.

> +	if (!pm_suspend_via_firmware())

pm_suspend_via_firmware is a weird name and has absolutely zero
documentation.  So I think we really need a big fat comment with the
wisdom from this thread here.

> +		return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);

I think we need to skip this code path if NPSS is zero, as that
indicates that the device doesn't actually support power states, and
fall back to the full teardown.

Also I can't find anything except for this odd sentence in the spec:

  "Hosts that do not dynamically manage power should set the power
  state to the lowest numbered state that satisfies the PCI Express
  slot power limit control value."

that requires the power states to be ordered in any particular way.
So we probably have to read through the table at probing time and find
the lowest power state there.

Rafael also brought up the issue of not entering D3, and the somewhat
non-intuitive to me solution for it, so I'm not commenting on that
except for asking on a comment on that save_state call.

> +     if (!pm_suspend_via_firmware())
> +             return nvme_set_power(&ndev->ctrl, 0);

Don't we need to save the previous power state here and restore that?
For example if someone set a specific state through nvme-cli?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-09 22:19                                   ` Mario.Limonciello
@ 2019-05-10  6:05                                     ` Kai-Heng Feng
  2019-05-10  8:23                                       ` Rafael J. Wysocki
  2019-05-10 14:02                                       ` Keith Busch
  0 siblings, 2 replies; 38+ messages in thread
From: Kai-Heng Feng @ 2019-05-10  6:05 UTC (permalink / raw)
  To: Mario.Limonciello
  Cc: Keith Busch, hch, axboe, sagi, rafael, linux-pm,
	rafael.j.wysocki, linux-kernel, linux-nvme, keith.busch

at 06:19, <Mario.Limonciello@dell.com> <Mario.Limonciello@dell.com> wrote:

>> -----Original Message-----
>> From: Keith Busch <kbusch@kernel.org>
>> Sent: Thursday, May 9, 2019 4:54 PM
>> To: Limonciello, Mario
>> Cc: kai.heng.feng@canonical.com; hch@lst.de; axboe@fb.com;
>> sagi@grimberg.me; rafael@kernel.org; linux-pm@vger.kernel.org;
>> rafael.j.wysocki@intel.com; linux-kernel@vger.kernel.org; linux-
>> nvme@lists.infradead.org; keith.busch@intel.com
>> Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead  
>> of D3 on
>> Suspend-to-Idle
>>
>>
>> [EXTERNAL EMAIL]
>>
>> On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com  
>> wrote:
>>>> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss)
>>>> +{
>>>> +	int ret;
>>>> +
>>>> +	mutex_lock(&ctrl->scan_lock);
>>>> +	nvme_start_freeze(ctrl);
>>>> +	nvme_wait_freeze(ctrl);
>>>> +	ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0,
>>>> +				NULL);
>>>> +	nvme_unfreeze(ctrl);
>>>> +	mutex_unlock(&ctrl->scan_lock);
>>>> +
>>>> +	return ret;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(nvme_set_power);
>>>
>>> I believe without memory barriers at the end disks with HMB this will
>>> still kernel panic (Such as Toshiba BG3).
>>
>> Well, the mutex has an implied memory barrier, but your HMB explanation
>> doesn't make much sense to me anyway. The "mb()" in this thread's original
>> patch is a CPU memory barrier, and the CPU had better not be accessing
>> HMB memory. Is there something else going on here?
>
> Kai Heng will need to speak up a bit in his time zone as he has this disk  
> on hand,
> but what I recall from our discussion was that DMA operation MemRd after
> resume was the source of the hang.

Yes, that’s what I was told by the NVMe vendor, so all I know is to impose a
memory barrier.
If mb() shouldn’t be used here, what’s the correct variant to use in this
context?

>
>>> This still allows D3 which we found at least failed to go into deepest  
>>> state and
>> blocked
>>> platform s0ix for the following SSDs (maybe others):
>>> Hynix PC601
>>> LiteOn CL1
>>
>> We usually write features to spec first, then quirk non-compliant
>> devices after.
>
> NVME spec doesn't talk about a relationship between SetFeatures w/
> NVME_FEAT_POWER_MGMGT and D3 support, nor order of events.
>
> This is why we opened a dialog with storage vendors, including  
> contrasting the behavior
> of Microsoft Windows inbox NVME driver and Intel's Windows RST driver.
>
> Those two I mention that come to mind immediately because they were most  
> recently
> tested to fail.  Our discussion with storage vendors overwhelmingly  
> requested
> that we don't use D3 under S2I because their current firmware  
> architecture won't
> support it.
>
> For example one vendor told us with current implementation that receiving  
> D3hot
> after NVME shutdown will prevent being able to enter L1.2.  D3hot entry  
> was supported
> by an IRQ handler that isn't serviced in NVME shutdown state.
>
> Another vendor told us that with current implementation it's impossible  
> to transition
> to PS4 (at least via APST) while L1.2 D3hot is active.

I tested the patch from Keith and it has the same two issues as simply
skipping nvme_dev_disable():
1) It consumes more power in S2I
2) The system freezes after resume

Also I don’t think it’s a spec. It’s a guide to let OEMs/ODMs pick and
assemble Modern Standby compliant machines, so what the Windows NVMe driver
really does is still opaque to us.

Kai-Heng



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-10  6:05                                     ` Kai-Heng Feng
@ 2019-05-10  8:23                                       ` Rafael J. Wysocki
  2019-05-10 13:52                                         ` Keith Busch
  2019-05-10 15:15                                         ` Kai Heng Feng
  2019-05-10 14:02                                       ` Keith Busch
  1 sibling, 2 replies; 38+ messages in thread
From: Rafael J. Wysocki @ 2019-05-10  8:23 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Mario Limonciello, Keith Busch, Christoph Hellwig, Jens Axboe,
	Sagi Grimberg, Rafael J. Wysocki, Linux PM, Rafael Wysocki,
	Linux Kernel Mailing List, linux-nvme, Keith Busch

On Fri, May 10, 2019 at 8:08 AM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
>
> at 06:19, <Mario.Limonciello@dell.com> <Mario.Limonciello@dell.com> wrote:
>
> >> -----Original Message-----
> >> From: Keith Busch <kbusch@kernel.org>
> >> Sent: Thursday, May 9, 2019 4:54 PM
> >> To: Limonciello, Mario
> >> Cc: kai.heng.feng@canonical.com; hch@lst.de; axboe@fb.com;
> >> sagi@grimberg.me; rafael@kernel.org; linux-pm@vger.kernel.org;
> >> rafael.j.wysocki@intel.com; linux-kernel@vger.kernel.org; linux-
> >> nvme@lists.infradead.org; keith.busch@intel.com
> >> Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead
> >> of D3 on
> >> Suspend-to-Idle
> >>
> >>
> >> [EXTERNAL EMAIL]
> >>
> >> On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com
> >> wrote:
> >>>> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss)
> >>>> +{
> >>>> +  int ret;
> >>>> +
> >>>> +  mutex_lock(&ctrl->scan_lock);
> >>>> +  nvme_start_freeze(ctrl);
> >>>> +  nvme_wait_freeze(ctrl);
> >>>> +  ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0,
> >>>> +                          NULL);
> >>>> +  nvme_unfreeze(ctrl);
> >>>> +  mutex_unlock(&ctrl->scan_lock);
> >>>> +
> >>>> +  return ret;
> >>>> +}
> >>>> +EXPORT_SYMBOL_GPL(nvme_set_power);
> >>>
> >>> I believe without memory barriers at the end disks with HMB this will
> >>> still kernel panic (Such as Toshiba BG3).
> >>
> >> Well, the mutex has an implied memory barrier, but your HMB explanation
> >> doesn't make much sense to me anyway. The "mb()" in this thread's original
> >> patch is a CPU memory barrier, and the CPU had better not be accessing
> >> HMB memory. Is there something else going on here?
> >
> > Kai Heng will need to speak up a bit in his time zone as he has this disk
> > on hand,
> > but what I recall from our discussion was that DMA operation MemRd after
> > resume was the source of the hang.
>
> Yes, that’ what I was told by the NVMe vendor, so all I know is to impose a
> memory barrier.
> If mb() shouldn’t be used here, what’s the correct variant to use in this
> context?
>
> >
> >>> This still allows D3 which we found at least failed to go into deepest
> >>> state and
> >> blocked
> >>> platform s0ix for the following SSDs (maybe others):
> >>> Hynix PC601
> >>> LiteOn CL1
> >>
> >> We usually write features to spec first, then quirk non-compliant
> >> devices after.
> >
> > NVME spec doesn't talk about a relationship between SetFeatures w/
> > NVME_FEAT_POWER_MGMGT and D3 support, nor order of events.
> >
> > This is why we opened a dialog with storage vendors, including
> > contrasting the behavior
> > of Microsoft Windows inbox NVME driver and Intel's Windows RST driver.
> >
> > Those two I mention that come to mind immediately because they were most
> > recently
> > tested to fail.  Our discussion with storage vendors overwhelmingly
> > requested
> > that we don't use D3 under S2I because their current firmware
> > architecture won't
> > support it.
> >
> > For example one vendor told us with current implementation that receiving
> > D3hot
> > after NVME shutdown will prevent being able to enter L1.2.  D3hot entry
> > was supported
> > by an IRQ handler that isn't serviced in NVME shutdown state.
> >
> > Another vendor told us that with current implementation it's impossible
> > to transition
> > to PS4 (at least via APST) while L1.2 D3hot is active.
>
> I tested the patch from Keith and it has two issues just as simply skipping
> nvme_dev_disable():
> 1) It consumes more power in S2I
> 2) System freeze after resume

Well, Keith's patch doesn't prevent pci_pm_suspend_noirq() from
asking for D3, and both of the symptoms above may in principle be
consequences of that.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-10  5:30                               ` Christoph Hellwig
@ 2019-05-10 13:51                                 ` Keith Busch
  0 siblings, 0 replies; 38+ messages in thread
From: Keith Busch @ 2019-05-10 13:51 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Mario.Limonciello, kai.heng.feng, axboe, sagi, rafael, linux-pm,
	Wysocki, Rafael J, linux-kernel, linux-nvme, Busch, Keith

On Thu, May 09, 2019 at 10:30:52PM -0700, Christoph Hellwig wrote:
> Also I don't see any reason why we'd need to do the freeze game on
> resume.  

Right, definitely no reason for resume.

> Even on suspend it looks a little odd to me, as in theory
> the PM core should have already put the system into a quiescent state.
> But maybe we actually need it there, in which case a comment would
> be helpful.

I wasn't sure if suspend prevents a kthread or work queue from
running. For example, if the device sends a namespace notify AEN during
S2I, does the nvme scan_work run?

Since I wasn't sure, I took a paranoid approach to ensure nothing was
in flight, but I'd be happy if this is unnecessary.

> > +     if (!pm_suspend_via_firmware())
> > +             return nvme_set_power(&ndev->ctrl, 0);
> 
> Don't we need to save the previous power state here and restore that?
> For example if someone set a specific state through nvme-cli?

Sure, we can do that. It would have been super if the spec had this
Set Features command's CQE DW0 return the previous power state, so we
could set the new and save the old in a single command, but two commands
is just a minor inconvenience.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-10  8:23                                       ` Rafael J. Wysocki
@ 2019-05-10 13:52                                         ` Keith Busch
  2019-05-10 15:15                                         ` Kai Heng Feng
  1 sibling, 0 replies; 38+ messages in thread
From: Keith Busch @ 2019-05-10 13:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Kai-Heng Feng, Mario Limonciello, Christoph Hellwig, Jens Axboe,
	Sagi Grimberg, Linux PM, Wysocki, Rafael J,
	Linux Kernel Mailing List, linux-nvme, Busch, Keith

On Fri, May 10, 2019 at 01:23:11AM -0700, Rafael J. Wysocki wrote:
> On Fri, May 10, 2019 at 8:08 AM Kai-Heng Feng
> > I tested the patch from Keith and it has two issues just as simply skipping
> > nvme_dev_disable():
> > 1) It consumes more power in S2I
> > 2) System freeze after resume
> 
> Well, the Keith's patch doesn't prevent pci_pm_suspend_noirq() from
> asking for D3 and both of the symptoms above may be consequences of
> that in principle.

Right, I'll fix up the kernel's PCI D3 control and resend for
consideration.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-10  6:05                                     ` Kai-Heng Feng
  2019-05-10  8:23                                       ` Rafael J. Wysocki
@ 2019-05-10 14:02                                       ` Keith Busch
  2019-05-10 15:18                                         ` Kai Heng Feng
  1 sibling, 1 reply; 38+ messages in thread
From: Keith Busch @ 2019-05-10 14:02 UTC (permalink / raw)
  To: Kai-Heng Feng
  Cc: Mario.Limonciello, hch, axboe, sagi, rafael, linux-pm, Wysocki,
	Rafael J, linux-kernel, linux-nvme, Busch, Keith

On Thu, May 09, 2019 at 11:05:42PM -0700, Kai-Heng Feng wrote:
> Yes, that’ what I was told by the NVMe vendor, so all I know is to impose a  
> memory barrier.
> If mb() shouldn’t be used here, what’s the correct variant to use in this  
> context?

I'm afraid the requirement is still not clear to me. AFAIK, all our
barriers routines ensure data is visible either between CPUs, or between
CPU and devices. The CPU never accesses HMB memory, so there must be some
other reasoning if this barrier is a real requirement for this device.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-10  8:23                                       ` Rafael J. Wysocki
  2019-05-10 13:52                                         ` Keith Busch
@ 2019-05-10 15:15                                         ` Kai Heng Feng
  2019-05-10 15:36                                           ` Keith Busch
  1 sibling, 1 reply; 38+ messages in thread
From: Kai Heng Feng @ 2019-05-10 15:15 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Mario Limonciello, Keith Busch, Christoph Hellwig, Jens Axboe,
	Sagi Grimberg, Linux PM, Rafael Wysocki,
	Linux Kernel Mailing List, linux-nvme, Keith Busch



> On May 10, 2019, at 4:23 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> 
> On Fri, May 10, 2019 at 8:08 AM Kai-Heng Feng
> <kai.heng.feng@canonical.com> wrote:
>> 
>> at 06:19, <Mario.Limonciello@dell.com> <Mario.Limonciello@dell.com> wrote:
>> 
>>>> -----Original Message-----
>>>> From: Keith Busch <kbusch@kernel.org>
>>>> Sent: Thursday, May 9, 2019 4:54 PM
>>>> To: Limonciello, Mario
>>>> Cc: kai.heng.feng@canonical.com; hch@lst.de; axboe@fb.com;
>>>> sagi@grimberg.me; rafael@kernel.org; linux-pm@vger.kernel.org;
>>>> rafael.j.wysocki@intel.com; linux-kernel@vger.kernel.org; linux-
>>>> nvme@lists.infradead.org; keith.busch@intel.com
>>>> Subject: Re: [PATCH] nvme-pci: Use non-operational power state instead
>>>> of D3 on
>>>> Suspend-to-Idle
>>>> 
>>>> 
>>>> [EXTERNAL EMAIL]
>>>> 
>>>> On Thu, May 09, 2019 at 09:37:58PM +0000, Mario.Limonciello@dell.com
>>>> wrote:
>>>>>> +int nvme_set_power(struct nvme_ctrl *ctrl, unsigned npss)
>>>>>> +{
>>>>>> +  int ret;
>>>>>> +
>>>>>> +  mutex_lock(&ctrl->scan_lock);
>>>>>> +  nvme_start_freeze(ctrl);
>>>>>> +  nvme_wait_freeze(ctrl);
>>>>>> +  ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, npss, NULL, 0,
>>>>>> +                          NULL);
>>>>>> +  nvme_unfreeze(ctrl);
>>>>>> +  mutex_unlock(&ctrl->scan_lock);
>>>>>> +
>>>>>> +  return ret;
>>>>>> +}
>>>>>> +EXPORT_SYMBOL_GPL(nvme_set_power);
>>>>> 
>>>>> I believe without memory barriers at the end disks with HMB this will
>>>>> still kernel panic (Such as Toshiba BG3).
>>>> 
>>>> Well, the mutex has an implied memory barrier, but your HMB explanation
>>>> doesn't make much sense to me anyway. The "mb()" in this thread's original
>>>> patch is a CPU memory barrier, and the CPU had better not be accessing
>>>> HMB memory. Is there something else going on here?
>>> 
>>> Kai Heng will need to speak up a bit in his time zone as he has this disk
>>> on hand,
>>> but what I recall from our discussion was that DMA operation MemRd after
>>> resume was the source of the hang.
>> 
>> Yes, that’ what I was told by the NVMe vendor, so all I know is to impose a
>> memory barrier.
>> If mb() shouldn’t be used here, what’s the correct variant to use in this
>> context?
>> 
>>> 
>>>>> This still allows D3 which we found at least failed to go into deepest
>>>>> state and
>>>> blocked
>>>>> platform s0ix for the following SSDs (maybe others):
>>>>> Hynix PC601
>>>>> LiteOn CL1
>>>> 
>>>> We usually write features to spec first, then quirk non-compliant
>>>> devices after.
>>> 
>>> NVME spec doesn't talk about a relationship between SetFeatures w/
>>> NVME_FEAT_POWER_MGMGT and D3 support, nor order of events.
>>> 
>>> This is why we opened a dialog with storage vendors, including
>>> contrasting the behavior
>>> of Microsoft Windows inbox NVME driver and Intel's Windows RST driver.
>>> 
>>> Those two I mention that come to mind immediately because they were most
>>> recently
>>> tested to fail.  Our discussion with storage vendors overwhelmingly
>>> requested
>>> that we don't use D3 under S2I because their current firmware
>>> architecture won't
>>> support it.
>>> 
>>> For example one vendor told us with current implementation that receiving
>>> D3hot
>>> after NVME shutdown will prevent being able to enter L1.2.  D3hot entry
>>> was supported
>>> by an IRQ handler that isn't serviced in NVME shutdown state.
>>> 
>>> Another vendor told us that with current implementation it's impossible
>>> to transition
>>> to PS4 (at least via APST) while L1.2 D3hot is active.
>> 
>> I tested the patch from Keith and it has two issues just as simply skipping
>> nvme_dev_disable():
>> 1) It consumes more power in S2I
>> 2) System freeze after resume
> 
> Well, the Keith's patch doesn't prevent pci_pm_suspend_noirq() from
> asking for D3 and both of the symptoms above may be consequences of
> that in principle.

Sorry, I should mention that I use a slightly modified drivers/nvme/host/pci.c:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3e4fb891a95a..ece428ce6876 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -18,6 +18,7 @@
 #include <linux/mutex.h>
 #include <linux/once.h>
 #include <linux/pci.h>
+#include <linux/suspend.h>
 #include <linux/t10-pi.h>
 #include <linux/types.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
@@ -2833,6 +2834,11 @@ static int nvme_suspend(struct device *dev)
        struct pci_dev *pdev = to_pci_dev(dev);
        struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
+       if (!pm_suspend_via_firmware()) {
+               nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);
+               pci_save_state(pdev);
+       }
+
        nvme_dev_disable(ndev, true);
        return 0;
 }
@@ -2842,6 +2848,10 @@ static int nvme_resume(struct device *dev)
        struct pci_dev *pdev = to_pci_dev(dev);
        struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
+       if (!pm_resume_via_firmware()) {
+               return nvme_set_power(&ndev->ctrl, 0);
+       }
+
        nvme_reset_ctrl(&ndev->ctrl);
        return 0;
}

Is pci_save_state() here enough to prevent the device from entering D3?

Kai-Heng


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-10 14:02                                       ` Keith Busch
@ 2019-05-10 15:18                                         ` Kai Heng Feng
  2019-05-10 15:49                                           ` hch
  0 siblings, 1 reply; 38+ messages in thread
From: Kai Heng Feng @ 2019-05-10 15:18 UTC (permalink / raw)
  To: Keith Busch
  Cc: Mario.Limonciello, hch, axboe, sagi, rafael, linux-pm, Wysocki,
	Rafael J, linux-kernel, linux-nvme, Busch, Keith



> On May 10, 2019, at 10:02 PM, Keith Busch <kbusch@kernel.org> wrote:
> 
> On Thu, May 09, 2019 at 11:05:42PM -0700, Kai-Heng Feng wrote:
>> Yes, that’ what I was told by the NVMe vendor, so all I know is to impose a  
>> memory barrier.
>> If mb() shouldn’t be used here, what’s the correct variant to use in this  
>> context?
> 
> I'm afraid the requirement is still not clear to me. AFAIK, all our
> barriers routines ensure data is visible either between CPUs, or between
> CPU and devices. The CPU never accesses HMB memory, so there must be some
> other reasoning if this barrier is a real requirement for this device.

Sure, I’ll ask vendor what that MemRd is for.

Kai-Heng

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-10 15:15                                         ` Kai Heng Feng
@ 2019-05-10 15:36                                           ` Keith Busch
  0 siblings, 0 replies; 38+ messages in thread
From: Keith Busch @ 2019-05-10 15:36 UTC (permalink / raw)
  To: Kai Heng Feng
  Cc: Rafael J. Wysocki, Mario Limonciello, Christoph Hellwig,
	Jens Axboe, Sagi Grimberg, Linux PM, Rafael Wysocki,
	Linux Kernel Mailing List, linux-nvme, Keith Busch

On Fri, May 10, 2019 at 11:15:05PM +0800, Kai Heng Feng wrote:
> Sorry, I should mention that I use a slightly modified drivers/nvme/host/pci.c:
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 3e4fb891a95a..ece428ce6876 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -18,6 +18,7 @@
>  #include <linux/mutex.h>
>  #include <linux/once.h>
>  #include <linux/pci.h>
> +#include <linux/suspend.h>
>  #include <linux/t10-pi.h>
>  #include <linux/types.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
> @@ -2833,6 +2834,11 @@ static int nvme_suspend(struct device *dev)
>         struct pci_dev *pdev = to_pci_dev(dev);
>         struct nvme_dev *ndev = pci_get_drvdata(pdev);
>  
> +       if (!pm_suspend_via_firmware()) {
> +               nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);
> +               pci_save_state(pdev);
> +       }
> +
>         nvme_dev_disable(ndev, true);

This won't work because you're falling through to nvme_dev_disable after
setting the power, so the resume side would certainly fail.

This is what you'd want:

       if (!pm_suspend_via_firmware()) {
               pci_save_state(pdev);
               return nvme_set_power(&ndev->ctrl, ndev->ctrl.npss);
       }

>         return 0;
>  }
> @@ -2842,6 +2848,10 @@ static int nvme_resume(struct device *dev)
>         struct pci_dev *pdev = to_pci_dev(dev);
>         struct nvme_dev *ndev = pci_get_drvdata(pdev);
>  
> +       if (!pm_resume_via_firmware()) {
> +               return nvme_set_power(&ndev->ctrl, 0);
> +       }
> +
>         nvme_reset_ctrl(&ndev->ctrl);
>         return 0;
> }
> 
> Does pci_save_state() here enough to prevent the device enter to D3?

Yes, saving the state during suspend will prevent PCI PM from assuming
control over this device's power. It's a bit non-intuitive, as Christoph
mentioned, so we'll need to make that clear in the comments. For
reference, here's the relevant part in pci-driver.c:

---
static int pci_pm_suspend_noirq(struct device *dev)
{
...

	if (!pci_dev->state_saved) {
		pci_save_state(pci_dev);
		if (pci_power_manageable(pci_dev))
			pci_prepare_to_sleep(pci_dev);
	}
...
}
--

So by saving the state in our suspend callback, PCI will skip
pci_prepare_to_sleep(), which is most likely what's setting your device
to D3hot, undermining our NVMe power state setting.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
  2019-05-10 15:18                                         ` Kai Heng Feng
@ 2019-05-10 15:49                                           ` hch
  0 siblings, 0 replies; 38+ messages in thread
From: hch @ 2019-05-10 15:49 UTC (permalink / raw)
  To: Kai Heng Feng
  Cc: Keith Busch, Mario.Limonciello, hch, axboe, sagi, rafael,
	linux-pm, Wysocki, Rafael J, linux-kernel, linux-nvme, Busch,
	Keith

On Fri, May 10, 2019 at 11:18:52PM +0800, Kai Heng Feng wrote:
> > I'm afraid the requirement is still not clear to me. AFAIK, all our
> > barriers routines ensure data is visible either between CPUs, or between
> > CPU and devices. The CPU never accesses HMB memory, so there must be some
> > other reasoning if this barrier is a real requirement for this device.
> 
> Sure, I’ll ask vendor what that MemRd is for.

I'd like to understand this bug, but this thread leaves me a little
confused.  So we have an NVMe device with an HMB.

Something crashes - the kernel or the firmware?
When does it crash?  suspend or resume?

That crash seems to be related to a PCIe TLP that reads
memory from the host, probably due to the HMB.

But a device with an HMB has been told that it can access that memory
at any time.  So if TLPs to access RAM are not allowed in a given
suspend state, we'll have to tell the device to stop using the HMB.

So: which power states do not allow the device to DMA to / from host
memory?  How do we find out we are about to enter one of those from the
pm methods?  We'll then need to disable the HMB, which might suck
in terms of latency.

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2019-05-10 15:49 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-08 18:59 [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle Kai-Heng Feng
2019-05-08 19:15 ` Chaitanya Kulkarni
2019-05-08 19:16 ` Keith Busch
2019-05-08 19:30   ` Kai-Heng Feng
2019-05-08 19:38     ` Mario.Limonciello
2019-05-08 19:51       ` Christoph Hellwig
2019-05-08 20:28         ` Mario.Limonciello
2019-05-09  6:12           ` Christoph Hellwig
2019-05-09  6:48             ` Kai-Heng Feng
2019-05-09  6:52               ` Christoph Hellwig
2019-05-09  9:19                 ` Rafael J. Wysocki
2019-05-09  9:25                   ` Christoph Hellwig
2019-05-09 20:48                     ` Rafael J. Wysocki
2019-05-09  9:07               ` Rafael J. Wysocki
2019-05-09  9:42                 ` Kai-Heng Feng
2019-05-09  9:56                   ` Christoph Hellwig
2019-05-09 10:28                     ` Kai-Heng Feng
2019-05-09 10:31                       ` Christoph Hellwig
2019-05-09 11:59                         ` Kai-Heng Feng
2019-05-09 18:57                           ` Mario.Limonciello
2019-05-09 19:28                             ` Keith Busch
2019-05-09 20:54                               ` Rafael J. Wysocki
2019-05-09 21:16                                 ` Keith Busch
2019-05-09 21:39                                   ` Rafael J. Wysocki
2019-05-09 21:37                               ` Mario.Limonciello
2019-05-09 21:54                                 ` Keith Busch
2019-05-09 22:19                                   ` Mario.Limonciello
2019-05-10  6:05                                     ` Kai-Heng Feng
2019-05-10  8:23                                       ` Rafael J. Wysocki
2019-05-10 13:52                                         ` Keith Busch
2019-05-10 15:15                                         ` Kai Heng Feng
2019-05-10 15:36                                           ` Keith Busch
2019-05-10 14:02                                       ` Keith Busch
2019-05-10 15:18                                         ` Kai Heng Feng
2019-05-10 15:49                                           ` hch
2019-05-10  5:30                               ` Christoph Hellwig
2019-05-10 13:51                                 ` Keith Busch
2019-05-09 16:20                       ` Keith Busch

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).