* [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle
@ 2019-05-08 18:59 Kai-Heng Feng
  2019-05-08 19:15 ` Chaitanya Kulkarni
  2019-05-08 19:16 ` Keith Busch
  0 siblings, 2 replies; 38+ messages in thread
From: Kai-Heng Feng @ 2019-05-08 18:59 UTC (permalink / raw)
  To: keith.busch, axboe, hch, sagi
  Cc: linux-nvme, linux-kernel, Kai-Heng Feng, Mario Limonciello

Several NVMe drives consume a lot of power during Suspend-to-Idle.

According to the NVMe vendors, APST should be used, and two things
should be prevented:
- Controller shutdown (CC.SHN)
- PCI D3

I think that's because the Windows Modern Standby design document [1]
lists the following as a requirement:
"
Akin to AHCI PCIe SSDs, NVMe SSDs need to provide the host with a
non-operational power state that is comparable to DEVSLP (<5mW draw,
<100ms exit latency) in order to allow the host to perform appropriate
transitions into Modern Standby. Should the NVMe SSD not expose such a
non-operational power state, autonomous power state transitions (APST)
is the only other option to enter Modern Standby successfully.
"

D3 isn't mentioned at all. For some vendors D3 still puts the device
into a low power state, while others disable APST after the transition
to D3.

So instead of disabling the NVMe controller in the suspend callback,
do the following:
- Make sure all IRQs are quiesced
- Stop all queues
- Prevent the device from entering D3
- Use a memory barrier to make sure DMA operations are flushed on HMB
  devices
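Roughly, the steps above map onto the helpers used in the new suspend
path like this (a condensed sketch of nvme_do_suspend_to_idle() from
the pci.c hunk below; error handling and locking omitted):

```c
/* Sketch only: condensed from nvme_do_suspend_to_idle() below. */
pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;   /* prevent the PCI core from entering D3 */
nvme_stop_queues(&ndev->ctrl);            /* stop all I/O queues */
nvme_enter_deepest(&ndev->ctrl);          /* Set Features: deepest non-operational NPSS */
nvme_suspend_io_queues(ndev);             /* quiesce IRQs and tear down vectors */
mb();                                     /* make sure pending HMB DMA is flushed */
```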

This patch has been verified on NVMe drives from several vendors:
- Intel
- Hynix
- LiteOn
- Micron
- WDC
- Samsung
- Toshiba

With the new suspend routine, the laptop I tested draws roughly 0.8 W
under Suspend-to-Idle, and resume is faster than with the original
D3 scheme.

The only exception so far is the Toshiba XG5, which only works with the
old suspend routine, so introduce a new quirk for the XG5.
We are working with Toshiba to root cause the issue on the XG5.

[1] https://docs.microsoft.com/en-us/windows-hardware/design/device-experiences/part-selection#ssd-storage

Tested-by: Lucien Kao <Lucien_kao@compal.com>
Tested-by: Mice Lin <mice_lin@wistron.com>
Tested-by: Jason Tan <LiangChuan.Tan@wdc.com>
Tested-by: Cody Liu (codyliu) <codyliu@micron.com>
Tested-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Mario Limonciello <mario.limonciello@dell.com>
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
---
 drivers/nvme/host/core.c |   8 +++
 drivers/nvme/host/nvme.h |   8 +++
 drivers/nvme/host/pci.c  | 102 +++++++++++++++++++++++++++++++++++++--
 3 files changed, 115 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a6644a2c3ef7..929db749da98 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3903,6 +3903,14 @@ static inline void _nvme_check_size(void)
 	BUILD_BUG_ON(sizeof(struct nvme_directive_cmd) != 64);
 }
 
+void nvme_enter_deepest(struct nvme_ctrl *ctrl)
+{
+	int ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, ctrl->npss,
+				    NULL, 0, NULL);
+	if (ret)
+		dev_warn(ctrl->device, "failed to set deepest non-operational state (%d)\n", ret);
+}
+EXPORT_SYMBOL_GPL(nvme_enter_deepest);
 
 static int __init nvme_core_init(void)
 {
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 5ee75b5ff83f..ea40a83314ae 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -92,6 +92,11 @@ enum nvme_quirks {
 	 * Broken Write Zeroes.
 	 */
 	NVME_QUIRK_DISABLE_WRITE_ZEROES		= (1 << 9),
+
+	/*
+	 * Use D3 on S2I regardless of NPSS.
+	 */
+	NVME_QUIRK_USE_D3_ON_S2I		= (1 << 10),
 };
 
 /*
@@ -229,6 +234,7 @@ struct nvme_ctrl {
 	/* Power saving configuration */
 	u64 ps_max_latency_us;
 	bool apst_enabled;
+	bool suspend_to_idle;
 
 	/* PCIe only: */
 	u32 hmpre;
@@ -467,6 +473,8 @@ int nvme_delete_ctrl(struct nvme_ctrl *ctrl);
 int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page, u8 lsp,
 		void *log, size_t size, u64 offset);
 
+void nvme_enter_deepest(struct nvme_ctrl *ctrl);
+
 extern const struct attribute_group *nvme_ns_id_attr_groups[];
 extern const struct block_device_operations nvme_ns_head_ops;
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 3e4fb891a95a..dea42b41f9a8 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -23,6 +23,7 @@
 #include <linux/io-64-nonatomic-lo-hi.h>
 #include <linux/sed-opal.h>
 #include <linux/pci-p2pdma.h>
+#include <linux/suspend.h>
 
 #include "trace.h"
 #include "nvme.h"
@@ -2828,12 +2829,98 @@ static void nvme_remove(struct pci_dev *pdev)
 }
 
 #ifdef CONFIG_PM_SLEEP
+static void nvme_do_suspend_to_idle(struct pci_dev *pdev)
+{
+	struct nvme_dev *ndev = pci_get_drvdata(pdev);
+
+	pdev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
+	ndev->ctrl.suspend_to_idle = true;
+
+	nvme_start_freeze(&ndev->ctrl);
+	nvme_wait_freeze_timeout(&ndev->ctrl, NVME_IO_TIMEOUT);
+	nvme_stop_queues(&ndev->ctrl);
+
+	nvme_enter_deepest(&ndev->ctrl);
+
+	if (ndev->ctrl.queue_count > 0) {
+		nvme_disable_io_queues(ndev);
+		nvme_poll_irqdisable(&ndev->queues[0], -1);
+	}
+
+	nvme_suspend_io_queues(ndev);
+	nvme_suspend_queue(&ndev->queues[0]);
+	pci_free_irq_vectors(pdev);
+
+	blk_mq_tagset_busy_iter(&ndev->tagset, nvme_cancel_request, &ndev->ctrl);
+	blk_mq_tagset_busy_iter(&ndev->admin_tagset, nvme_cancel_request, &ndev->ctrl);
+
+	/* Make sure all HMB DMA operations are done */
+	mb();
+}
+
+static int nvme_do_resume_from_idle(struct pci_dev *pdev)
+{
+	struct nvme_dev *ndev = pci_get_drvdata(pdev);
+	int result;
+
+	pdev->dev_flags &= ~PCI_DEV_FLAGS_NO_D3;
+	ndev->ctrl.suspend_to_idle = false;
+
+	result = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
+	if (result < 0)
+		goto out;
+
+	result = nvme_pci_configure_admin_queue(ndev);
+	if (result)
+		goto out;
+
+	result = nvme_alloc_admin_tags(ndev);
+	if (result)
+		goto out;
+
+	ndev->ctrl.max_hw_sectors = NVME_MAX_KB_SZ << 1;
+	ndev->ctrl.max_segments = NVME_MAX_SEGS;
+	mutex_unlock(&ndev->shutdown_lock);
+
+	result = nvme_init_identify(&ndev->ctrl);
+	if (result)
+		goto out;
+
+	result = nvme_setup_io_queues(ndev);
+	if (result)
+		goto out;
+
+	if (ndev->online_queues < 2) {
+		dev_warn(ndev->ctrl.device, "IO queues not created\n");
+		nvme_kill_queues(&ndev->ctrl);
+		nvme_remove_namespaces(&ndev->ctrl);
+	} else {
+		nvme_start_queues(&ndev->ctrl);
+		nvme_wait_freeze(&ndev->ctrl);
+		nvme_dev_add(ndev);
+		nvme_unfreeze(&ndev->ctrl);
+	}
+
+	nvme_start_ctrl(&ndev->ctrl);
+
+	return 0;
+
+ out:
+	nvme_remove_dead_ctrl(ndev, result);
+	return result;
+}
+
 static int nvme_suspend(struct device *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
-	nvme_dev_disable(ndev, true);
+	if (IS_ENABLED(CONFIG_ACPI) && pm_suspend_via_s2idle() &&
+	    ndev->ctrl.npss && !(ndev->ctrl.quirks & NVME_QUIRK_USE_D3_ON_S2I))
+		nvme_do_suspend_to_idle(pdev);
+	else
+		nvme_dev_disable(ndev, true);
+
 	return 0;
 }
 
@@ -2841,9 +2928,16 @@ static int nvme_resume(struct device *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
+	int result = 0;
 
-	nvme_reset_ctrl(&ndev->ctrl);
-	return 0;
+	if (ndev->ctrl.suspend_to_idle) {
+		result = nvme_do_resume_from_idle(pdev);
+		if (result)
+			dev_warn(dev, "failed to resume from idle state (%d)\n", result);
+	} else
+		nvme_reset_ctrl(&ndev->ctrl);
+
+	return result;
 }
 #endif
 
@@ -2921,6 +3015,8 @@ static const struct pci_device_id nvme_id_table[] = {
 	{ PCI_VDEVICE(INTEL, 0x5845),	/* Qemu emulated controller */
 		.driver_data = NVME_QUIRK_IDENTIFY_CNS |
 				NVME_QUIRK_DISABLE_WRITE_ZEROES, },
+	{ PCI_DEVICE(0x1179, 0x0116),	/* Toshiba XG5 */
+		.driver_data = NVME_QUIRK_USE_D3_ON_S2I, },
 	{ PCI_DEVICE(0x1bb1, 0x0100),   /* Seagate Nytro Flash Storage */
 		.driver_data = NVME_QUIRK_DELAY_BEFORE_CHK_RDY, },
 	{ PCI_DEVICE(0x1c58, 0x0003),	/* HGST adapter */
-- 
2.17.1



Thread overview: 38+ messages
2019-05-08 18:59 [PATCH] nvme-pci: Use non-operational power state instead of D3 on Suspend-to-Idle Kai-Heng Feng
2019-05-08 19:15 ` Chaitanya Kulkarni
2019-05-08 19:16 ` Keith Busch
2019-05-08 19:30   ` Kai-Heng Feng
2019-05-08 19:38     ` Mario.Limonciello
2019-05-08 19:51       ` Christoph Hellwig
2019-05-08 20:28         ` Mario.Limonciello
2019-05-09  6:12           ` Christoph Hellwig
2019-05-09  6:48             ` Kai-Heng Feng
2019-05-09  6:52               ` Christoph Hellwig
2019-05-09  9:19                 ` Rafael J. Wysocki
2019-05-09  9:25                   ` Christoph Hellwig
2019-05-09 20:48                     ` Rafael J. Wysocki
2019-05-09  9:07               ` Rafael J. Wysocki
2019-05-09  9:42                 ` Kai-Heng Feng
2019-05-09  9:56                   ` Christoph Hellwig
2019-05-09 10:28                     ` Kai-Heng Feng
2019-05-09 10:31                       ` Christoph Hellwig
2019-05-09 11:59                         ` Kai-Heng Feng
2019-05-09 18:57                           ` Mario.Limonciello
2019-05-09 19:28                             ` Keith Busch
2019-05-09 20:54                               ` Rafael J. Wysocki
2019-05-09 21:16                                 ` Keith Busch
2019-05-09 21:39                                   ` Rafael J. Wysocki
2019-05-09 21:37                               ` Mario.Limonciello
2019-05-09 21:54                                 ` Keith Busch
2019-05-09 22:19                                   ` Mario.Limonciello
2019-05-10  6:05                                     ` Kai-Heng Feng
2019-05-10  8:23                                       ` Rafael J. Wysocki
2019-05-10 13:52                                         ` Keith Busch
2019-05-10 15:15                                         ` Kai Heng Feng
2019-05-10 15:36                                           ` Keith Busch
2019-05-10 14:02                                       ` Keith Busch
2019-05-10 15:18                                         ` Kai Heng Feng
2019-05-10 15:49                                           ` hch
2019-05-10  5:30                               ` Christoph Hellwig
2019-05-10 13:51                                 ` Keith Busch
2019-05-09 16:20                       ` Keith Busch
