* [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling
@ 2019-05-15 16:36 Keith Busch
  2019-05-15 16:36 ` [PATCH 2/6] nvme-pci: Don't disable on timeout in reset state Keith Busch
                   ` (6 more replies)
  0 siblings, 7 replies; 51+ messages in thread
From: Keith Busch @ 2019-05-15 16:36 UTC (permalink / raw)


If disabling the controller didn't start a freeze, such as when we disable
whilst in the connecting state, don't wait for the freeze to complete.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 2a8708c9ac18..d4e442160048 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2376,7 +2376,7 @@ static void nvme_pci_disable(struct nvme_dev *dev)
 
 static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 {
-	bool dead = true;
+	bool dead = true, freeze = false;
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 
 	mutex_lock(&dev->shutdown_lock);
@@ -2384,8 +2384,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 		u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
 		if (dev->ctrl.state == NVME_CTRL_LIVE ||
-		    dev->ctrl.state == NVME_CTRL_RESETTING)
+		    dev->ctrl.state == NVME_CTRL_RESETTING) {
+			freeze = true;
 			nvme_start_freeze(&dev->ctrl);
+		}
 		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
 			pdev->error_state  != pci_channel_io_normal);
 	}
@@ -2394,10 +2396,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	 * Give the controller a chance to complete all entered requests if
 	 * doing a safe shutdown.
 	 */
-	if (!dead) {
-		if (shutdown)
-			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
-	}
+	if (!dead && shutdown && freeze)
+		nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
 
 	nvme_stop_queues(&dev->ctrl);
 
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 2/6] nvme-pci: Don't disable on timeout in reset state
  2019-05-15 16:36 [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling Keith Busch
@ 2019-05-15 16:36 ` Keith Busch
  2019-05-16  3:07   ` Ming Lei
  2019-05-16  6:27   ` Christoph Hellwig
  2019-05-15 16:36 ` [PATCH 3/6] nvme-pci: Unblock reset_work on IO failure Keith Busch
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 51+ messages in thread
From: Keith Busch @ 2019-05-15 16:36 UTC (permalink / raw)


The driver doesn't dispatch commands that it needs to wait for in the reset
state anymore. If a timeout occurs in this state, the reset work is
already disabling the controller, so just reset the request's timer.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d4e442160048..c72755311ffa 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1298,13 +1298,14 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 		shutdown = true;
 		/* fall through */
 	case NVME_CTRL_CONNECTING:
-	case NVME_CTRL_RESETTING:
 		dev_warn_ratelimited(dev->ctrl.device,
 			 "I/O %d QID %d timeout, disable controller\n",
 			 req->tag, nvmeq->qid);
 		nvme_dev_disable(dev, shutdown);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		return BLK_EH_DONE;
+	case NVME_CTRL_RESETTING:
+		return BLK_EH_RESET_TIMER;
 	default:
 		break;
 	}
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 3/6] nvme-pci: Unblock reset_work on IO failure
  2019-05-15 16:36 [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling Keith Busch
  2019-05-15 16:36 ` [PATCH 2/6] nvme-pci: Don't disable on timeout in reset state Keith Busch
@ 2019-05-15 16:36 ` Keith Busch
  2019-05-16  3:13   ` Ming Lei
  2019-05-16  6:28   ` Christoph Hellwig
  2019-05-15 16:36 ` [PATCH 4/6] nvme-pci: Sync queues on reset Keith Busch
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 51+ messages in thread
From: Keith Busch @ 2019-05-15 16:36 UTC (permalink / raw)


The reset_work waits in the connecting state for previously queued IO to
complete before setting the controller to live. If these requests time
out, we won't be able to restart the controller because the reset_work
is already running, and any requeued IOs will block reset_work forever.

When connecting, shut down the controller to flush all entered requests
to a failed completion if a timeout occurs, and ensure the controller
can't transition to the live state after we've unblocked it. The driver
will instead unbind from the controller if we can't complete IO during
initialization.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c72755311ffa..8df176ffcbc1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1257,7 +1257,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	struct nvme_dev *dev = nvmeq->dev;
 	struct request *abort_req;
 	struct nvme_command cmd;
-	bool shutdown = false;
 	u32 csts = readl(dev->bar + NVME_REG_CSTS);
 
 	/* If PCI error recovery process is happening, we cannot reset or
@@ -1294,14 +1293,14 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	 * shutdown, so we return BLK_EH_DONE.
 	 */
 	switch (dev->ctrl.state) {
-	case NVME_CTRL_DELETING:
-		shutdown = true;
-		/* fall through */
 	case NVME_CTRL_CONNECTING:
+		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
+		/* fall through */
+	case NVME_CTRL_DELETING:
 		dev_warn_ratelimited(dev->ctrl.device,
 			 "I/O %d QID %d timeout, disable controller\n",
 			 req->tag, nvmeq->qid);
-		nvme_dev_disable(dev, shutdown);
+		nvme_dev_disable(dev, true);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
 		return BLK_EH_DONE;
 	case NVME_CTRL_RESETTING:
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 4/6] nvme-pci: Sync queues on reset
  2019-05-15 16:36 [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling Keith Busch
  2019-05-15 16:36 ` [PATCH 2/6] nvme-pci: Don't disable on timeout in reset state Keith Busch
  2019-05-15 16:36 ` [PATCH 3/6] nvme-pci: Unblock reset_work on IO failure Keith Busch
@ 2019-05-15 16:36 ` Keith Busch
  2019-05-16  3:34   ` Ming Lei
                     ` (2 more replies)
  2019-05-15 16:36 ` [PATCH 5/6] nvme: Export get and set features Keith Busch
                   ` (3 subsequent siblings)
  6 siblings, 3 replies; 51+ messages in thread
From: Keith Busch @ 2019-05-15 16:36 UTC (permalink / raw)


A controller with multiple namespaces may have multiple request_queues
with their own timeout work. If a live controller fails with IO
outstanding to different namespaces, each request queue may attempt to
disable and reset the controller in different threads. Synchronize each
queue prior to starting the controller to ensure that no previously
scheduled timeout work is still running.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 12 ++++++++++++
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  |  1 +
 3 files changed, 14 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d7de0642c832..a116ea037f83 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3880,6 +3880,18 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
 }
 EXPORT_SYMBOL_GPL(nvme_start_queues);
 
+
+void nvme_sync_queues(struct nvme_ctrl *ctrl)
+{
+	struct nvme_ns *ns;
+
+	down_read(&ctrl->namespaces_rwsem);
+	list_for_each_entry(ns, &ctrl->namespaces, list)
+		blk_sync_queue(ns->queue);
+	up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_sync_queues);
+
 /*
  * Check we didn't inadvertently grow the command structure sizes:
  */
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 5ee75b5ff83f..55553d293a98 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -441,6 +441,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 void nvme_stop_queues(struct nvme_ctrl *ctrl);
 void nvme_start_queues(struct nvme_ctrl *ctrl);
 void nvme_kill_queues(struct nvme_ctrl *ctrl);
+void nvme_sync_queues(struct nvme_ctrl *ctrl);
 void nvme_unfreeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze(struct nvme_ctrl *ctrl);
 void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8df176ffcbc1..599065ed6a32 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2492,6 +2492,7 @@ static void nvme_reset_work(struct work_struct *work)
 	 */
 	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
 		nvme_dev_disable(dev, false);
+	nvme_sync_queues(&dev->ctrl);
 
 	mutex_lock(&dev->shutdown_lock);
 	result = nvme_pci_enable(dev);
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 5/6] nvme: Export get and set features
  2019-05-15 16:36 [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling Keith Busch
                   ` (2 preceding siblings ...)
  2019-05-15 16:36 ` [PATCH 4/6] nvme-pci: Sync queues on reset Keith Busch
@ 2019-05-15 16:36 ` Keith Busch
  2019-05-16  6:26   ` Christoph Hellwig
  2019-05-16 13:47   ` Minwoo Im
  2019-05-15 16:36 ` [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend Keith Busch
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 51+ messages in thread
From: Keith Busch @ 2019-05-15 16:36 UTC (permalink / raw)


A future patch intends to make use of the features commands, so export
these functions. Since their implementations are identical except for the
opcode, provide a new convenience helper that implements both.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/core.c | 22 +++++++++++++++++++---
 drivers/nvme/host/nvme.h |  4 ++++
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a116ea037f83..f2411d50e764 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1113,15 +1113,15 @@ static struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl,
 	return id;
 }
 
-static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
-		      void *buffer, size_t buflen, u32 *result)
+static int nvme_features(struct nvme_ctrl *dev, u8 op, unsigned fid,
+		unsigned dword11, void *buffer, size_t buflen, u32 *result)
 {
 	struct nvme_command c;
 	union nvme_result res;
 	int ret;
 
 	memset(&c, 0, sizeof(c));
-	c.features.opcode = nvme_admin_set_features;
+	c.features.opcode = op;
 	c.features.fid = cpu_to_le32(fid);
 	c.features.dword11 = cpu_to_le32(dword11);
 
@@ -1132,6 +1132,22 @@ static int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword
 	return ret;
 }
 
+int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
+		      void *buffer, size_t buflen, u32 *result)
+{
+	return nvme_features(dev, nvme_admin_set_features, fid, dword11, buffer,
+			     buflen, result);
+}
+EXPORT_SYMBOL_GPL(nvme_set_features);
+
+int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
+		      void *buffer, size_t buflen, u32 *result)
+{
+	return nvme_features(dev, nvme_admin_get_features, fid, dword11, buffer,
+			     buflen, result);
+}
+EXPORT_SYMBOL_GPL(nvme_get_features);
+
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 {
 	u32 q_count = (*count - 1) | ((*count - 1) << 16);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 55553d293a98..90aa38c588ed 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -459,6 +459,10 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 		union nvme_result *result, void *buffer, unsigned bufflen,
 		unsigned timeout, int qid, int at_head,
 		blk_mq_req_flags_t flags, bool poll);
+int nvme_set_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
+		      void *buffer, size_t buflen, u32 *result);
+int nvme_get_features(struct nvme_ctrl *dev, unsigned fid, unsigned dword11,
+		      void *buffer, size_t buflen, u32 *result);
 int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count);
 void nvme_stop_keep_alive(struct nvme_ctrl *ctrl);
 int nvme_reset_ctrl(struct nvme_ctrl *ctrl);
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-15 16:36 [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling Keith Busch
                   ` (3 preceding siblings ...)
  2019-05-15 16:36 ` [PATCH 5/6] nvme: Export get and set features Keith Busch
@ 2019-05-15 16:36 ` Keith Busch
  2019-05-15 19:33   ` Mario.Limonciello
                     ` (2 more replies)
  2019-05-16  2:43 ` [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling Ming Lei
  2019-05-16  6:27 ` Christoph Hellwig
  6 siblings, 3 replies; 51+ messages in thread
From: Keith Busch @ 2019-05-15 16:36 UTC (permalink / raw)


The nvme pci driver prepares its devices for power loss during suspend
by shutting down the controllers. The power setting is deferred to
pci driver's power management before the platform removes power. The
suspend-to-idle mode, however, does not remove power.

NVMe devices that implement host managed power settings can achieve
lower power and better transition latencies than using generic PCI power
settings. Try to use this feature if the platform is not involved with
the suspend. If successful, restore the previous power state on resume.

Cc: Mario Limonciello <Mario.Limonciello at dell.com>
Cc: Kai Heng Feng <kai.heng.feng at canonical.com>
Signed-off-by: Keith Busch <keith.busch at intel.com>
---
v1 -> v2:

  Split prep patches for the get features

  Ensure queued and dispatched IO completes before attempting to set the low
  power state. This also required a sync to ensure that nothing timed
  out or reset the controller while we attempted the intermittent queue freeze.

  Disable HMB if enabled. It is not clear this should be necessary except
  through empirical reports.

 drivers/nvme/host/pci.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 599065ed6a32..42d5c6369803 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -18,6 +18,7 @@
 #include <linux/mutex.h>
 #include <linux/once.h>
 #include <linux/pci.h>
+#include <linux/suspend.h>
 #include <linux/t10-pi.h>
 #include <linux/types.h>
 #include <linux/io-64-nonatomic-lo-hi.h>
@@ -116,6 +117,7 @@ struct nvme_dev {
 	u32 cmbsz;
 	u32 cmbloc;
 	struct nvme_ctrl ctrl;
+	u32 last_ps;
 
 	mempool_t *iod_mempool;
 
@@ -2829,11 +2831,87 @@ static void nvme_remove(struct pci_dev *pdev)
 }
 
 #ifdef CONFIG_PM_SLEEP
+static int nvme_deep_state(struct nvme_dev *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev->dev);
+	struct nvme_ctrl *ctrl = &dev->ctrl;
+	int ret = -EBUSY;;
+
+	nvme_start_freeze(ctrl);
+	nvme_wait_freeze(ctrl);
+	nvme_sync_queues(ctrl);
+
+	if (ctrl->state != NVME_CTRL_LIVE &&
+	    ctrl->state != NVME_CTRL_ADMIN_ONLY)
+		goto unfreeze;
+
+	if (dev->host_mem_descs) {
+		ret = nvme_set_host_mem(dev, 0);
+		if (ret < 0)
+			goto unfreeze;
+	}
+
+	dev->last_ps = 0;
+	ret = nvme_get_features(ctrl, NVME_FEAT_POWER_MGMT, 0, NULL, 0,
+				&dev->last_ps);
+	if (ret < 0)
+		goto unfreeze;
+
+	ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, dev->ctrl.npss,
+				NULL, 0, NULL);
+	if (ret < 0)
+		goto unfreeze;
+
+	if (ret) {
+		/*
+		 * Clearing npss forces a controller reset on resume. The
+		 * correct value will be rediscovered then.
+		 */
+		ctrl->npss = 0;
+		nvme_dev_disable(dev, true);
+		ret = 0;
+		goto unfreeze;
+	}
+
+	/*
+	 * A saved state prevents pci pm from generically controlling the
+	 * device's power. We're using protocol specific settings so we don't
+	 * want pci interfering.
+	 */
+	pci_save_state(pdev);
+unfreeze:
+	nvme_unfreeze(ctrl);
+	return ret;
+}
+
+static int nvme_make_operational(struct nvme_dev *dev)
+{
+	struct nvme_ctrl *ctrl = &dev->ctrl;
+	int ret;
+
+	ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, dev->last_ps,
+				NULL, 0, NULL);
+	if (ret)
+		goto reset;
+
+	if (dev->host_mem_descs) {
+		ret = nvme_setup_host_mem(dev);
+		if (ret)
+			goto reset;
+	}
+	return 0;
+reset:
+	nvme_reset_ctrl(ctrl);
+	return 0;
+}
+
 static int nvme_suspend(struct device *dev)
 {
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
+	if (!pm_suspend_via_firmware() && ndev->ctrl.npss)
+		return nvme_deep_state(ndev);
 	nvme_dev_disable(ndev, true);
 	return 0;
 }
@@ -2843,6 +2921,8 @@ static int nvme_resume(struct device *dev)
 	struct pci_dev *pdev = to_pci_dev(dev);
 	struct nvme_dev *ndev = pci_get_drvdata(pdev);
 
+	if (!pm_resume_via_firmware() && ndev->ctrl.npss)
+		return nvme_make_operational(ndev);
 	nvme_reset_ctrl(&ndev->ctrl);
 	return 0;
 }
-- 
2.14.4

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-15 16:36 ` [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend Keith Busch
@ 2019-05-15 19:33   ` Mario.Limonciello
  2019-05-15 19:34     ` Keith Busch
  2019-05-16  6:25   ` Christoph Hellwig
  2019-05-16  9:29   ` Rafael J. Wysocki
  2 siblings, 1 reply; 51+ messages in thread
From: Mario.Limonciello @ 2019-05-15 19:33 UTC (permalink / raw)


> -----Original Message-----
> From: Keith Busch <keith.busch at intel.com>
> Sent: Wednesday, May 15, 2019 11:36 AM
> To: Christoph Hellwig; Sagi Grimberg; linux-nvme at lists.infradead.org
> Cc: Rafael Wysocki; Keith Busch; Limonciello, Mario; Kai Heng Feng
> Subject: [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
> 
> 
> [EXTERNAL EMAIL]
> 
> The nvme pci driver prepares its devices for power loss during suspend by
> shutting down the controllers. The power setting is deferred to pci driver's
> power management before the platform removes power. The suspend-to-idle
> mode, however, does not remove power.
> 
> NVMe devices that implement host managed power settings can achieve lower
> power and better transition latencies than using generic PCI power settings. Try
> to use this feature if the platform is not involved with the suspend. If successful,
> restore the previous power state on resume.
> 
> Cc: Mario Limonciello <Mario.Limonciello at dell.com>
> Cc: Kai Heng Feng <kai.heng.feng at canonical.com>
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---
> v1 -> v2:
> 
>   Split prep patches for the get features
> 
>   Ensure queued and dispatched IO completes before attempting to set the low
>   power state. This also required a sync to ensure that nothing timed
>   out or reset the controller while we attempted the intermittent queue freeze.
> 
>   Disable HMB if enabled. It is not clear this should be necessary except
>   through empirical reports.

Reviewing this, it looks like this series should have all the same important tenets
from the original functional patch, and looks promising to me.

I'll arrange some more testing on it.

> 
>  drivers/nvme/host/pci.c | 80
> +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 80 insertions(+)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index
> 599065ed6a32..42d5c6369803 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -18,6 +18,7 @@
>  #include <linux/mutex.h>
>  #include <linux/once.h>
>  #include <linux/pci.h>
> +#include <linux/suspend.h>
>  #include <linux/t10-pi.h>
>  #include <linux/types.h>
>  #include <linux/io-64-nonatomic-lo-hi.h> @@ -116,6 +117,7 @@ struct
> nvme_dev {
>  	u32 cmbsz;
>  	u32 cmbloc;
>  	struct nvme_ctrl ctrl;
> +	u32 last_ps;
> 
>  	mempool_t *iod_mempool;
> 
> @@ -2829,11 +2831,87 @@ static void nvme_remove(struct pci_dev *pdev)  }
> 
>  #ifdef CONFIG_PM_SLEEP
> +static int nvme_deep_state(struct nvme_dev *dev) {
> +	struct pci_dev *pdev = to_pci_dev(dev->dev);
> +	struct nvme_ctrl *ctrl = &dev->ctrl;
> +	int ret = -EBUSY;;

Looks like a small typographical error with the double ;;

> +
> +	nvme_start_freeze(ctrl);
> +	nvme_wait_freeze(ctrl);
> +	nvme_sync_queues(ctrl);
> +
> +	if (ctrl->state != NVME_CTRL_LIVE &&
> +	    ctrl->state != NVME_CTRL_ADMIN_ONLY)
> +		goto unfreeze;
> +
> +	if (dev->host_mem_descs) {
> +		ret = nvme_set_host_mem(dev, 0);
> +		if (ret < 0)
> +			goto unfreeze;
> +	}
> +
> +	dev->last_ps = 0;
> +	ret = nvme_get_features(ctrl, NVME_FEAT_POWER_MGMT, 0, NULL, 0,
> +				&dev->last_ps);
> +	if (ret < 0)
> +		goto unfreeze;
> +
> +	ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, dev-
> >ctrl.npss,
> +				NULL, 0, NULL);
> +	if (ret < 0)
> +		goto unfreeze;
> +
> +	if (ret) {
> +		/*
> +		 * Clearing npss forces a controller reset on resume. The
> +		 * correct value will be rediscovered then.
> +		 */
> +		ctrl->npss = 0;
> +		nvme_dev_disable(dev, true);
> +		ret = 0;
> +		goto unfreeze;
> +	}
> +
> +	/*
> +	 * A saved state prevents pci pm from generically controlling the
> +	 * device's power. We're using protocol specific settings so we don't
> +	 * want pci interfering.
> +	 */
> +	pci_save_state(pdev);
> +unfreeze:
> +	nvme_unfreeze(ctrl);
> +	return ret;
> +}
> +
> +static int nvme_make_operational(struct nvme_dev *dev) {
> +	struct nvme_ctrl *ctrl = &dev->ctrl;
> +	int ret;
> +
> +	ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, dev-
> >last_ps,
> +				NULL, 0, NULL);
> +	if (ret)
> +		goto reset;
> +
> +	if (dev->host_mem_descs) {
> +		ret = nvme_setup_host_mem(dev);
> +		if (ret)
> +			goto reset;
> +	}
> +	return 0;
> +reset:
> +	nvme_reset_ctrl(ctrl);
> +	return 0;
> +}
> +
>  static int nvme_suspend(struct device *dev)  {
>  	struct pci_dev *pdev = to_pci_dev(dev);
>  	struct nvme_dev *ndev = pci_get_drvdata(pdev);
> 
> +	if (!pm_suspend_via_firmware() && ndev->ctrl.npss)
> +		return nvme_deep_state(ndev);
>  	nvme_dev_disable(ndev, true);
>  	return 0;
>  }
> @@ -2843,6 +2921,8 @@ static int nvme_resume(struct device *dev)
>  	struct pci_dev *pdev = to_pci_dev(dev);
>  	struct nvme_dev *ndev = pci_get_drvdata(pdev);
> 
> +	if (!pm_resume_via_firmware() && ndev->ctrl.npss)
> +		return nvme_make_operational(ndev);
>  	nvme_reset_ctrl(&ndev->ctrl);
>  	return 0;
>  }
> --
> 2.14.4

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-15 19:33   ` Mario.Limonciello
@ 2019-05-15 19:34     ` Keith Busch
  2019-05-15 19:43       ` Mario.Limonciello
  0 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-05-15 19:34 UTC (permalink / raw)


On Wed, May 15, 2019@07:33:45PM +0000, Mario.Limonciello@dell.com wrote:
> > +static int nvme_deep_state(struct nvme_dev *dev) {
> > +	struct pci_dev *pdev = to_pci_dev(dev->dev);
> > +	struct nvme_ctrl *ctrl = &dev->ctrl;
> > +	int ret = -EBUSY;;
> 
> Looks like a small typographical error with the double ;;

Good eye. I won't respin for that just yet. :)

Just fyi, I accidentally didn't explicitly CC you on patches 1-5, and most
of those are necessary for patch 6/6 to compile. The whole series is
on the mailing list though.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-15 19:34     ` Keith Busch
@ 2019-05-15 19:43       ` Mario.Limonciello
  0 siblings, 0 replies; 51+ messages in thread
From: Mario.Limonciello @ 2019-05-15 19:43 UTC (permalink / raw)


> -----Original Message-----
> From: Keith Busch <kbusch at kernel.org>
> Sent: Wednesday, May 15, 2019 2:34 PM
> To: Limonciello, Mario
> Cc: keith.busch at intel.com; hch at lst.de; sagi at grimberg.me; linux-
> nvme at lists.infradead.org; rafael at kernel.org; kai.heng.feng at canonical.com
> Subject: Re: [PATCHv2 6/6] nvme-pci: Use host managed power state for
> suspend
> 
> 
> [EXTERNAL EMAIL]
> 
> On Wed, May 15, 2019 at 07:33:45PM +0000, Mario.Limonciello at dell.com
> wrote:
> > > +static int nvme_deep_state(struct nvme_dev *dev) {
> > > +	struct pci_dev *pdev = to_pci_dev(dev->dev);
> > > +	struct nvme_ctrl *ctrl = &dev->ctrl;
> > > +	int ret = -EBUSY;;
> >
> > Looks like a small typographical error with the double ;;
> 
> Good eye. I won't respin for that just yet. :)
> 
> Just fyi, I accidentally didn't explicitly CC you on patches 1-5, and most of those
> are necessary for patch 6/6 to compile. The whole series is on the mailing list
> though.

Yes I noticed this, and at least one of those needs stuff in -next.
Do you feel this is going to be valuable to test on something earlier via backports
(such as 4.19 LTS), or do you want to see results directly on -next?

The series pretty much applies to 4.19 after backporting 2d4bcc16481ebc395b2d6220804a1ef2531e5ede

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling
  2019-05-15 16:36 [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling Keith Busch
                   ` (4 preceding siblings ...)
  2019-05-15 16:36 ` [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend Keith Busch
@ 2019-05-16  2:43 ` Ming Lei
  2019-05-17 18:40   ` Keith Busch
  2019-05-16  6:27 ` Christoph Hellwig
  6 siblings, 1 reply; 51+ messages in thread
From: Ming Lei @ 2019-05-16  2:43 UTC (permalink / raw)


On Wed, May 15, 2019@10:36:20AM -0600, Keith Busch wrote:
> If disabling the controller didn't start a freeze, such as when we disable
> whilst in the connecting state, don't wait for the freeze to complete.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---
>  drivers/nvme/host/pci.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 2a8708c9ac18..d4e442160048 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2376,7 +2376,7 @@ static void nvme_pci_disable(struct nvme_dev *dev)
>  
>  static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  {
> -	bool dead = true;
> +	bool dead = true, freeze = false;
>  	struct pci_dev *pdev = to_pci_dev(dev->dev);
>  
>  	mutex_lock(&dev->shutdown_lock);
> @@ -2384,8 +2384,10 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  		u32 csts = readl(dev->bar + NVME_REG_CSTS);
>  
>  		if (dev->ctrl.state == NVME_CTRL_LIVE ||
> -		    dev->ctrl.state == NVME_CTRL_RESETTING)
> +		    dev->ctrl.state == NVME_CTRL_RESETTING) {
> +			freeze = true;
>  			nvme_start_freeze(&dev->ctrl);
> +		}
>  		dead = !!((csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY) ||
>  			pdev->error_state  != pci_channel_io_normal);
>  	}
> @@ -2394,10 +2396,8 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
>  	 * Give the controller a chance to complete all entered requests if
>  	 * doing a safe shutdown.
>  	 */
> -	if (!dead) {
> -		if (shutdown)
> -			nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
> -	}
> +	if (!dead && shutdown && freeze)
> +		nvme_wait_freeze_timeout(&dev->ctrl, NVME_IO_TIMEOUT);
>  
>  	nvme_stop_queues(&dev->ctrl);
>  

Looks fine:

Reviewed-by: Ming Lei <ming.lei at redhat.com>


Thanks,
Ming

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 2/6] nvme-pci: Don't disable on timeout in reset state
  2019-05-15 16:36 ` [PATCH 2/6] nvme-pci: Don't disable on timeout in reset state Keith Busch
@ 2019-05-16  3:07   ` Ming Lei
  2019-05-16 14:33     ` Keith Busch
  2019-05-16  6:27   ` Christoph Hellwig
  1 sibling, 1 reply; 51+ messages in thread
From: Ming Lei @ 2019-05-16  3:07 UTC (permalink / raw)


On Wed, May 15, 2019@10:36:21AM -0600, Keith Busch wrote:
> The driver doesn't dispatch commands that it needs to wait for in the reset
> state anymore. If a timeout occurs in this state, the reset work is
> already disabling the controller, so just reset the request's timer.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---
>  drivers/nvme/host/pci.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index d4e442160048..c72755311ffa 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1298,13 +1298,14 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  		shutdown = true;
>  		/* fall through */
>  	case NVME_CTRL_CONNECTING:
> -	case NVME_CTRL_RESETTING:
>  		dev_warn_ratelimited(dev->ctrl.device,
>  			 "I/O %d QID %d timeout, disable controller\n",
>  			 req->tag, nvmeq->qid);
>  		nvme_dev_disable(dev, shutdown);
>  		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
>  		return BLK_EH_DONE;
> +	case NVME_CTRL_RESETTING:
> +		return BLK_EH_RESET_TIMER;
>  	default:
>  		break;
>  	}

RESET follows controller shutdown(via nvme_dev_disable()), the only
possible timeout should be on admin requests staggered between shutdown
and changing to NVME_CTRL_CONNECTING, given admin queue isn't frozen.

And the admin queue should be fully workable after it is unquiesced
by nvme_alloc_admin_tags(), so if timeout happens after nvme_alloc_admin_tags(),
I guess these requests should be handled as in NVME_CTRL_CONNECTING.

Another related problem is about handling timeout in NVME_CTRL_CONNECTING, and
the following failure still can be observed:

[ 1078.775969] nvme nvme0: I/O 20 QID 0 timeout, disable controller
[ 1078.791730] nvme nvme0: Identify Controller failed (-4)
[ 1078.792538] nvme nvme0: Removing after probe failure status: -5


Thanks,
Ming

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 3/6] nvme-pci: Unblock reset_work on IO failure
  2019-05-15 16:36 ` [PATCH 3/6] nvme-pci: Unblock reset_work on IO failure Keith Busch
@ 2019-05-16  3:13   ` Ming Lei
  2019-05-16 14:14     ` Keith Busch
  2019-05-16  6:28   ` Christoph Hellwig
  1 sibling, 1 reply; 51+ messages in thread
From: Ming Lei @ 2019-05-16  3:13 UTC (permalink / raw)


On Wed, May 15, 2019@10:36:22AM -0600, Keith Busch wrote:
> The reset_work waits in the connecting state for previously queued IO to
> complete before setting the controller to live. If these requests time
> out, we won't be able to restart the controller because the reset_work
> is already running, and any requeued IOs will block reset_work forever.
> 
> When connecting, shut down the controller to flush all entered requests
> to a failed completion if a timeout occurs, and ensure the controller
> can't transition to the live state after we've unblocked it. The driver
> will instead unbind from the controller if we can't complete IO during
> initialization.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---
>  drivers/nvme/host/pci.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index c72755311ffa..8df176ffcbc1 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1257,7 +1257,6 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  	struct nvme_dev *dev = nvmeq->dev;
>  	struct request *abort_req;
>  	struct nvme_command cmd;
> -	bool shutdown = false;
>  	u32 csts = readl(dev->bar + NVME_REG_CSTS);
>  
>  	/* If PCI error recovery process is happening, we cannot reset or
> @@ -1294,14 +1293,14 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
>  	 * shutdown, so we return BLK_EH_DONE.
>  	 */
>  	switch (dev->ctrl.state) {
> -	case NVME_CTRL_DELETING:
> -		shutdown = true;
> -		/* fall through */
>  	case NVME_CTRL_CONNECTING:
> +		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
> +		/* fall through */
> +	case NVME_CTRL_DELETING:
>  		dev_warn_ratelimited(dev->ctrl.device,
>  			 "I/O %d QID %d timeout, disable controller\n",
>  			 req->tag, nvmeq->qid);
> -		nvme_dev_disable(dev, shutdown);
> +		nvme_dev_disable(dev, true);
>  		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
>  		return BLK_EH_DONE;
>  	case NVME_CTRL_RESETTING:

Then the controller is dead and can't work any more, together with data
loss. I guess this way is too violent from the user's point of view.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 4/6] nvme-pci: Sync queues on reset
  2019-05-15 16:36 ` [PATCH 4/6] nvme-pci: Sync queues on reset Keith Busch
@ 2019-05-16  3:34   ` Ming Lei
  2019-05-16  6:29   ` Christoph Hellwig
  2019-05-16 13:43   ` Minwoo Im
  2 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2019-05-16  3:34 UTC (permalink / raw)


On Wed, May 15, 2019@10:36:23AM -0600, Keith Busch wrote:
> A controller with multiple namespaces may have multiple request_queues
> with their own timeout work. If a live controller fails with IO
> outstanding to different namespaces, each request queue may attempt to
> disable and reset the controller in different threads. Synchronize each
> queue prior to starting the controller to ensure that no previously
> scheduled timeout work is still running.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---
>  drivers/nvme/host/core.c | 12 ++++++++++++
>  drivers/nvme/host/nvme.h |  1 +
>  drivers/nvme/host/pci.c  |  1 +
>  3 files changed, 14 insertions(+)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index d7de0642c832..a116ea037f83 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3880,6 +3880,18 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
>  }
>  EXPORT_SYMBOL_GPL(nvme_start_queues);
>  
> +
> +void nvme_sync_queues(struct nvme_ctrl *ctrl)
> +{
> +	struct nvme_ns *ns;
> +
> +	down_read(&ctrl->namespaces_rwsem);
> +	list_for_each_entry(ns, &ctrl->namespaces, list)
> +		blk_sync_queue(ns->queue);
> +	up_read(&ctrl->namespaces_rwsem);
> +}
> +EXPORT_SYMBOL_GPL(nvme_sync_queues);
> +
>  /*
>   * Check we didn't inadvertently grow the command structure sizes:
>   */
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 5ee75b5ff83f..55553d293a98 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -441,6 +441,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
>  void nvme_stop_queues(struct nvme_ctrl *ctrl);
>  void nvme_start_queues(struct nvme_ctrl *ctrl);
>  void nvme_kill_queues(struct nvme_ctrl *ctrl);
> +void nvme_sync_queues(struct nvme_ctrl *ctrl);
>  void nvme_unfreeze(struct nvme_ctrl *ctrl);
>  void nvme_wait_freeze(struct nvme_ctrl *ctrl);
>  void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 8df176ffcbc1..599065ed6a32 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2492,6 +2492,7 @@ static void nvme_reset_work(struct work_struct *work)
>  	 */
>  	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
>  		nvme_dev_disable(dev, false);
> +	nvme_sync_queues(&dev->ctrl);

Looks fine:

Reviewed-by: Ming Lei <ming.lei at redhat.com>

Thanks, 
Ming

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-15 16:36 ` [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend Keith Busch
  2019-05-15 19:33   ` Mario.Limonciello
@ 2019-05-16  6:25   ` Christoph Hellwig
  2019-05-16 14:24     ` Keith Busch
  2019-05-16  9:29   ` Rafael J. Wysocki
  2 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2019-05-16  6:25 UTC (permalink / raw)


>  
>  #ifdef CONFIG_PM_SLEEP
> +static int nvme_deep_state(struct nvme_dev *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev->dev);
> +	struct nvme_ctrl *ctrl = &dev->ctrl;
> +	int ret = -EBUSY;;
> +
> +	nvme_start_freeze(ctrl);
> +	nvme_wait_freeze(ctrl);
> +	nvme_sync_queues(ctrl);
> +
> +	if (ctrl->state != NVME_CTRL_LIVE &&
> +	    ctrl->state != NVME_CTRL_ADMIN_ONLY)
> +		goto unfreeze;
> +
> +	if (dev->host_mem_descs) {
> +		ret = nvme_set_host_mem(dev, 0);
> +		if (ret < 0)
> +			goto unfreeze;
> +	}

I've still not heard an explanation of anyone why we need to disable
the HMB here. Even if there are states where we have to disable it
we need to restrict it to just those system power states where we have
to, as reclaiming and recreating the HMB is very costly for the device.

> +	if (ret) {
> +		/*
> +		 * Clearing npss forces a controller reset on resume. The
> +		 * correct value will be rediscovered then.
> +		 */
> +		ctrl->npss = 0;
> +		nvme_dev_disable(dev, true);

Can't we just reuse the nvme_dev_disable call in the caller?

> +	if (dev->host_mem_descs) {
> +		ret = nvme_setup_host_mem(dev);
> +		if (ret)
> +			goto reset;
> +	}

This call could/should set the Memory Return bit in the Set Features
call to enable the HMB.

>  static int nvme_suspend(struct device *dev)
>  {
>  	struct pci_dev *pdev = to_pci_dev(dev);
>  	struct nvme_dev *ndev = pci_get_drvdata(pdev);
>  
> +	if (!pm_suspend_via_firmware() && ndev->ctrl.npss)
> +		return nvme_deep_state(ndev);

And this still needs a comment why we try to just meddle with the
power state and queues instead of turning the device off only if
not suspended via firmware.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 5/6] nvme: Export get and set features
  2019-05-15 16:36 ` [PATCH 5/6] nvme: Export get and set features Keith Busch
@ 2019-05-16  6:26   ` Christoph Hellwig
  2019-05-16 13:47   ` Minwoo Im
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2019-05-16  6:26 UTC (permalink / raw)


> +static int nvme_features(struct nvme_ctrl *dev, u8 op, unsigned fid,
> +		unsigned dword11, void *buffer, size_t buflen, u32 *result)

Nitpick: can we call this something like
__nvme_getset_features?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling
  2019-05-15 16:36 [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling Keith Busch
                   ` (5 preceding siblings ...)
  2019-05-16  2:43 ` [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling Ming Lei
@ 2019-05-16  6:27 ` Christoph Hellwig
  6 siblings, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2019-05-16  6:27 UTC (permalink / raw)


Looks good,

Reviewed-by: Christoph Hellwig <hch at lst.de>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 2/6] nvme-pci: Don't disable on timeout in reset state
  2019-05-15 16:36 ` [PATCH 2/6] nvme-pci: Don't disable on timeout in reset state Keith Busch
  2019-05-16  3:07   ` Ming Lei
@ 2019-05-16  6:27   ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2019-05-16  6:27 UTC (permalink / raw)


On Wed, May 15, 2019@10:36:21AM -0600, Keith Busch wrote:
> The driver doesn't dispatch commands that it needs to wait for in the reset
> state anymore. If a timeout occurs in this state, the reset work is
> already disabling the controller, so just reset the request's timer.

Looks good,

Reviewed-by: Christoph Hellwig <hch at lst.de>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 3/6] nvme-pci: Unblock reset_work on IO failure
  2019-05-15 16:36 ` [PATCH 3/6] nvme-pci: Unblock reset_work on IO failure Keith Busch
  2019-05-16  3:13   ` Ming Lei
@ 2019-05-16  6:28   ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2019-05-16  6:28 UTC (permalink / raw)


Looks good,

Reviewed-by: Christoph Hellwig <hch at lst.de>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 4/6] nvme-pci: Sync queues on reset
  2019-05-15 16:36 ` [PATCH 4/6] nvme-pci: Sync queues on reset Keith Busch
  2019-05-16  3:34   ` Ming Lei
@ 2019-05-16  6:29   ` Christoph Hellwig
  2019-05-16 14:08     ` Keith Busch
  2019-05-16 13:43   ` Minwoo Im
  2 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2019-05-16  6:29 UTC (permalink / raw)


On Wed, May 15, 2019@10:36:23AM -0600, Keith Busch wrote:
> A controller with multiple namespaces may have multiple request_queues
> with their own timeout work. If a live controller fails with IO
> outstanding to different namespaces, each request queue may attempt to
> disable and reset the controller in different threads. Synchronize each
> queue prior to starting the controller to ensure that no previously
> scheduled timeout work is still running.
> 
> Signed-off-by: Keith Busch <keith.busch at intel.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch at lst.de>

I wonder if we should move the timeouts to a per-tagset single threaded
workqueue in the block layer, so that we don't process them in parallel?
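
As a rough sketch of that idea (illustrative only; the timeout_wq field and
the exact hook points are assumptions, not existing block layer API), the
tag_set would get one ordered workqueue and every request_queue would queue
its timeout work there, so timeouts for queues sharing a controller are
handled one at a time:

	/* in blk_mq_alloc_tag_set(): one single-threaded queue per tag_set */
	set->timeout_wq = alloc_ordered_workqueue("blk_mq_timeout",
						  WQ_MEM_RECLAIM);
	if (!set->timeout_wq)
		return -ENOMEM;

	/* in the per-queue timeout timer callback, instead of kblockd: */
	queue_work(q->tag_set->timeout_wq, &q->timeout_work);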

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-15 16:36 ` [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend Keith Busch
  2019-05-15 19:33   ` Mario.Limonciello
  2019-05-16  6:25   ` Christoph Hellwig
@ 2019-05-16  9:29   ` Rafael J. Wysocki
  2019-05-16 14:26     ` Keith Busch
  2 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-05-16  9:29 UTC (permalink / raw)


On Wed, May 15, 2019@6:41 PM Keith Busch <keith.busch@intel.com> wrote:
>
> The nvme pci driver prepares its devices for power loss during suspend
> by shutting down the controllers. The power setting is deferred to
> pci driver's power management before the platform removes power. The
> suspend-to-idle mode, however, does not remove power.
>
> NVMe devices that implement host managed power settings can achieve
> lower power and better transition latencies than using generic PCI power
> settings. Try to use this feature if the platform is not involved with
> the suspend. If successful, restore the previous power state on resume.
>
> Cc: Mario Limonciello <Mario.Limonciello at dell.com>
> Cc: Kai Heng Feng <kai.heng.feng at canonical.com>
> Signed-off-by: Keith Busch <keith.busch at intel.com>
> ---
> v1 -> v2:
>
>   Split prep patches for the get features
>
>   Ensure queued and dispatched IO completes before attempting to set the low
>   power state. This also required a sync to ensure that nothing timed
>   out or reset the controller while we attempted the intermittent queue freeze.
>
>   Disable HMB if enabled. It is not clear this should be necessary except
>   through empirical reports.
>
>  drivers/nvme/host/pci.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 80 insertions(+)
>
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 599065ed6a32..42d5c6369803 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -18,6 +18,7 @@
>  #include <linux/mutex.h>
>  #include <linux/once.h>
>  #include <linux/pci.h>
> +#include <linux/suspend.h>
>  #include <linux/t10-pi.h>
>  #include <linux/types.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
> @@ -116,6 +117,7 @@ struct nvme_dev {
>         u32 cmbsz;
>         u32 cmbloc;
>         struct nvme_ctrl ctrl;
> +       u32 last_ps;
>
>         mempool_t *iod_mempool;
>
> @@ -2829,11 +2831,87 @@ static void nvme_remove(struct pci_dev *pdev)
>  }
>
>  #ifdef CONFIG_PM_SLEEP
> +static int nvme_deep_state(struct nvme_dev *dev)
> +{
> +       struct pci_dev *pdev = to_pci_dev(dev->dev);
> +       struct nvme_ctrl *ctrl = &dev->ctrl;
> +       int ret = -EBUSY;;
> +
> +       nvme_start_freeze(ctrl);
> +       nvme_wait_freeze(ctrl);
> +       nvme_sync_queues(ctrl);
> +
> +       if (ctrl->state != NVME_CTRL_LIVE &&
> +           ctrl->state != NVME_CTRL_ADMIN_ONLY)
> +               goto unfreeze;
> +
> +       if (dev->host_mem_descs) {
> +               ret = nvme_set_host_mem(dev, 0);
> +               if (ret < 0)
> +                       goto unfreeze;
> +       }
> +
> +       dev->last_ps = 0;
> +       ret = nvme_get_features(ctrl, NVME_FEAT_POWER_MGMT, 0, NULL, 0,
> +                               &dev->last_ps);
> +       if (ret < 0)
> +               goto unfreeze;
> +
> +       ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, dev->ctrl.npss,
> +                               NULL, 0, NULL);
> +       if (ret < 0)
> +               goto unfreeze;
> +
> +       if (ret) {
> +               /*
> +                * Clearing npss forces a controller reset on resume. The
> +                * correct value will be rediscovered then.
> +                */
> +               ctrl->npss = 0;
> +               nvme_dev_disable(dev, true);
> +               ret = 0;
> +               goto unfreeze;
> +       }
> +
> +       /*
> +        * A saved state prevents pci pm from generically controlling the
> +        * device's power. We're using protocol specific settings so we don't
> +        * want pci interfering.
> +        */
> +       pci_save_state(pdev);
> +unfreeze:
> +       nvme_unfreeze(ctrl);
> +       return ret;
> +}
> +
> +static int nvme_make_operational(struct nvme_dev *dev)
> +{
> +       struct nvme_ctrl *ctrl = &dev->ctrl;
> +       int ret;
> +
> +       ret = nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, dev->last_ps,
> +                               NULL, 0, NULL);
> +       if (ret)
> +               goto reset;
> +
> +       if (dev->host_mem_descs) {
> +               ret = nvme_setup_host_mem(dev);
> +               if (ret)
> +                       goto reset;
> +       }
> +       return 0;
> +reset:
> +       nvme_reset_ctrl(ctrl);
> +       return 0;
> +}
> +
>  static int nvme_suspend(struct device *dev)
>  {
>         struct pci_dev *pdev = to_pci_dev(dev);
>         struct nvme_dev *ndev = pci_get_drvdata(pdev);
>
> +       if (!pm_suspend_via_firmware() && ndev->ctrl.npss)
> +               return nvme_deep_state(ndev);
>         nvme_dev_disable(ndev, true);
>         return 0;
>  }
> @@ -2843,6 +2921,8 @@ static int nvme_resume(struct device *dev)
>         struct pci_dev *pdev = to_pci_dev(dev);
>         struct nvme_dev *ndev = pci_get_drvdata(pdev);
>
> +       if (!pm_resume_via_firmware() && ndev->ctrl.npss)
> +               return nvme_make_operational(ndev);

This is a bit problematic with respect to hibernation.

Sorry for failing to mention that before, it escaped me somehow.

So the problem is that this will be invoked as the ->restore()
callback and it need not be suitable for that depending on whether or
not the driver is present in the restore kernel.  [There is another
issue related to the fact that the last stage of hibernation doesn't
return "true" from pm_suspend_via_firmware(), but that needs to be
fixed elsewhere.]  If it is there, it will run nvme_deep_state(ndev),
so all should be fine, but if it isn't there, the PCI bus type will
apply the default handling to the device and that's to disable it via
do_pci_disable_device() in this particular case, as per
pci_pm_freeze(), so nvme_make_operational(ndev) should not be called
then AFAICS.

IMO it is better to just do the nvme_dev_disable()/nvme_reset_ctrl()
things in the hibernation path anyway, so I would rename the existing
nvme_suspend/resume() to nvme_simple_suspend/resume() (or similar),
respectively, and define new nvme_suspend/resume() to do something
like:

return pm_suspend_via_firmware() || !ndev->ctrl.npss ?
nvme_simple_suspend(dev) : nvme_deep_state(ndev);

and

return pm_resume_via_firmware() || !ndev->ctrl.npss ?
nvme_simple_resume(dev) : nvme_make_operational(ndev);

respectively.

>         nvme_reset_ctrl(&ndev->ctrl);
>         return 0;
>  }
> --

Then, you can populate nvme_dev_pm_ops as follows:

static const struct dev_pm_ops nvme_dev_pm_ops = {
    .suspend = nvme_suspend,
    .resume = nvme_resume,
    .freeze = nvme_simple_suspend,
    .thaw = nvme_simple_resume,
    .poweroff = nvme_simple_suspend,
    .restore = nvme_simple_resume,
};

and it should all work.
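
For completeness, a sketch of the resulting wrappers under that scheme
(assuming the current callbacks really are renamed to
nvme_simple_suspend()/nvme_simple_resume() as described above):

static int nvme_suspend(struct device *dev)
{
	struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));

	return pm_suspend_via_firmware() || !ndev->ctrl.npss ?
		nvme_simple_suspend(dev) : nvme_deep_state(ndev);
}

static int nvme_resume(struct device *dev)
{
	struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));

	return pm_resume_via_firmware() || !ndev->ctrl.npss ?
		nvme_simple_resume(dev) : nvme_make_operational(ndev);
}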

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 4/6] nvme-pci: Sync queues on reset
  2019-05-15 16:36 ` [PATCH 4/6] nvme-pci: Sync queues on reset Keith Busch
  2019-05-16  3:34   ` Ming Lei
  2019-05-16  6:29   ` Christoph Hellwig
@ 2019-05-16 13:43   ` Minwoo Im
  2 siblings, 0 replies; 51+ messages in thread
From: Minwoo Im @ 2019-05-16 13:43 UTC (permalink / raw)


This looks good to me.

Reviewed-by: Minwoo Im <minwoo.im.dev at gmail.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 5/6] nvme: Export get and set features
  2019-05-15 16:36 ` [PATCH 5/6] nvme: Export get and set features Keith Busch
  2019-05-16  6:26   ` Christoph Hellwig
@ 2019-05-16 13:47   ` Minwoo Im
  1 sibling, 0 replies; 51+ messages in thread
From: Minwoo Im @ 2019-05-16 13:47 UTC (permalink / raw)


This looks good to me.  Also fine with what Christoph has mentioned.

Reviewed-by: Minwoo Im <minwoo.im.dev at gmail.com>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 4/6] nvme-pci: Sync queues on reset
  2019-05-16  6:29   ` Christoph Hellwig
@ 2019-05-16 14:08     ` Keith Busch
  0 siblings, 0 replies; 51+ messages in thread
From: Keith Busch @ 2019-05-16 14:08 UTC (permalink / raw)


On Wed, May 15, 2019@11:29:59PM -0700, Christoph Hellwig wrote:
> On Wed, May 15, 2019@10:36:23AM -0600, Keith Busch wrote:
> > A controller with multiple namespaces may have multiple request_queues
> > with their own timeout work. If a live controller fails with IO
> > outstanding to different namespaces, each request queue may attempt to
> > disable and reset the controller in different threads. Synchronize each
> > queue prior to starting the controller to ensure that no previously
> > scheduled timeout work is still running.
> > 
> > Signed-off-by: Keith Busch <keith.busch at intel.com>
> 
> Looks good,
> 
> Reviewed-by: Christoph Hellwig <hch at lst.de>
> 
> I wonder if we should move the timeouts to a per-tagset single threaded
> workqueue in the block layer, so that we don't process them in parallel?

I thought that made sense too, and tried that last year:

  https://patchwork.kernel.org/patch/10532983/

There was something wrong with detecting idle queues, but taking a
second look now, I think that's fixable.

But this is a good patch too.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 3/6] nvme-pci: Unblock reset_work on IO failure
  2019-05-16  3:13   ` Ming Lei
@ 2019-05-16 14:14     ` Keith Busch
  2019-05-17  2:31       ` Ming Lei
  0 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-05-16 14:14 UTC (permalink / raw)


On Wed, May 15, 2019@08:13:35PM -0700, Ming Lei wrote:
> On Wed, May 15, 2019@10:36:22AM -0600, Keith Busch wrote:
> > +		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
> > +		/* fall through */
> > +	case NVME_CTRL_DELETING:
> >  		dev_warn_ratelimited(dev->ctrl.device,
> >  			 "I/O %d QID %d timeout, disable controller\n",
> >  			 req->tag, nvmeq->qid);
> > -		nvme_dev_disable(dev, shutdown);
> > +		nvme_dev_disable(dev, true);
> >  		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> >  		return BLK_EH_DONE;
> >  	case NVME_CTRL_RESETTING:
> 
> Then the controller is dead and can't work any more, together with data
> loss. I guess this way is too violent from the user's point of view.

Indeed, it is a bit harsh; however, it is definitely better than having a
stuck controller unable to make forward progress. We may be able to do
better, but I think this patch is a step in the right direction.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-16  6:25   ` Christoph Hellwig
@ 2019-05-16 14:24     ` Keith Busch
  2019-05-17  9:08       ` Christoph Hellwig
  0 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-05-16 14:24 UTC (permalink / raw)


On Wed, May 15, 2019@11:25:28PM -0700, Christoph Hellwig wrote:
> >  
> >  #ifdef CONFIG_PM_SLEEP
> > +static int nvme_deep_state(struct nvme_dev *dev)
> > +{
> > +	struct pci_dev *pdev = to_pci_dev(dev->dev);
> > +	struct nvme_ctrl *ctrl = &dev->ctrl;
> > +	int ret = -EBUSY;;
> > +
> > +	nvme_start_freeze(ctrl);
> > +	nvme_wait_freeze(ctrl);
> > +	nvme_sync_queues(ctrl);
> > +
> > +	if (ctrl->state != NVME_CTRL_LIVE &&
> > +	    ctrl->state != NVME_CTRL_ADMIN_ONLY)
> > +		goto unfreeze;
> > +
> > +	if (dev->host_mem_descs) {
> > +		ret = nvme_set_host_mem(dev, 0);
> > +		if (ret < 0)
> > +			goto unfreeze;
> > +	}
> 
> I've still not heard an explanation of anyone why we need to disable
> the HMB here. Even if there are states where we have to disable it
> we need to restrict it to just those system power states where we have
> to, as reclaiming and recreating the HMB is very costly for the device.

We're not really reclaiming the HMB here. We're just telling the
controller to stop using it for a moment, and we return it back to the
device in the same condition it left it via NVME_HOST_MEM_RETURN. That
should minimize bringup latency.
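
Concretely, re-enabling on resume would amount to something like the
following (sketch only, reusing the driver's existing nvme_set_host_mem()
helper and the NVME_HOST_MEM_* flags):

	/* hand the untouched buffer back and tell the device it is unchanged */
	ret = nvme_set_host_mem(dev, NVME_HOST_MEM_ENABLE | NVME_HOST_MEM_RETURN);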

But also I don't think we should have to do this, or would at least like
to see something more authoritative explaining which device or platform
power states require disabling HMB.

> > +	if (ret) {
> > +		/*
> > +		 * Clearing npss forces a controller reset on resume. The
> > +		 * correct value will be rediscovered then.
> > +		 */
> > +		ctrl->npss = 0;
> > +		nvme_dev_disable(dev, true);
> 
> Can't we just reuse the nvme_dev_disable call in the caller?

We could, but it looks less pleasant for the caller since it needs to handle
three return possibilities: success, disable, and abort suspend.
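
Roughly, the caller would then look something like this (sketch only, the
"1" return value is an invented marker just to show the three-way handling):

	ret = nvme_deep_state(ndev);
	if (ret < 0)
		return ret;			/* abort the suspend */
	if (ret == 1)				/* hypothetical "needs disable" result */
		nvme_dev_disable(ndev, true);	/* fall back to a full shutdown */
	return 0;				/* device left in its low power state */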


> > +	if (dev->host_mem_descs) {
> > +		ret = nvme_setup_host_mem(dev);
> > +		if (ret)
> > +			goto reset;
> > +	}
> 
> This call could/should set the Memory Return bit in the Set Features
> call to enable the HMB.

Yep.
 
> >  static int nvme_suspend(struct device *dev)
> >  {
> >  	struct pci_dev *pdev = to_pci_dev(dev);
> >  	struct nvme_dev *ndev = pci_get_drvdata(pdev);
> >  
> > +	if (!pm_suspend_via_firmware() && ndev->ctrl.npss)
> > +		return nvme_deep_state(ndev);
> 
> And this still needs a comment why we try to just meddle with the
> power state and queues instead of turning the device off only if
> not suspended via firmware.

Sure thing.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-16  9:29   ` Rafael J. Wysocki
@ 2019-05-16 14:26     ` Keith Busch
  2019-05-16 18:27       ` Kai-Heng Feng
  0 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-05-16 14:26 UTC (permalink / raw)


On Thu, May 16, 2019@02:29:10AM -0700, Rafael J. Wysocki wrote:
> This is a bit problematic with respect to hibernation.
> 
> Sorry for failing to mention that before, it escaped me somehow.
> 
> So the problem is that this will be invoked as the ->restore()
> callback and it need not be suitable for that depending on whether or
> not the driver is present in the restore kernel.  [There is another
> issue related to the fact that the last stage of hibernation doesn't
> return "true" from pm_suspend_via_firmware(), but that needs to be
> fixed elsewhere.]  If it is there, it will run nvme_deep_state(ndev),
> so all should be fine, but if it isn't there, the PCI bus type will
> apply the default handling to the device and that's to disable it via
> do_pci_disable_device() in this particular case, as per
> pci_pm_freeze(), so nvme_make_operational(ndev) should not be called
> then AFAICS.
> 
> IMO it is better to just do the nvme_dev_disable()/nvme_reset_ctrl()
> things in the hibernation path anyway, so I would rename the existing
> nvme_suspend/resume() to nvme_simple_suspend/resume() (or similar),
> respectively, and define new nvme_suspend/resume() to do something
> like:
> 
> return pm_suspend_via_firmware() || !ndev->ctrl.npss ?
> nvme_simple_suspend(dev) : nvme_deep_state(ndev);
> 
> and
> 
> return pm_resume_via_firmware() || !ndev->ctrl.npss ?
> nvme_simple_resume(dev) : nvme_make_operational(ndev);
> 
> respectively.
> 
> >         nvme_reset_ctrl(&ndev->ctrl);
> >         return 0;
> >  }
> > --
> 
> Then, you can populate nvme_dev_pm_ops as follows:
> 
> static const struct dev_pm_ops nvme_dev_pm_ops = {
>     .suspend = nvme_suspend,
>     .resume = nvme_resume,
>     .freeze = nvme_simple_suspend,
>     .thaw = nvme_simple_resume,
>     .poweroff = nvme_simple_suspend,
>     .restore = nvme_simple_resume,
> };
> 
> and it should all work.

Thanks for the pointers, I'll give that idea a shot.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 2/6] nvme-pci: Don't disable on timeout in reset state
  2019-05-16  3:07   ` Ming Lei
@ 2019-05-16 14:33     ` Keith Busch
  0 siblings, 0 replies; 51+ messages in thread
From: Keith Busch @ 2019-05-16 14:33 UTC (permalink / raw)


On Wed, May 15, 2019@08:07:09PM -0700, Ming Lei wrote:
> On Wed, May 15, 2019@10:36:21AM -0600, Keith Busch wrote:
> > The driver doesn't dispatch commands that it needs to wait for in the reset
> > state anymore. If a timeout occurs in this state, the reset work is
> > already disabling the controller, so just reset the request's timer.
> > 
> > Signed-off-by: Keith Busch <keith.busch at intel.com>
> > ---
> >  drivers/nvme/host/pci.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index d4e442160048..c72755311ffa 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1298,13 +1298,14 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
> >  		shutdown = true;
> >  		/* fall through */
> >  	case NVME_CTRL_CONNECTING:
> > -	case NVME_CTRL_RESETTING:
> >  		dev_warn_ratelimited(dev->ctrl.device,
> >  			 "I/O %d QID %d timeout, disable controller\n",
> >  			 req->tag, nvmeq->qid);
> >  		nvme_dev_disable(dev, shutdown);
> >  		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> >  		return BLK_EH_DONE;
> > +	case NVME_CTRL_RESETTING:
> > +		return BLK_EH_RESET_TIMER;
> >  	default:
> >  		break;
> >  	}
> 
> RESET follows controller shutdown (via nvme_dev_disable()), the only
> possible timeout should be on admin requests staggered between shutdown
> and changing to NVME_CTRL_CONNECTING, given admin queue isn't frozen.
> 
> And the admin queue should be fully workable after it is unquiesced
> by nvme_alloc_admin_tags(), so if timeout happens after nvme_alloc_admin_tags(),
> I guess these requests should be handled as in NVME_CTRL_CONNECTING.

Yep, the only timeouts here should be requests that we've already
reclaimed, or are about to reclaim, via nvme_dev_disable called
from either another timeout work or directly in the reset_work. And
nvme_dev_disable handles its timeout, so we don't need timeout work to
unblock it. Either way, we're never blocked in the RESETTING state.

> Another related problem is about handling timeout in NVME_CTRL_CONNECTING, and
> the following failure still can be observed:
> 
> [ 1078.775969] nvme nvme0: I/O 20 QID 0 timeout, disable controller
> [ 1078.791730] nvme nvme0: Identify Controller failed (-4)
> [ 1078.792538] nvme nvme0: Removing after probe failure status: -5

Right, we will fail the controller if it fails to produce a response to
any initialization command. It's either that, or try the same thing
again, but I haven't seen much support for doing the latter.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-16 14:26     ` Keith Busch
@ 2019-05-16 18:27       ` Kai-Heng Feng
  2019-05-16 18:33         ` Mario.Limonciello
  2019-05-16 20:24         ` Rafael J. Wysocki
  0 siblings, 2 replies; 51+ messages in thread
From: Kai-Heng Feng @ 2019-05-16 18:27 UTC (permalink / raw)


Hi Keith,

at 22:26, Keith Busch <kbusch@kernel.org> wrote:

> On Thu, May 16, 2019@02:29:10AM -0700, Rafael J. Wysocki wrote:
>> This is a bit problematic with respect to hibernation.
>>
>> Sorry for failing to mention that before, it escaped me somehow.
>>
>> So the problem is that this will be invoked as the ->restore()
>> callback and it need not be suitable for that depending on whether or
>> not the driver is present in the restore kernel.  [There is another
>> issue related to the fact that the last stage of hibernation doesn't
>> return "true" from pm_suspend_via_firmware(), but that needs to be
>> fixed elsewhere.]  If it is there, it will run nvme_deep_state(ndev),
>> so all should be fine, but if it isn't there, the PCI bus type will
>> apply the default handling to the device and that's to disable it via
>> do_pci_disable_device() in this particular case, as per
>> pci_pm_freeze(), so nvme_make_operational(ndev) should not be called
>> then AFAICS.
>>
>> IMO it is better to just do the nvme_dev_disable()/nvme_reset_ctrl()
>> things in the hibernation path anyway, so I would rename the existing
>> nvme_suspend/resume() to nvme_simple_suspend/resume() (or similar),
>> respectively, and define new nvme_suspend/resume() to do something
>> like:
>>
>> return pm_suspend_via_firmware() || !ndev->ctrl.npss ?
>> nvme_simple_suspend(dev) : nvme_deep_state(ndev);
>>
>> and
>>
>> return pm_resume_via_firmware() || !ndev->ctrl.npss ?
>> nvme_simple_resume(dev) : nvme_make_operational(ndev);
>>
>> respectively.
>>
>>>        nvme_reset_ctrl(&ndev->ctrl);
>>>         return 0;
>>>  }
>>> --
>>
>> Then, you can populate nvme_dev_pm_ops as follows:
>>
>> static const struct dev_pm_ops = {
>>     .suspend = nvme_suspend,
>>     .resume = nvme_resume,
>>     .freeze = nvme_simple_suspend,
>>     .thaw = nvme_simple_resume,
>>     .poweroff = nvme_simple_suspend,
>>     .restore = nvme_simple_resume,
>> };
>>
>> and it should all work.
>
> Thanks for the pointers, I'll give that idea a shot.

Thanks for your work, I've tested your patch series on top of nvme-5.2
branch. It works well once I apply flag PCI_DEV_FLAGS_NO_D3.
After some digging I think it's another bug though:

[   69.883743] PM: suspend entry (s2idle)
[   69.883746] PM: Syncing filesystems ... done.
[   69.892848] Freezing user space processes ... (elapsed 0.003 seconds)  
done.
[   69.896109] OOM killer disabled.
[   69.896111] Freezing remaining freezable tasks ... (elapsed 0.001  
seconds) done.
[   69.897432] printk: Suspending console(s) (use no_console_suspend to  
debug)
[   70.004120] wlp1s0: deauthenticating from d8:8f:76:7c:52:a9 by local  
choice (Reason: 3=DEAUTH_LEAVING)
[   70.007475] intel-lpss 0000:00:15.1: power state changed by ACPI to D0
[   70.114626] intel-lpss 0000:00:15.1: restoring config space at offset  
0x10 (was 0x4, writing 0x7f801004)
[   70.216084] e1000e: EEE TX LPI TIMER: 00000011
[   70.218418] intel-lpss 0000:00:15.0: power state changed by ACPI to D0
[   70.257955] e1000e 0000:00:1f.6: disabling bus mastering
[   70.325930] intel-lpss 0000:00:15.0: restoring config space at offset  
0x10 (was 0x4, writing 0x7f800004)
[   70.358779] nvme 0000:03:00.0: PCI PM: Suspend power state: D0
[   70.362587] i801_smbus 0000:00:1f.4: PCI PM: Suspend power state: D0
[   70.363712] e1000e 0000:00:1f.6: PME# enabled
[   70.366507] pcieport 0000:00:1d.4: PME# enabled
[   70.366512] pcieport 0000:00:1d.2: PME# enabled
[   70.366904] xhci_hcd 0000:00:14.0: PME# enabled
[   70.377166] snd_hda_intel 0000:00:1f.3: power state changed by ACPI to  
D3hot
[   70.377187] snd_hda_intel 0000:00:1f.3: PCI PM: Suspend power state: D3hot
[   70.385042] pcieport 0000:00:1d.2: PCI PM: Suspend power state: D3hot
[   70.385364] ath10k_pci 0000:01:00.0: PCI PM: Suspend power state: D3hot
[   70.385401] mei_me 0000:00:16.0: PCI PM: Suspend power state: D3hot
[   70.385445] intel_pch_thermal 0000:00:12.0: PCI PM: Suspend power state:  
D3hot
[   70.385980] intel-lpss 0000:00:15.1: power state changed by ACPI to D3cold
[   70.386229] e1000e 0000:00:1f.6: power state changed by ACPI to D3hot
[   70.386248] e1000e 0000:00:1f.6: PCI PM: Suspend power state: D3hot
[   70.386328] intel-lpss 0000:00:15.0: power state changed by ACPI to D3cold
[   70.386441] xhci_hcd 0000:00:14.0: power state changed by ACPI to D3hot
[   70.386500] xhci_hcd 0000:00:14.0: PCI PM: Suspend power state: D3hot
[   70.386536] intel-lpss 0000:00:15.1: PCI PM: Suspend power state: D3hot
[   70.386602] intel-lpss 0000:00:15.0: PCI PM: Suspend power state: D3hot
[   70.389034] pcieport 0000:00:1d.4: PCI PM: Suspend power state: D3hot
[   70.390059] xhci_hcd 0000:00:14.0: power state changed by ACPI to D0
[   70.390407] e1000e 0000:00:1f.6: power state changed by ACPI to D0
[   70.390420] intel-lpss 0000:00:15.0: power state changed by ACPI to D0
[   70.390504] intel-lpss 0000:00:15.1: power state changed by ACPI to D0
[   70.390934] snd_hda_intel 0000:00:1f.3: power state changed by ACPI to D0
[   70.411038] e1000e 0000:00:1f.6: restoring config space at offset 0x4  
(was 0x100000, writing 0x100002)
[   70.412050] pcieport 0000:00:1d.2: restoring config space at offset 0x2c  
(was 0x0, writing 0x0)
[   70.412348] pcieport 0000:00:1d.4: restoring config space at offset 0x2c  
(was 0x0, writing 0x0)
[   70.412637] pcieport 0000:00:1d.2: restoring config space at offset 0x28  
(was 0x0, writing 0x0)
[   70.412866] pcieport 0000:00:1d.4: restoring config space at offset 0x28  
(was 0x0, writing 0x0)
[   70.413097] pcieport 0000:00:1d.2: restoring config space at offset 0x24  
(was 0x1fff1, writing 0x1fff1)
[   70.413312] pcieport 0000:00:1d.4: restoring config space at offset 0x24  
(was 0x1fff1, writing 0x1fff1)
[   70.419876] intel-lpss 0000:00:15.0: restoring config space at offset  
0x10 (was 0x4, writing 0x7f800004)
[   70.420094] intel-lpss 0000:00:15.1: restoring config space at offset  
0x10 (was 0x4, writing 0x7f801004)
[   70.437001] rtsx_pci 0000:02:00.0: restoring config space at offset 0x4  
(was 0x100402, writing 0x100406)
[   70.451991] i801_smbus 0000:00:1f.4: PCI PM: Suspend power state: D0
[   70.474839] mei_me 0000:00:16.0: PCI PM: Suspend power state: D3hot
[   70.475511] snd_hda_intel 0000:00:1f.3: power state changed by ACPI to  
D3hot
[   70.475523] e1000e 0000:00:1f.6: power state changed by ACPI to D3hot
[   70.475535] e1000e 0000:00:1f.6: PCI PM: Suspend power state: D3hot
[   70.475537] snd_hda_intel 0000:00:1f.3: PCI PM: Suspend power state: D3hot
[   70.475543] i915 0000:00:02.0: PCI PM: Suspend power state: D3hot
[   70.492594] intel_pch_thermal 0000:00:12.0: PCI PM: Suspend power state:  
D3hot
[   70.492960] ath10k_pci 0000:01:00.0: PCI PM: Suspend power state: D3hot
[   70.493056] pcieport 0000:00:1d.2: PCI PM: Suspend power state: D3hot
[   70.493192] pcieport 0000:00:1d.0: PME# enabled
[   70.493358] intel-lpss 0000:00:15.1: power state changed by ACPI to D3cold
[   70.493650] intel-lpss 0000:00:15.0: power state changed by ACPI to D3cold
[   70.493747] intel-lpss 0000:00:15.1: PCI PM: Suspend power state: D3hot
[   70.493862] intel-lpss 0000:00:15.0: PCI PM: Suspend power state: D3hot
[   70.496908] nvme 0000:03:00.0: PCI PM: Suspend power state: D3hot

pci_pm_suspend_noirq() gets called twice:
[   70.358779] nvme 0000:03:00.0: PCI PM: Suspend power state: D0
[   70.496908] nvme 0000:03:00.0: PCI PM: Suspend power state: D3hot
So it's still being put to D3.

I'll check why this bug happens.

Kai-Heng

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-16 18:27       ` Kai-Heng Feng
@ 2019-05-16 18:33         ` Mario.Limonciello
  2019-05-16 19:38           ` Keith Busch
  2019-05-16 20:24         ` Rafael J. Wysocki
  1 sibling, 1 reply; 51+ messages in thread
From: Mario.Limonciello @ 2019-05-16 18:33 UTC (permalink / raw)


> -----Original Message-----
> From: Kai-Heng Feng <kai.heng.feng at canonical.com>
> Sent: Thursday, May 16, 2019 11:28 AM
> To: Keith Busch
> Cc: Rafael J. Wysocki; Busch, Keith; Christoph Hellwig; Sagi Grimberg; linux-
> nvme; Limonciello, Mario
> Subject: Re: [PATCHv2 6/6] nvme-pci: Use host managed power state for
> suspend
> 
> 
> [EXTERNAL EMAIL]
> 
> Hi Keith,
> 
>@22:26, Keith Busch <kbusch@kernel.org> wrote:
> 
> > On Thu, May 16, 2019@02:29:10AM -0700, Rafael J. Wysocki wrote:
> >> This is a bit problematic with respect to hibernation.
> >>
> >> Sorry for failing to mention that before, it escaped me somehow.
> >>
> >> So the problem is that this will be invoked as the ->restore()
> >> callback and it need not be suitable for that depending on whether or
> >> not the driver is present in the restore kernel.  [There is another
> >> issue related to the fact that the last stage of hibernation doesn't
> >> return "true" from pm_suspend_via_firmware(), but that needs to be
> >> fixed elsewhere.]  If it is there, it will run nvme_deep_state(ndev),
> >> so all should be fine, but if it isn't there, the PCI bus type will
> >> apply the default handling to the device and that's to disable it via
> >> do_pci_disable_device() in this particular case, as per
> >> pci_pm_freeze(), so nvme_make_operational(ndev) should not be called
> >> then AFAICS.
> >>
> >> IMO it is better to just do the nvme_dev_disable()/nvme_reset_ctrl()
> >> things in the hibernation path anyway, so I would rename the existing
> >> nvme_suspend/resume() to nvme_simple_suspend/resume() (or similar),
> >> respectively, and define new nvme_suspend/resume() to do something
> >> like:
> >>
> >> return pm_suspend_via_firmware() || !ndev->ctrl.npss ?
> >> nvme_simple_suspend(dev) : nvme_deep_state(ndev);
> >>
> >> and
> >>
> >> return pm_resume_via_firmware() || !ndev->ctrl.npss ?
> >> nvme_simple_resume(dev) : nvme_make_operational(ndev);
> >>
> >> respectively.
> >>
> >>>        nvme_reset_ctrl(&ndev->ctrl);
> >>>         return 0;
> >>>  }
> >>> --
> >>
> >> Then, you can populate nvme_dev_pm_ops as follows:
> >>
> >> static const struct dev_pm_ops = {
> >>     .suspend = nvme_suspend,
> >>     .resume = nvme_resume,
> >>     .freeze = nvme_simple_suspend,
> >>     .thaw = nvme_simple_resume,
> >>     .poweroff = nvme_simple_suspend,
> >>     .restore = nvme_simple_resume,
> >> };
> >>
> >> and it should all work.
> >
> > Thanks for the pointers, I'll give that idea a shot.
> 
> Thanks for your work, I've tested your patch series on top of nvme-5.2 branch. It
> works well once I apply flag PCI_DEV_FLAGS_NO_D3.
> After some digging I think it's another bug though:
> 
> [ full s2idle suspend log snipped; identical to the one quoted earlier in the thread ]
> 
> pci_pm_suspend_noirq() gets called twice:
> [   70.358779] nvme 0000:03:00.0: PCI PM: Suspend power state: D0
> [   70.496908] nvme 0000:03:00.0: PCI PM: Suspend power state: D3hot
> So it's still being put to D3.
> 
> I'll check why this bug happens.
> 
> Kai-Heng

Can you please try this patch:
https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1612437

It hasn't been submitted upstream yet, but should soon and I think it will fix this behavior.

Thanks,

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-16 18:33         ` Mario.Limonciello
@ 2019-05-16 19:38           ` Keith Busch
  2019-05-16 20:25             ` Rafael J. Wysocki
  0 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-05-16 19:38 UTC (permalink / raw)


On Thu, May 16, 2019@06:33:56PM +0000, Mario.Limonciello@dell.com wrote:
> > >> Then, you can populate nvme_dev_pm_ops as follows:
> > >>
> > >> static const struct dev_pm_ops = {
> > >>     .suspend = nvme_suspend,
> > >>     .resume = nvme_resume,
> > >>     .freeze = nvme_simple_suspend,
> > >>     .thaw = nvme_simple_resume,
> > >>     .poweroff = nvme_simple_suspend,
> > >>     .restore = nvme_simple_resume,
> > >> };
> > >>
> > >> and it should all work.
> > >
> > > Thanks for the pointers, I'll give that idea a shot.
> > 
> > Thanks for your work, I've tested your patch series on top of nvme-5.2 branch. It
> > works well once I apply flag PCI_DEV_FLAGS_NO_D3.
> > After some digging I think it's another bug though:

....

> > pci_pm_suspend_noirq() gets called twice:
> > [   70.358779] nvme 0000:03:00.0: PCI PM: Suspend power state: D0
> > [   70.496908] nvme 0000:03:00.0: PCI PM: Suspend power state: D3hot
> > > So it's still being put to D3.
> > >
> > > I'll check why this bug happens.
> > 
> > Kai-Heng
> 
> Can you please try this patch:
> https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1612437
> 
> It hasn't been submitted upstream yet, but should soon and I think it will fix this behavior.
> 
> Thanks,

If we're going to replace our SIMPLE_DEV_PM_OPS as Rafael suggests,
might as well add a .suspend_noirq callback. We can just save the
state again to work around this too.
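
For illustration, the workaround being floated is something like this (a
sketch, not a posted patch); re-saving config space keeps state_saved set,
so pci_pm_suspend_noirq() won't pick a D3 target for us:

static int nvme_suspend_noirq(struct device *dev)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	struct nvme_dev *ndev = pci_get_drvdata(pdev);

	/* Keep state_saved set so the PCI core leaves the device in D0 */
	if (!pm_suspend_via_firmware() && ndev->ctrl.npss)
		pci_save_state(pdev);
	return 0;
}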

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-16 18:27       ` Kai-Heng Feng
  2019-05-16 18:33         ` Mario.Limonciello
@ 2019-05-16 20:24         ` Rafael J. Wysocki
  1 sibling, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-05-16 20:24 UTC (permalink / raw)


On Thu, May 16, 2019 at 8:27 PM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
>
> Hi Keith,
>
>@22:26, Keith Busch <kbusch@kernel.org> wrote:
>
> > On Thu, May 16, 2019@02:29:10AM -0700, Rafael J. Wysocki wrote:
> >> This is a bit problematic with respect to hibernation.
> >>
> >> Sorry for failing to mention that before, it escaped me somehow.
> >>
> >> So the problem is that this will be invoked as the ->restore()
> >> callback and it need not be suitable for that depending on whether or
> >> not the driver is present in the restore kernel.  [There is another
> >> issue related to the fact that the last stage of hibernation doesn't
> >> return "true" from pm_suspend_via_firmware(), but that needs to be
> >> fixed elsewhere.]  If it is there, it will run nvme_deep_state(ndev),
> >> so all should be fine, but if it isn't there, the PCI bus type will
> >> apply the default handling to the device and that's to disable it via
> >> do_pci_disable_device() in this particular case, as per
> >> pci_pm_freeze(), so nvme_make_operational(ndev) should not be called
> >> then AFAICS.
> >>
> >> IMO it is better to just do the nvme_dev_disable()/nvme_reset_ctrl()
> >> things in the hibernation path anyway, so I would rename the existing
> >> nvme_suspend/resume() to nvme_simple_suspend/resume() (or similar),
> >> respectively, and define new nvme_suspend/resume() to do something
> >> like:
> >>
> >> return pm_suspend_via_firmware() || !ndev->ctrl.npss ?
> >> nvme_simple_suspend(dev) : nvme_deep_state(ndev);
> >>
> >> and
> >>
> >> return pm_resume_via_firmware() || !ndev->ctrl.npss ?
> >> nvme_simple_resume(dev) : nvme_make_operational(ndev);
> >>
> >> respectively.
> >>
> >>>        nvme_reset_ctrl(&ndev->ctrl);
> >>>         return 0;
> >>>  }
> >>> --
> >>
> >> Then, you can populate nvme_dev_pm_ops as follows:
> >>
> >> static const struct dev_pm_ops = {
> >>     .suspend = nvme_suspend,
> >>     .resume = nvme_resume,
> >>     .freeze = nvme_simple_suspend,
> >>     .thaw = nvme_simple_resume,
> >>     .poweroff = nvme_simple_suspend,
> >>     .restore = nvme_simple_resume,
> >> };
> >>
> >> and it should all work.
> >
> > Thanks for the pointers, I'll give that idea a shot.
>
> Thanks for your work, I've tested your patch series on top of nvme-5.2
> branch. It works well once I apply flag PCI_DEV_FLAGS_NO_D3.
> After some digging I think it's another bug though:
>
> [ full s2idle suspend log snipped; identical to the one quoted earlier in the thread ]
>
> pci_pm_suspend_noirq() gets called twice:

I don't see how that can happen.

pci_pm_suspend_noirq() is only called as the ->suspend_noirq callback
by the PCI bus type and that occurs exactly once per suspend
transition for any given device.

And Keith's patches don't touch the PCI bus type AFAICS.

> [   70.358779] nvme 0000:03:00.0: PCI PM: Suspend power state: D0
> [   70.496908] nvme 0000:03:00.0: PCI PM: Suspend power state: D3hot

> So it's still being put to D3.

Are you sure that there are no extra changes in your kernel?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-16 19:38           ` Keith Busch
@ 2019-05-16 20:25             ` Rafael J. Wysocki
  2019-05-16 20:39               ` Keith Busch
  0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-05-16 20:25 UTC (permalink / raw)


On Thu, May 16, 2019@9:43 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Thu, May 16, 2019@06:33:56PM +0000, Mario.Limonciello@dell.com wrote:
> > > >> Then, you can populate nvme_dev_pm_ops as follows:
> > > >>
> > > >> static const struct dev_pm_ops = {
> > > >>     .suspend = nvme_suspend,
> > > >>     .resume = nvme_resume,
> > > >>     .freeze = nvme_simple_suspend,
> > > >>     .thaw = nvme_simple_resume,
> > > >>     .poweroff = nvme_simple_suspend,
> > > >>     .restore = nvme_simple_resume,
> > > >> };
> > > >>
> > > >> and it should all work.
> > > >
> > > > Thanks for the pointers, I'll give that idea a shot.
> > >
> > > Thanks for your work, I've tested your patch series on top of nvme-5.2 branch. It
> > > works well once I apply flag PCI_DEV_FLAGS_NO_D3.
> > > After some digging I think it's another bug though:
>
> ....
>
> > > pci_pm_suspend_noirq() gets called twice:
> > > [   70.358779] nvme 0000:03:00.0: PCI PM: Suspend power state: D0
> > > [   70.496908] nvme 0000:03:00.0: PCI PM: Suspend power state: D3hot
> > > So it's still being put to D3.
> > >
> > > I'll check why this bug happens.
> > >
> > > Kai-Heng
> >
> > Can you please try this patch:
> > https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1612437
> >
> > It hasn't been submitted upstream yet, but should soon and I think it will fix this behavior.

I wonder how.

>
> If we're going to replace our SIMPLE_DEV_PM_OPS as Rafael suggests,
> might as well add a .suspend_noirq callback. We can just save the
> state again to work around this too.

Relax, pretty please.  Saving the state again shouldn't be necessary.

Let's first understand what's going on.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-16 20:25             ` Rafael J. Wysocki
@ 2019-05-16 20:39               ` Keith Busch
  2019-05-16 20:56                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-05-16 20:39 UTC (permalink / raw)


On Thu, May 16, 2019@10:25:47PM +0200, Rafael J. Wysocki wrote:
> On Thu, May 16, 2019@9:43 PM Keith Busch <kbusch@kernel.org> wrote:
> > If we're going to replace our SIMPLE_DEV_PM_OPS as Rafael suggests,
> > might as well add a .suspend_noirq callback. We can just save the
> > state again to work around this too.
> 
> Relax, pretty please.  Saving the state again shouldn't be necessary.
> 
> Let's first understand what's going on.

Hah, okay fair enough.

FWIW, I've tried current mainline on two different platforms and I see
only the expected number of calls to pci_pm_suspend_noirq, so everything
works for me.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-16 20:39               ` Keith Busch
@ 2019-05-16 20:56                 ` Rafael J. Wysocki
  2019-05-17  8:39                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-05-16 20:56 UTC (permalink / raw)


On Thu, May 16, 2019@10:45 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Thu, May 16, 2019@10:25:47PM +0200, Rafael J. Wysocki wrote:
> > On Thu, May 16, 2019@9:43 PM Keith Busch <kbusch@kernel.org> wrote:
> > > If we're going to replace our SIMPLE_DEV_PM_OPS as Rafael suggests,
> > > might as well add a .suspend_noirq callback. We can just save the
> > > state again to work around this too.
> >
> > Relax, pretty please.  Saving the state again shouldn't be necessary.
> >
> > Let's first understand what's going on.
>
> Hah, okay fair enough.
>
> FWIW, I've tried current mainline on two different platforms and I see
> only the expected number of calls to pci_pm_suspend_noirq, so everything
> works for me.

And I don't see why it might not work anywhere else other than a
kernel source hacked into pieces or a platform with fundamental
problems. :-)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 3/6] nvme-pci: Unblock reset_work on IO failure
  2019-05-16 14:14     ` Keith Busch
@ 2019-05-17  2:31       ` Ming Lei
  0 siblings, 0 replies; 51+ messages in thread
From: Ming Lei @ 2019-05-17  2:31 UTC (permalink / raw)


On Thu, May 16, 2019@08:14:36AM -0600, Keith Busch wrote:
> On Wed, May 15, 2019@08:13:35PM -0700, Ming Lei wrote:
> > On Wed, May 15, 2019@10:36:22AM -0600, Keith Busch wrote:
> > > +		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
> > > +		/* fall through */
> > > +	case NVME_CTRL_DELETING:
> > >  		dev_warn_ratelimited(dev->ctrl.device,
> > >  			 "I/O %d QID %d timeout, disable controller\n",
> > >  			 req->tag, nvmeq->qid);
> > > -		nvme_dev_disable(dev, shutdown);
> > > +		nvme_dev_disable(dev, true);
> > >  		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
> > >  		return BLK_EH_DONE;
> > >  	case NVME_CTRL_RESETTING:
> > 
> > Then the controller is dead, and can't work any more together with data
> > loss. I guess this way is too violent from user view.
> 
> Indeed, it is a bit harsh; however, it is definitely better than having a
> stuck controller unable to make forward progress.

The controller may be stuck at that exact moment, and in theory any sane
hardware should be capable of being reset to its normal state by
software.

The current issue is that the NVMe driver is stuck when a timeout happens
during reset.

> We may be able to do
> better, but I think this patch is a step in the right direction.

Fair enough, this patch at least keeps the NVMe driver from getting stuck,
at the cost of data loss and device removal, so

	Reviewed-by: Ming Lei <ming.lei at redhat.com>

And we might have to support timeouts during reset in the future to make
the NVMe system more reliable, since timeout handling is the final guard.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-16 20:56                 ` Rafael J. Wysocki
@ 2019-05-17  8:39                   ` Rafael J. Wysocki
  2019-05-17  9:05                     ` Christoph Hellwig
  2019-05-17  9:22                     ` Kai-Heng Feng
  0 siblings, 2 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-05-17  8:39 UTC (permalink / raw)


On Thu, May 16, 2019@10:56 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Thu, May 16, 2019@10:45 PM Keith Busch <kbusch@kernel.org> wrote:
> >
> > On Thu, May 16, 2019@10:25:47PM +0200, Rafael J. Wysocki wrote:
> > > On Thu, May 16, 2019@9:43 PM Keith Busch <kbusch@kernel.org> wrote:
> > > > If we're going to replace our SIMPLE_DEV_PM_OPS as Rafael suggests,
> > > > might as well add a .suspend_noirq callback. We can just save the
> > > > state again to work around this too.
> > >
> > > Relax, pretty please.  Saving the state again shouldn't be necessary.
> > >
> > > Let's first understand what's going on.
> >
> > Hah, okay fair enough.
> >
> > FWIW, I've tried current mainline on two different platforms and I see
> > only the expected number of calls to pci_pm_suspend_noirq, so everything
> > works for me.
>
> And I don't see why it might not work anywhere else other than a
> kernel source hacked into pieces or a platform with fundamental
> problems. :-)

I forgot about one thing which is relevant here, sorry about that.

If there is a spurious EC wakeup during s2idle, pci_pm_suspend_noirq()
*will* be run again after pci_pm_resume_noirq() without going through
the entire system resume and next suspend.  In that case the second
iteration of pci_pm_suspend_noirq() will put the device into D3, if
that's the target power state of the device as determined by
pci_prepare_to_sleep(), because pci_pm_resume_noirq() calls
pci_restore_state() which clears state_saved.

That's not what appears to happen on the test platform as per the
posted log, but nevertheless it can happen.

Arguably, the driver should not need to worry about that, so that
needs to be addressed in the PCI core IMO.

The attached patch should help (not inline to avoid gmail-induced
white space damage).

I will post it properly with a changelog shortly.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pci-pm-skip-noirq.patch
Type: text/x-patch
Size: 1957 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20190517/036ffbe1/attachment.bin>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-17  8:39                   ` Rafael J. Wysocki
@ 2019-05-17  9:05                     ` Christoph Hellwig
  2019-05-17  9:17                       ` Rafael J. Wysocki
  2019-05-17  9:22                     ` Kai-Heng Feng
  1 sibling, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2019-05-17  9:05 UTC (permalink / raw)


On Fri, May 17, 2019@10:39:19AM +0200, Rafael J. Wysocki wrote:
> I forgot about one thing which is relevant here, sorry about that.

While we got your attention, let me repeat two questions / requests
that seem to have got lost:

 (1) in what system power states are the devices not allowed to
     ever use DMA to access host memory?  Disabling the host memory
     buffer for NVMe devices can be somewhat expensive, so we'd prefer
     to only do that if we really have to
 (2) can we get some good kerneldoc comments for
     pm_suspend_via_firmware and pm_suspend_via_s2idle explaining
     when exactly they will return true, and how a driver can make
     use of these facts?  Right now these appear a little too much
     black magic to be useful

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-16 14:24     ` Keith Busch
@ 2019-05-17  9:08       ` Christoph Hellwig
  0 siblings, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2019-05-17  9:08 UTC (permalink / raw)


On Thu, May 16, 2019@08:24:11AM -0600, Keith Busch wrote:
> > I've still not heard an explanation of anyone why we need to disable
> > the HMB here. Even if there are states where we have to disable it
> > we need to restrict it to just those system power states where we have
> > to, as reclaiming and recreating the HMB is very costly for the device.
> 
> We're not really reclaiming the HMB here. We're just telling the
> controller to stop using it for a moment,

Well, at that point the device can't use it and has to reclaim that
memory internally, depending on how it is used.

> and we return it back to the
> device in the same condition it left it via NVME_HOST_MEM_RETURN. That
> should minimize bringup latency.

Assuming it can just suspend that usage, which might not always be true.

Because of that we should only disable the HMB if we have to.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-17  9:05                     ` Christoph Hellwig
@ 2019-05-17  9:17                       ` Rafael J. Wysocki
  2019-05-17  9:35                         ` Christoph Hellwig
  0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-05-17  9:17 UTC (permalink / raw)


On Fri, May 17, 2019@11:05 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Fri, May 17, 2019@10:39:19AM +0200, Rafael J. Wysocki wrote:
> > I forgot about one thing which is relevant here, sorry about that.
>
> While we got your attention, let me repeat two questions / requests
> that seem to have got lost:
>
>  (1) in what system power states are the devices not allowed to
>      every use DMA to access host memory?  Disabling the host memory
>      buffer for NVMe devices can be somewhat expensive, so we'd prefer
>      to only do that if we really have to

AFAICS, using DMA while creating a hibernation snapshot image of
memory would not be a good idea, so the host memory buffer should be
disabled during the "freeze" stage of hibernation (and it can be
re-enabled when that stage is complete).

Other than this I don't see hard reasons for disabling DMA transfers
unless that is a prerequisite for putting the device into D3.

>  (2) can we get some good kerneldoc comments for
>      pm_suspend_via_firmware and pm_suspend_via_s2idle explaining
>      when exactly they will return true, and how a driver can make
>      use of these facts?

I suppose so. :-)

That's in my list of things to do in fact.

>  Right now these appear a little too much black magic to be useful

Fair enough.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-17  8:39                   ` Rafael J. Wysocki
  2019-05-17  9:05                     ` Christoph Hellwig
@ 2019-05-17  9:22                     ` Kai-Heng Feng
  2019-05-17  9:32                       ` Rafael J. Wysocki
  1 sibling, 1 reply; 51+ messages in thread
From: Kai-Heng Feng @ 2019-05-17  9:22 UTC (permalink / raw)


at 16:39, Rafael J. Wysocki <rafael@kernel.org> wrote:

> On Thu, May 16, 2019 at 10:56 PM Rafael J. Wysocki <rafael at kernel.org>  
> wrote:
>> On Thu, May 16, 2019@10:45 PM Keith Busch <kbusch@kernel.org> wrote:
>>> On Thu, May 16, 2019@10:25:47PM +0200, Rafael J. Wysocki wrote:
>>>> On Thu, May 16, 2019@9:43 PM Keith Busch <kbusch@kernel.org> wrote:
>>>>> If we're going to replace our SIMPLE_DEV_PM_OPS as Rafael suggests,
>>>>> might as well add a .suspend_noirq callback. We can just save the
>>>>> state again to work around this too.
>>>>
>>>> Relax, pretty please.  Saving the state again shouldn't be necessary.
>>>>
>>>> Let's first understand what's going on.
>>>
>>> Hah, okay fair enough.
>>>
>>> FWIW, I've tried current mainline on two different platforms and I see
>>> only the expected number of calls to pci_pm_suspend_noirq, so everything
>>> works for me.
>>
>> And I don't see why it might not work anywhere else other than a
>> kernel source hacked into pieces or a platform with fundamental
>> problems. :-)
>
> I forgot about one thing which is relevant here, sorry about that.
>
> If there is a spurious EC wakeup during s2idle, pci_pm_suspend_noirq()
> *will* be run again after pci_pm_resume_noirq() without going through
> the entire system resume and next suspend.  In that case the second
> iteration of pci_pm_suspend_noirq() will put the device into D3, if
> that's the target power state of the device as determined by
> pci_prepare_to_sleep(), because pci_pm_resume_noirq() calls
> pci_restore_state() which clears state_saved.
>
> That's not what appears to happen on the test platform as per the
> posted log, but nevertheless it can happen.

The patch Mario pointed to solves the issue for the test platform I used.

I've tested a dozen NVMes, all of them work properly with Keith's patch
series, once the LPIT fix is applied.

>
> Arguably, the driver should not need to worry about that, so that
> needs to be addressed in the PCI core IMO.
>
> The attached patch should help (not inline to avoid gmail-induced
> white space damage).
>
> I will post it properly with a changelog shortly.
> <pci-pm-skip-noirq.patch>

Since the patch Mario suggested solves the issue, do you still want me to  
test this patch?

Kai-Heng

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-17  9:22                     ` Kai-Heng Feng
@ 2019-05-17  9:32                       ` Rafael J. Wysocki
  0 siblings, 0 replies; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-05-17  9:32 UTC (permalink / raw)


On Fri, May 17, 2019 at 11:23 AM Kai-Heng Feng
<kai.heng.feng@canonical.com> wrote:
>
>@16:39, Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> > On Thu, May 16, 2019 at 10:56 PM Rafael J. Wysocki <rafael at kernel.org>
> > wrote:
> >> On Thu, May 16, 2019@10:45 PM Keith Busch <kbusch@kernel.org> wrote:
> >>> On Thu, May 16, 2019@10:25:47PM +0200, Rafael J. Wysocki wrote:
> >>>> On Thu, May 16, 2019@9:43 PM Keith Busch <kbusch@kernel.org> wrote:
> >>>>> If we're going to replace our SIMPLE_DEV_PM_OPS as Rafael suggests,
> >>>>> might as well add a .suspend_noirq callback. We can just save the
> >>>>> state again to work around this too.
> >>>>
> >>>> Relax, pretty please.  Saving the state again shouldn't be necessary.
> >>>>
> >>>> Let's first understand what's going on.
> >>>
> >>> Hah, okay fair enough.
> >>>
> >>> FWIW, I've tried current mainline on two different platforms and I see
> >>> only the expected number of calls to pci_pm_suspend_noirq, so everything
> >>> works for me.
> >>
> >> And I don't see why it might not work anywhere else other than a
> >> kernel source hacked into pieces or a platform with fundamental
> >> problems. :-)
> >
> > I forgot about one thing which is relevant here, sorry about that.
> >
> > If there is a spurious EC wakeup during s2idle, pci_pm_suspend_noirq()
> > *will* be run again after pci_pm_resume_noirq() without going through
> > the entire system resume and next suspend.  In that case the second
> > iteration of pci_pm_suspend_noirq() will put the device into D3, if
> > that's the target power state of the device as determined by
> > pci_prepare_to_sleep(), because pci_pm_resume_noirq() calls
> > pci_restore_state() which clears state_saved.
> >
> > That's not what appears to happen on the test platform as per the
> > posted log, but nevertheless it can happen.
>
> The patch Mario pointed to solves the issue for the test platform I used.
>
> I've tested a dozen NVMes, all of them work properly with Keith's patch
> series, once the LPIT fix is applied.
>
> >
> > Arguably, the driver should not need to worry about that, so that
> > needs to be addressed in the PCI core IMO.
> >
> > The attached patch should help (not inline to avoid gmail-induced
> > white space damage).
> >
> > I will post it properly with a changelog shortly.
> > <pci-pm-skip-noirq.patch>
>
> Since the patch Mario suggested solves the issue, do you still want me to
> test this patch?

I don't think it will matter in that case.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-17  9:17                       ` Rafael J. Wysocki
@ 2019-05-17  9:35                         ` Christoph Hellwig
  2019-05-17 10:34                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2019-05-17  9:35 UTC (permalink / raw)


On Fri, May 17, 2019@11:17:52AM +0200, Rafael J. Wysocki wrote:
> AFAICS, using DMA while creating a hibernation snapshot image of
> memory would not be a good idea, so the host memory buffer should be
> disabled during the "freeze" stage or hibernation (and it can be
> re-enabled when that stage is complete).

Once we'd disable the HMB the harm is done, so no point in re-enabling
it.  What would this mean in terms of ops?  Only disable in
freeze_noirq?  Or stick to the simple ops and find another helper
that indicates we are doing a hibernation?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-17  9:35                         ` Christoph Hellwig
@ 2019-05-17 10:34                           ` Rafael J. Wysocki
  2019-05-22  6:47                             ` Kai Heng Feng
  0 siblings, 1 reply; 51+ messages in thread
From: Rafael J. Wysocki @ 2019-05-17 10:34 UTC (permalink / raw)


On Fri, May 17, 2019@11:35 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Fri, May 17, 2019@11:17:52AM +0200, Rafael J. Wysocki wrote:
> > AFAICS, using DMA while creating a hibernation snapshot image of
> > memory would not be a good idea, so the host memory buffer should be
> > disabled during the "freeze" stage or hibernation (and it can be
> > re-enabled when that stage is complete).
>
> Once we'd disable the HMB the harm is done, so no point in re-enabling it.

But after the "freeze" stage the image may need to be written to the
device, so I guess it would be useful to enable the HMB for that?

> What would this mean in terms of ops?  Only disable in freeze_noirq?

It can be disabled in any of the "freeze" callbacks (freeze,
freeze_late, freeze_noirq) where it makes most sense.  I also would
re-enable it in the complementary "thaw" one, in case the device will
be used to store the image.

> Or stick to the simple ops and find another helper that indicates we are doing a hibernation?

There are no helpers for that, it needs to be done manually, sort of.
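
As a very rough sketch of that manual handling (hypothetical callback names,
reusing the driver's existing nvme_set_host_mem() Set Features helper, and
assuming the controller is kept operational across the freeze stage):

static int nvme_pm_freeze(struct device *dev)
{
	struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));

	/* No DMA into host memory while the snapshot image is created */
	if (ndev->host_mem_descs)
		return nvme_set_host_mem(ndev, 0);
	return 0;
}

static int nvme_pm_thaw(struct device *dev)
{
	struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));

	/* Hand the untouched buffer back before the image gets written out */
	if (ndev->host_mem_descs)
		return nvme_set_host_mem(ndev, NVME_HOST_MEM_ENABLE |
					       NVME_HOST_MEM_RETURN);
	return 0;
}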

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling
  2019-05-16  2:43 ` [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling Ming Lei
@ 2019-05-17 18:40   ` Keith Busch
  0 siblings, 0 replies; 51+ messages in thread
From: Keith Busch @ 2019-05-17 18:40 UTC (permalink / raw)


I've applied 1-4 for 5.2. The remaining patches in the series are new features
that need further consideration, and I'd feel better targeting the next
merge window.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-17 10:34                           ` Rafael J. Wysocki
@ 2019-05-22  6:47                             ` Kai Heng Feng
  2019-05-22 15:52                               ` Christoph Hellwig
  0 siblings, 1 reply; 51+ messages in thread
From: Kai Heng Feng @ 2019-05-22  6:47 UTC (permalink / raw)


at 6:34 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:

> On Fri, May 17, 2019@11:35 AM Christoph Hellwig <hch@lst.de> wrote:
>> On Fri, May 17, 2019@11:17:52AM +0200, Rafael J. Wysocki wrote:
>>> AFAICS, using DMA while creating a hibernation snapshot image of
>>> memory would not be a good idea, so the host memory buffer should be
>>> disabled during the "freeze" stage or hibernation (and it can be
>>> re-enabled when that stage is complete).
>>
>> Once we'd disable the HMB the harm is done, so no point in re-enabling it.
>
> But after the "freeze" stage the image may need to be written to the
> device, so I guess it would be useful to enable the HMB for that?
>
>> What would this mean in terms of ops?  Only disable in freeze_noirq?
>
> It can be disabled in any of the "freeze" callbacks (freeze,
> freeze_late, freeze_irq) where it makes most sense.  I also would
> re-enable it in the complementary "thaw" one, in case the device will
> be used to store the image.
>
>> Or stick to the simple ops and find another helper that indicates we are  
>> doing a hibernation?
>
> There are no helpers for that, it needs to be done manually, sort of.

I've got confirmation from Vendor, disabling HMB isn't necessary when the
host stays at D0, which is what we use in S2I.
I've done some testing and I can confirm S2I suspend/resume can work
without disabling HMB.

Kai-Heng

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-22  6:47                             ` Kai Heng Feng
@ 2019-05-22 15:52                               ` Christoph Hellwig
  2019-05-22 16:02                                 ` Keith Busch
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2019-05-22 15:52 UTC (permalink / raw)


On Wed, May 22, 2019@02:47:57PM +0800, Kai Heng Feng wrote:
> I've got confirmation from Vendor, disabling HMB isn't necessary when the
> host stays at D0, which is what we use in S2I.
> I've done some testing and I can confirm S2I suspend/resume can work
> without disabling HMB.

Great.  Who is going to cook up a new patch taking all the recent
findings into account?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-22 15:52                               ` Christoph Hellwig
@ 2019-05-22 16:02                                 ` Keith Busch
  2019-05-22 16:35                                   ` Mario.Limonciello
  0 siblings, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-05-22 16:02 UTC (permalink / raw)


On Wed, May 22, 2019@05:52:53PM +0200, Christoph Hellwig wrote:
> On Wed, May 22, 2019@02:47:57PM +0800, Kai Heng Feng wrote:
> > I've got confirmation from Vendor, disabling HMB isn't necessary when the
> > host stays at D0, which is what we use in S2I.
> > I've done some testing and I can confirm S2I suspend/resume can work
> > without disabling HMB.
> 
> Great.  Who is going to cook up a new patch taking all the recent
> findings into account?

I've a branch here that I'll send to the mailing list after some testing
this afternoon

  https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/log/?h=nvme-power

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-22 16:02                                 ` Keith Busch
@ 2019-05-22 16:35                                   ` Mario.Limonciello
  2019-05-22 16:56                                     ` Keith Busch
  2019-05-22 23:08                                     ` Keith Busch
  0 siblings, 2 replies; 51+ messages in thread
From: Mario.Limonciello @ 2019-05-22 16:35 UTC (permalink / raw)


>I've a branch here that I'll send to the mailing list after some testing
>this afternoon

 > https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/log/?h=nvme-power

Make sure you do your testing with Rafael's patch that keeps the device in D0 across the
various suspend steps, I didn't see it obviously on your branch in the last 2 weeks of
commits.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-22 16:35                                   ` Mario.Limonciello
@ 2019-05-22 16:56                                     ` Keith Busch
  2019-05-22 23:08                                     ` Keith Busch
  1 sibling, 0 replies; 51+ messages in thread
From: Keith Busch @ 2019-05-22 16:56 UTC (permalink / raw)


On Wed, May 22, 2019@04:35:50PM +0000, Mario.Limonciello@dell.com wrote:
> >I've a branch here that I'll send to the mailing list after some testing
> >this afternoon
> 
>  > https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/log/?h=nvme-power
> 
> Make sure you do your testing with Rafael's patch that keeps the device in D0 across the
> various suspend steps, I didn't see it obviously on your branch in the last 2 weeks of
> commits.

Sure, I'll test with that fix included in the kernel image, though its
upstream destiny will not be through the nvme tree.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-22 16:35                                   ` Mario.Limonciello
  2019-05-22 16:56                                     ` Keith Busch
@ 2019-05-22 23:08                                     ` Keith Busch
  2019-05-23 15:27                                       ` Keith Busch
  1 sibling, 1 reply; 51+ messages in thread
From: Keith Busch @ 2019-05-22 23:08 UTC (permalink / raw)


On Wed, May 22, 2019@04:35:50PM +0000, Mario.Limonciello@dell.com wrote:
> >I've a branch here that I'll send to the mailing list after some testing
> >this afternoon
> 
>  > https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/log/?h=nvme-power
> 
> Make sure you do your testing with Rafael's patch that keeps the device in D0 across the
> various suspend steps; I didn't see it obviously on your branch in the last 2 weeks of
> commits.

Well, I may have picked an unlucky rebase point. I can't run the tests,
but it doesn't look like it has anything to do with the nvme changes. I'll
go back to my original stable point and try again tomorrow.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend
  2019-05-22 23:08                                     ` Keith Busch
@ 2019-05-23 15:27                                       ` Keith Busch
  0 siblings, 0 replies; 51+ messages in thread
From: Keith Busch @ 2019-05-23 15:27 UTC (permalink / raw)


On Wed, May 22, 2019@04:08:10PM -0700, Keith Busch wrote:
> On Wed, May 22, 2019@04:35:50PM +0000, Mario.Limonciello@dell.com wrote:
> > >I've a branch here that I'll send to the mailing list after some testing
> > >this afternoon
> > 
> >  > https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/log/?h=nvme-power
> > 
> > Make sure you do your testing with Rafael's patch that keeps the device in D0 across the
> > various suspend steps; I didn't see it obviously on your branch in the last 2 weeks of
> > commits.
> 
> Well, I may have picked an unlucky rebase point. I can't run the tests,
> but it doesn't look like it has anything to do with the nvme changes. I'll
> go back to my original stable point and try again tomorrow.

The kernel repo was fine; I just had the wrong kconfig, which was intended for my
minimal VMs.

Tests are successful now (though still only faking deeper NPSS); I'll post the
new series shortly.

^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2019-05-23 15:27 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-15 16:36 [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling Keith Busch
2019-05-15 16:36 ` [PATCH 2/6] nvme-pci: Don't disable on timeout in reset state Keith Busch
2019-05-16  3:07   ` Ming Lei
2019-05-16 14:33     ` Keith Busch
2019-05-16  6:27   ` Christoph Hellwig
2019-05-15 16:36 ` [PATCH 3/6] nvme-pci: Unblock reset_work on IO failure Keith Busch
2019-05-16  3:13   ` Ming Lei
2019-05-16 14:14     ` Keith Busch
2019-05-17  2:31       ` Ming Lei
2019-05-16  6:28   ` Christoph Hellwig
2019-05-15 16:36 ` [PATCH 4/6] nvme-pci: Sync queues on reset Keith Busch
2019-05-16  3:34   ` Ming Lei
2019-05-16  6:29   ` Christoph Hellwig
2019-05-16 14:08     ` Keith Busch
2019-05-16 13:43   ` Minwoo Im
2019-05-15 16:36 ` [PATCH 5/6] nvme: Export get and set features Keith Busch
2019-05-16  6:26   ` Christoph Hellwig
2019-05-16 13:47   ` Minwoo Im
2019-05-15 16:36 ` [PATCHv2 6/6] nvme-pci: Use host managed power state for suspend Keith Busch
2019-05-15 19:33   ` Mario.Limonciello
2019-05-15 19:34     ` Keith Busch
2019-05-15 19:43       ` Mario.Limonciello
2019-05-16  6:25   ` Christoph Hellwig
2019-05-16 14:24     ` Keith Busch
2019-05-17  9:08       ` Christoph Hellwig
2019-05-16  9:29   ` Rafael J. Wysocki
2019-05-16 14:26     ` Keith Busch
2019-05-16 18:27       ` Kai-Heng Feng
2019-05-16 18:33         ` Mario.Limonciello
2019-05-16 19:38           ` Keith Busch
2019-05-16 20:25             ` Rafael J. Wysocki
2019-05-16 20:39               ` Keith Busch
2019-05-16 20:56                 ` Rafael J. Wysocki
2019-05-17  8:39                   ` Rafael J. Wysocki
2019-05-17  9:05                     ` Christoph Hellwig
2019-05-17  9:17                       ` Rafael J. Wysocki
2019-05-17  9:35                         ` Christoph Hellwig
2019-05-17 10:34                           ` Rafael J. Wysocki
2019-05-22  6:47                             ` Kai Heng Feng
2019-05-22 15:52                               ` Christoph Hellwig
2019-05-22 16:02                                 ` Keith Busch
2019-05-22 16:35                                   ` Mario.Limonciello
2019-05-22 16:56                                     ` Keith Busch
2019-05-22 23:08                                     ` Keith Busch
2019-05-23 15:27                                       ` Keith Busch
2019-05-17  9:22                     ` Kai-Heng Feng
2019-05-17  9:32                       ` Rafael J. Wysocki
2019-05-16 20:24         ` Rafael J. Wysocki
2019-05-16  2:43 ` [PATCH 1/6] nvme-pci: Fix controller freeze wait disabling Ming Lei
2019-05-17 18:40   ` Keith Busch
2019-05-16  6:27 ` Christoph Hellwig
