All of lore.kernel.org
 help / color / mirror / Atom feed
* Make NVME shutdown two-pass - Version 5
@ 2024-01-29 18:19 Jeremy Allison
  2024-01-29 18:19 ` [PATCH 1/5] driver core: Support two-pass driver shutdown Jeremy Allison
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Jeremy Allison @ 2024-01-29 18:19 UTC (permalink / raw)
  To: jallison, jra, tansuresh, hch, gregkh, rafael, bhelgaas, sagi; +Cc: linux-nvme

This is version 5 of a patchset originally written by
Tanjore Suresh <tansuresh@google.com> to make shutdown
of nvme devices two-pass.

Changes from version 4:

1). Isolated 'enum shutdown_type' change into
drivers/nvme/host/pci.c only.

2). Added a new function: nvme_request_shutdown()
that requests shutdown but doesn't wait for completion,
called from within nvme_dev_disable() when shutdown_type
is NVME_PCI_DISABLE_SHUTDOWN_TWOPASS.

I didn't get feedback on my 'bool or enum' request,
so decided to go with the enum as it reduced the amount
of code changes and makes things cleaner (IMHO).

-------------------------------------------------------------
Currently the Linux nvme driver shutdown code steps
through each connected drive, sets the NVME_CC_SHN_NORMAL
(normal shutdown) flag and then polls the given drive
waiting for the response NVME_CSTS_SHST_CMPLT flag
(shutdown complete).

Each drive is taking around 13 seconds to respond to this.

The customer has 20+ drives on the box so this time adds
up on shutdown when the nvme driver is being shut down.

This patchset changes shutdown to proceed in parallel,
so the NVME_CC_SHN_NORMAL (normal shutdown) flag is
sent to all drives first, and then it polls waiting
for the NVME_CSTS_SHST_CMPLT flag (shutdown complete)
for all drives.

In the specific customer case it reduces the NVME
shutdown time from over 300 seconds to around 15
seconds.
-------------------------------------------------------------

Thanks for your consideration,

Jeremy Allison.
CIQ / Samba Team.




^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/5] driver core: Support two-pass driver shutdown.
  2024-01-29 18:19 Make NVME shutdown two-pass - Version 5 Jeremy Allison
@ 2024-01-29 18:19 ` Jeremy Allison
  2024-01-30 10:00   ` Sagi Grimberg
  2024-02-14  3:36   ` Saravana Kannan
  2024-01-29 18:19 ` [PATCH 2/5] PCI: Support two-pass shutdown Jeremy Allison
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 20+ messages in thread
From: Jeremy Allison @ 2024-01-29 18:19 UTC (permalink / raw)
  To: jallison, jra, tansuresh, hch, gregkh, rafael, bhelgaas, sagi; +Cc: linux-nvme

From: Tanjore Suresh <tansuresh@google.com>

This changes the bus driver interface with an additional entry point
to enable devices to implement two-pass shutdown. The existing
synchronous interface to shutdown is called, and if a shutdown_wait
method is defined the device is moved to an alternate list.

Once the shutdown method is called for all devices, the
shutdown_wait method is then called synchronously for
all devices on the alternate list.

Signed-off-by: Tanjore Suresh <tansuresh@google.com>
Signed-off-by: Jeremy Allison <jallison@ciq.com>
---
 drivers/base/core.c        | 35 +++++++++++++++++++++++++++++++++++
 include/linux/device/bus.h |  6 +++++-
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 14d46af40f9a..c8b0d4c9abed 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -4725,6 +4725,7 @@ EXPORT_SYMBOL_GPL(device_change_owner);
 void device_shutdown(void)
 {
 	struct device *dev, *parent;
+	LIST_HEAD(shutdown_wait_list);
 
 	wait_for_device_probe();
 	device_block_probing();
@@ -4769,10 +4770,17 @@ void device_shutdown(void)
 				dev_info(dev, "shutdown_pre\n");
 			dev->class->shutdown_pre(dev);
 		}
+
 		if (dev->bus && dev->bus->shutdown) {
 			if (initcall_debug)
 				dev_info(dev, "shutdown\n");
 			dev->bus->shutdown(dev);
+			/*
+			 * Only put the device on the shutdown_wait_list
+			 * if a shutdown_wait() method is also defined.
+			 */
+			if (dev->bus->shutdown_wait)
+				list_add(&dev->kobj.entry, &shutdown_wait_list);
 		} else if (dev->driver && dev->driver->shutdown) {
 			if (initcall_debug)
 				dev_info(dev, "shutdown\n");
@@ -4789,6 +4797,33 @@ void device_shutdown(void)
 		spin_lock(&devices_kset->list_lock);
 	}
 	spin_unlock(&devices_kset->list_lock);
+
+	/*
+	 * Second pass only for devices that have configured
+	 * a shutdown_wait() method.
+	 */
+	while (!list_empty(&shutdown_wait_list)) {
+		dev = list_entry(shutdown_wait_list.next, struct device,
+				kobj.entry);
+		parent = get_device(dev->parent);
+		get_device(dev);
+		/*
+		 * Make sure the device is off the  list
+		 */
+		list_del_init(&dev->kobj.entry);
+		if (parent)
+			device_lock(parent);
+		device_lock(dev);
+		if (initcall_debug)
+			dev_info(dev,
+			"shutdown_wait called\n");
+		dev->bus->shutdown_wait(dev);
+		device_unlock(dev);
+		if (parent)
+			device_unlock(parent);
+		put_device(dev);
+		put_device(parent);
+	}
 }
 
 /*
diff --git a/include/linux/device/bus.h b/include/linux/device/bus.h
index 5ef4ec1c36c3..a4132c44dac3 100644
--- a/include/linux/device/bus.h
+++ b/include/linux/device/bus.h
@@ -48,7 +48,10 @@ struct fwnode_handle;
  *		will never get called until they do.
  * @remove:	Called when a device removed from this bus.
  * @shutdown:	Called at shut-down time to quiesce the device.
- *
+ * @shutdown_wait:	If this method exists, devices are stored on a separate
+ *			list after shutdown() has been called and
+ *			shutdown_wait() is called synchronously on each device
+ *			in this list in turn.
  * @online:	Called to put the device back online (after offlining it).
  * @offline:	Called to put the device offline for hot-removal. May fail.
  *
@@ -87,6 +90,7 @@ struct bus_type {
 	void (*sync_state)(struct device *dev);
 	void (*remove)(struct device *dev);
 	void (*shutdown)(struct device *dev);
+	void (*shutdown_wait)(struct device *dev);
 
 	int (*online)(struct device *dev);
 	int (*offline)(struct device *dev);
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/5] PCI: Support two-pass shutdown
  2024-01-29 18:19 Make NVME shutdown two-pass - Version 5 Jeremy Allison
  2024-01-29 18:19 ` [PATCH 1/5] driver core: Support two-pass driver shutdown Jeremy Allison
@ 2024-01-29 18:19 ` Jeremy Allison
  2024-01-30 10:00   ` Sagi Grimberg
  2024-01-30 17:54   ` Bjorn Helgaas
  2024-01-29 18:19 ` [PATCH 3/5] nvme: Change 'bool shutdown' into an enum shutdown_type for nvme_dev_disable() Jeremy Allison
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 20+ messages in thread
From: Jeremy Allison @ 2024-01-29 18:19 UTC (permalink / raw)
  To: jallison, jra, tansuresh, hch, gregkh, rafael, bhelgaas, sagi; +Cc: linux-nvme

From: Tanjore Suresh <tansuresh@google.com>

Enhance the base PCI driver to add support for two-pass
shutdown. Add shutdown_wait() method.

Assume a device takes n secs to shutdown. If a machine has been
populated with M such devices, the total time spent in shutting down
all the devices will be M * n secs if the shutdown is done
synchronously. For example, if NVMe PCI Controllers take 5 secs
to shutdown and if there are 16 such NVMe controllers in a system,
system will spend a total of 80 secs to shutdown all
NVMe devices in that system.

In order to speed up the shutdown time, a two-pass interface to
shutdown has been implemented. The caller calls the shutdown method
for each device in turn to allow a shutdown request to be sent,
then the caller walks the list of devices and calls shutdown_wait()
to synchronously wait for the shutdown to complete.

In the NVMe case above, all 16 devices will process the shutdown
in parallel taking the total time to shutdown down to the time
for one NVMe PCI Controller to shut down.

This will significantly reduce the machine reboot time.

Signed-off-by: Tanjore Suresh <tansuresh@google.com>
Signed-off-by: Jeremy Allison <jallison@ciq.com>
---
 drivers/pci/pci-driver.c | 9 +++++++++
 include/linux/pci.h      | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 51ec9e7e784f..257bbb04c806 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -547,6 +547,14 @@ static int pci_restore_standard_config(struct pci_dev *pci_dev)
 }
 #endif /* CONFIG_PM_SLEEP */
 
+static void pci_device_shutdown_wait(struct device *dev)
+{
+	struct pci_dev *pci_dev = to_pci_dev(dev);
+	struct pci_driver *drv = pci_dev->driver;
+
+	if (drv && drv->shutdown_wait)
+		drv->shutdown_wait(pci_dev);
+}
 #ifdef CONFIG_PM
 
 /* Auxiliary functions used for system resume and run-time resume */
@@ -1682,6 +1690,7 @@ struct bus_type pci_bus_type = {
 	.probe		= pci_device_probe,
 	.remove		= pci_device_remove,
 	.shutdown	= pci_device_shutdown,
+	.shutdown_wait	= pci_device_shutdown_wait,
 	.dev_groups	= pci_dev_groups,
 	.bus_groups	= pci_bus_groups,
 	.drv_groups	= pci_drv_groups,
diff --git a/include/linux/pci.h b/include/linux/pci.h
index add9368e6314..5ef014ac84f2 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -917,6 +917,7 @@ struct module;
  *		Useful for enabling wake-on-lan (NIC) or changing
  *		the power state of a device before reboot.
  *		e.g. drivers/net/e100.c.
+ * @shutdown_wait: Optional driver callback to allow two-pass shutdown.
  * @sriov_configure: Optional driver callback to allow configuration of
  *		number of VFs to enable via sysfs "sriov_numvfs" file.
  * @sriov_set_msix_vec_count: PF Driver callback to change number of MSI-X
@@ -947,6 +948,7 @@ struct pci_driver {
 	int  (*suspend)(struct pci_dev *dev, pm_message_t state);	/* Device suspended */
 	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
 	void (*shutdown)(struct pci_dev *dev);
+	void (*shutdown_wait)(struct pci_dev *dev);
 	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
 	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
 	u32  (*sriov_get_vf_total_msix)(struct pci_dev *pf);
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/5] nvme: Change 'bool shutdown' into an enum shutdown_type for nvme_dev_disable().
  2024-01-29 18:19 Make NVME shutdown two-pass - Version 5 Jeremy Allison
  2024-01-29 18:19 ` [PATCH 1/5] driver core: Support two-pass driver shutdown Jeremy Allison
  2024-01-29 18:19 ` [PATCH 2/5] PCI: Support two-pass shutdown Jeremy Allison
@ 2024-01-29 18:19 ` Jeremy Allison
  2024-01-30 11:00   ` Sagi Grimberg
  2024-01-31  6:19   ` Christoph Hellwig
  2024-01-29 18:19 ` [PATCH 4/5] nvme: Add a new exported function nvme_request_shutdown() Jeremy Allison
  2024-01-29 18:19 ` [PATCH 5/5] nvme: Add two-pass shutdown support Jeremy Allison
  4 siblings, 2 replies; 20+ messages in thread
From: Jeremy Allison @ 2024-01-29 18:19 UTC (permalink / raw)
  To: jallison, jra, tansuresh, hch, gregkh, rafael, bhelgaas, sagi; +Cc: linux-nvme

Convert nvme_dev_disable() and nvme_disable_prepare_reset()
inside drivers/nvme/host/pci.c to use this:

bool shutdown = false == NVME_PCI_DISABLE_RESET
bool shutdown = true == (NVME_PCI_DISABLE_SHUTDOWN_SYNC ||
			 NVME_PCI_DISABLE_SHUTDOWN_TWOPASS)

The third state NVME_PCI_DISABLE_SHUTDOWN_TWOPASS
will be used later to implement two-pass PCI nvme
shutdown.

Signed-off-by: Jeremy Allison <jallison@ciq.com>
---
 drivers/nvme/host/pci.c | 47 +++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 18 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 25eb72779541..8ee77f755d9d 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -108,7 +108,14 @@ MODULE_PARM_DESC(noacpi, "disable acpi bios quirks");
 struct nvme_dev;
 struct nvme_queue;
 
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown);
+enum shutdown_type {
+	NVME_PCI_DISABLE_RESET = 0,
+	NVME_PCI_DISABLE_SHUTDOWN = 1,
+	NVME_PCI_DISABLE_SHUTDOWN_TWOPASS = 2
+};
+
+static void nvme_dev_disable(struct nvme_dev *dev,
+		enum shutdown_type shutdown_type);
 static void nvme_delete_io_queues(struct nvme_dev *dev);
 static void nvme_update_attrs(struct nvme_dev *dev);
 
@@ -1331,7 +1338,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 			 "I/O tag %d (%04x) QID %d timeout, disable controller\n",
 			 req->tag, nvme_cid(req), nvmeq->qid);
 		nvme_req(req)->flags |= NVME_REQ_CANCELLED;
-		nvme_dev_disable(dev, true);
+		nvme_dev_disable(dev, NVME_PCI_DISABLE_SHUTDOWN);
 		return BLK_EH_DONE;
 	case NVME_CTRL_RESETTING:
 		return BLK_EH_RESET_TIMER;
@@ -1393,7 +1400,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req)
 	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
 		return BLK_EH_DONE;
 
-	nvme_dev_disable(dev, false);
+	nvme_dev_disable(dev, NVME_PCI_DISABLE_RESET);
 	if (nvme_try_sched_reset(&dev->ctrl))
 		nvme_unquiesce_io_queues(&dev->ctrl);
 	return BLK_EH_DONE;
@@ -2574,11 +2581,14 @@ static bool nvme_pci_ctrl_is_dead(struct nvme_dev *dev)
 	return (csts & NVME_CSTS_CFS) || !(csts & NVME_CSTS_RDY);
 }
 
-static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
+static void nvme_dev_disable(struct nvme_dev *dev,
+		enum shutdown_type shutdown_type)
 {
 	enum nvme_ctrl_state state = nvme_ctrl_state(&dev->ctrl);
 	struct pci_dev *pdev = to_pci_dev(dev->dev);
 	bool dead;
+	bool shutdown = (shutdown_type == NVME_PCI_DISABLE_SHUTDOWN ||
+			 shutdown_type == NVME_PCI_DISABLE_SHUTDOWN_TWOPASS);
 
 	mutex_lock(&dev->shutdown_lock);
 	dead = nvme_pci_ctrl_is_dead(dev);
@@ -2623,11 +2633,12 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
 	mutex_unlock(&dev->shutdown_lock);
 }
 
-static int nvme_disable_prepare_reset(struct nvme_dev *dev, bool shutdown)
+static int nvme_disable_prepare_reset(struct nvme_dev *dev,
+		enum shutdown_type shutdown_type)
 {
 	if (!nvme_wait_reset(&dev->ctrl))
 		return -EBUSY;
-	nvme_dev_disable(dev, shutdown);
+	nvme_dev_disable(dev, shutdown_type);
 	return 0;
 }
 
@@ -2705,7 +2716,7 @@ static void nvme_reset_work(struct work_struct *work)
 	 * moving on.
 	 */
 	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, NVME_PCI_DISABLE_RESET);
 	nvme_sync_queues(&dev->ctrl);
 
 	mutex_lock(&dev->shutdown_lock);
@@ -2783,7 +2794,7 @@ static void nvme_reset_work(struct work_struct *work)
 	dev_warn(dev->ctrl.device, "Disabling device after reset failure: %d\n",
 		 result);
 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
-	nvme_dev_disable(dev, true);
+	nvme_dev_disable(dev, NVME_PCI_DISABLE_SHUTDOWN);
 	nvme_sync_queues(&dev->ctrl);
 	nvme_mark_namespaces_dead(&dev->ctrl);
 	nvme_unquiesce_io_queues(&dev->ctrl);
@@ -3075,7 +3086,7 @@ static int nvme_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 out_disable:
 	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
-	nvme_dev_disable(dev, true);
+	nvme_dev_disable(dev, NVME_PCI_DISABLE_SHUTDOWN);
 	nvme_free_host_mem(dev);
 	nvme_dev_remove_admin(dev);
 	nvme_dbbuf_dma_free(dev);
@@ -3101,7 +3112,7 @@ static void nvme_reset_prepare(struct pci_dev *pdev)
 	 * state as pci_dev device lock is held, making it impossible to race
 	 * with ->remove().
 	 */
-	nvme_disable_prepare_reset(dev, false);
+	nvme_disable_prepare_reset(dev, NVME_PCI_DISABLE_RESET);
 	nvme_sync_queues(&dev->ctrl);
 }
 
@@ -3117,7 +3128,7 @@ static void nvme_shutdown(struct pci_dev *pdev)
 {
 	struct nvme_dev *dev = pci_get_drvdata(pdev);
 
-	nvme_disable_prepare_reset(dev, true);
+	nvme_disable_prepare_reset(dev, NVME_PCI_DISABLE_SHUTDOWN);
 }
 
 /*
@@ -3134,13 +3145,13 @@ static void nvme_remove(struct pci_dev *pdev)
 
 	if (!pci_device_is_present(pdev)) {
 		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
-		nvme_dev_disable(dev, true);
+		nvme_dev_disable(dev, NVME_PCI_DISABLE_SHUTDOWN);
 	}
 
 	flush_work(&dev->ctrl.reset_work);
 	nvme_stop_ctrl(&dev->ctrl);
 	nvme_remove_namespaces(&dev->ctrl);
-	nvme_dev_disable(dev, true);
+	nvme_dev_disable(dev, NVME_PCI_DISABLE_SHUTDOWN);
 	nvme_free_host_mem(dev);
 	nvme_dev_remove_admin(dev);
 	nvme_dbbuf_dma_free(dev);
@@ -3203,7 +3214,7 @@ static int nvme_suspend(struct device *dev)
 	if (pm_suspend_via_firmware() || !ctrl->npss ||
 	    !pcie_aspm_enabled(pdev) ||
 	    (ndev->ctrl.quirks & NVME_QUIRK_SIMPLE_SUSPEND))
-		return nvme_disable_prepare_reset(ndev, true);
+		return nvme_disable_prepare_reset(ndev, NVME_PCI_DISABLE_SHUTDOWN);
 
 	nvme_start_freeze(ctrl);
 	nvme_wait_freeze(ctrl);
@@ -3246,7 +3257,7 @@ static int nvme_suspend(struct device *dev)
 		 * Clearing npss forces a controller reset on resume. The
 		 * correct value will be rediscovered then.
 		 */
-		ret = nvme_disable_prepare_reset(ndev, true);
+		ret = nvme_disable_prepare_reset(ndev, NVME_PCI_DISABLE_SHUTDOWN);
 		ctrl->npss = 0;
 	}
 unfreeze:
@@ -3258,7 +3269,7 @@ static int nvme_simple_suspend(struct device *dev)
 {
 	struct nvme_dev *ndev = pci_get_drvdata(to_pci_dev(dev));
 
-	return nvme_disable_prepare_reset(ndev, true);
+	return nvme_disable_prepare_reset(ndev, NVME_PCI_DISABLE_SHUTDOWN);
 }
 
 static int nvme_simple_resume(struct device *dev)
@@ -3296,10 +3307,10 @@ static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
 		dev_warn(dev->ctrl.device,
 			"frozen state error detected, reset controller\n");
 		if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING)) {
-			nvme_dev_disable(dev, true);
+			nvme_dev_disable(dev, NVME_PCI_DISABLE_SHUTDOWN);
 			return PCI_ERS_RESULT_DISCONNECT;
 		}
-		nvme_dev_disable(dev, false);
+		nvme_dev_disable(dev, NVME_PCI_DISABLE_RESET);
 		return PCI_ERS_RESULT_NEED_RESET;
 	case pci_channel_io_perm_failure:
 		dev_warn(dev->ctrl.device,
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 4/5] nvme: Add a new exported function nvme_request_shutdown().
  2024-01-29 18:19 Make NVME shutdown two-pass - Version 5 Jeremy Allison
                   ` (2 preceding siblings ...)
  2024-01-29 18:19 ` [PATCH 3/5] nvme: Change 'bool shutdown' into an enum shutdown_type for nvme_dev_disable() Jeremy Allison
@ 2024-01-29 18:19 ` Jeremy Allison
  2024-01-30 10:58   ` Sagi Grimberg
  2024-01-29 18:19 ` [PATCH 5/5] nvme: Add two-pass shutdown support Jeremy Allison
  4 siblings, 1 reply; 20+ messages in thread
From: Jeremy Allison @ 2024-01-29 18:19 UTC (permalink / raw)
  To: jallison, jra, tansuresh, hch, gregkh, rafael, bhelgaas, sagi; +Cc: linux-nvme

Sets the shutdown bit but doesn't wait for ready.
Use from nvme_disable_ctrl(). Export nvme_wait_ready()
so we can call it from drivers/nvme/host/pci.c.

Signed-off-by: Jeremy Allison <jallison@ciq.com>
---
 drivers/nvme/host/core.c | 22 ++++++++++++++++------
 drivers/nvme/host/nvme.h |  3 +++
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a42c347d08e8..13635ef88569 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2215,7 +2215,7 @@ const struct block_device_operations nvme_bdev_ops = {
 	.pr_ops		= &nvme_pr_ops,
 };
 
-static int nvme_wait_ready(struct nvme_ctrl *ctrl, u32 mask, u32 val,
+int nvme_wait_ready(struct nvme_ctrl *ctrl, u32 mask, u32 val,
 		u32 timeout, const char *op)
 {
 	unsigned long timeout_jiffies = jiffies + timeout * HZ;
@@ -2241,18 +2241,28 @@ static int nvme_wait_ready(struct nvme_ctrl *ctrl, u32 mask, u32 val,
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(nvme_wait_ready);
+
+int nvme_request_shutdown(struct nvme_ctrl *ctrl)
+{
+	ctrl->ctrl_config &= ~NVME_CC_SHN_MASK;
+	ctrl->ctrl_config |= NVME_CC_SHN_NORMAL;
+	return ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
+}
+EXPORT_SYMBOL_GPL(nvme_request_shutdown);
 
 int nvme_disable_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
 {
 	int ret;
 
-	ctrl->ctrl_config &= ~NVME_CC_SHN_MASK;
-	if (shutdown)
-		ctrl->ctrl_config |= NVME_CC_SHN_NORMAL;
-	else
+	if (shutdown) {
+		ret = nvme_request_shutdown(ctrl);
+	} else {
+		ctrl->ctrl_config &= ~NVME_CC_SHN_MASK;
 		ctrl->ctrl_config &= ~NVME_CC_ENABLE;
+		ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
+	}
 
-	ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
 	if (ret)
 		return ret;
 
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 1700063bc24d..a3409afbebc5 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -758,6 +758,9 @@ blk_status_t nvme_host_path_error(struct request *req);
 bool nvme_cancel_request(struct request *req, void *data);
 void nvme_cancel_tagset(struct nvme_ctrl *ctrl);
 void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl);
+int nvme_wait_ready(struct nvme_ctrl *ctrl, u32 mask, u32 val,
+		u32 timeout, const char *op);
+int nvme_request_shutdown(struct nvme_ctrl *ctrl);
 bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		enum nvme_ctrl_state new_state);
 int nvme_disable_ctrl(struct nvme_ctrl *ctrl, bool shutdown);
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 5/5] nvme: Add two-pass shutdown support.
  2024-01-29 18:19 Make NVME shutdown two-pass - Version 5 Jeremy Allison
                   ` (3 preceding siblings ...)
  2024-01-29 18:19 ` [PATCH 4/5] nvme: Add a new exported function nvme_request_shutdown() Jeremy Allison
@ 2024-01-29 18:19 ` Jeremy Allison
  2024-01-30 11:15   ` Sagi Grimberg
  4 siblings, 1 reply; 20+ messages in thread
From: Jeremy Allison @ 2024-01-29 18:19 UTC (permalink / raw)
  To: jallison, jra, tansuresh, hch, gregkh, rafael, bhelgaas, sagi; +Cc: linux-nvme

This works with the two-pass shutdown mechanism setup for the PCI
drivers and participates to provide the shutdown_wait
method at the pci_driver structure level.

This patch changes the nvme shutdown() method to pass
down the NVME_PCI_DISABLE_SHUTDOWN_TWOPASS enum value instead
of NVME_PCI_DISABLE_SHUTDOWN. nvme_dev_disable() is changed
to call nvme_request_shutdown() instead of nvme_disable_ctrl()
in this case.

nvme_request_shutdown() sets the shutdown bit,
but does not wait for completion.

The nvme_shutdown_wait() callback is added to synchronously
wait for the NVME_CSTS_SHST_CMPLT bit meaning the nvme
device has shutdown.

This change speeds up the shutdown in a system which hosts
many controllers.

Based on work by Tanjore Suresh <tansuresh@google.com>

Signed-off-by: Jeremy Allison <jallison@ciq.com>
---
 drivers/nvme/host/pci.c | 39 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 36 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8ee77f755d9d..bbd89eecf05a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2607,7 +2607,14 @@ static void nvme_dev_disable(struct nvme_dev *dev,
 
 	if (!dead && dev->ctrl.queue_count > 0) {
 		nvme_delete_io_queues(dev);
-		nvme_disable_ctrl(&dev->ctrl, shutdown);
+		/*
+		 * NVME_PCI_DISABLE_SHUTDOWN_TWOPASS requests shutdown
+		 * but doesn't wait for completion.
+		 */
+		if (shutdown_type == NVME_PCI_DISABLE_SHUTDOWN_TWOPASS)
+			nvme_request_shutdown(&dev->ctrl);
+		else
+			nvme_disable_ctrl(&dev->ctrl, shutdown);
 		nvme_poll_irqdisable(&dev->queues[0]);
 	}
 	nvme_suspend_io_queues(dev);
@@ -2625,7 +2632,7 @@ static void nvme_dev_disable(struct nvme_dev *dev,
 	 * must flush all entered requests to their failed completion to avoid
 	 * deadlocking blk-mq hot-cpu notifier.
 	 */
-	if (shutdown) {
+	if (shutdown_type == NVME_PCI_DISABLE_SHUTDOWN) {
 		nvme_unquiesce_io_queues(&dev->ctrl);
 		if (dev->ctrl.admin_q && !blk_queue_dying(dev->ctrl.admin_q))
 			nvme_unquiesce_admin_queue(&dev->ctrl);
@@ -3128,7 +3135,32 @@ static void nvme_shutdown(struct pci_dev *pdev)
 {
 	struct nvme_dev *dev = pci_get_drvdata(pdev);
 
-	nvme_disable_prepare_reset(dev, NVME_PCI_DISABLE_SHUTDOWN);
+	nvme_disable_prepare_reset(dev, NVME_PCI_DISABLE_SHUTDOWN_TWOPASS);
+}
+
+static void nvme_shutdown_wait(struct pci_dev *pdev)
+{
+	struct nvme_dev *dev = pci_get_drvdata(pdev);
+
+	mutex_lock(&dev->shutdown_lock);
+	/*
+	 * Finish waiting for the shutdown request
+	 * initiated in nvme_shutdown() above using
+	 * NVME_PCI_DISABLE_SHUTDOWN_TWOPASS.
+	 */
+	nvme_wait_ready(&dev->ctrl, NVME_CSTS_SHST_MASK,
+			NVME_CSTS_SHST_CMPLT,
+			dev->ctrl.shutdown_timeout, "shutdown");
+	/*
+	 * The driver will not be starting up queues again if shutting down so
+	 * must flush all entered requests to their failed completion to avoid
+	 * deadlocking blk-mq hot-cpu notifier.
+	 */
+	nvme_unquiesce_io_queues(&dev->ctrl);
+	if (dev->ctrl.admin_q && !blk_queue_dying(dev->ctrl.admin_q))
+		nvme_unquiesce_admin_queue(&dev->ctrl);
+
+	mutex_unlock(&dev->shutdown_lock);
 }
 
 /*
@@ -3522,6 +3554,7 @@ static struct pci_driver nvme_driver = {
 	.probe		= nvme_probe,
 	.remove		= nvme_remove,
 	.shutdown	= nvme_shutdown,
+	.shutdown_wait	= nvme_shutdown_wait,
 	.driver		= {
 		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
 #ifdef CONFIG_PM_SLEEP
-- 
2.39.3



^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] driver core: Support two-pass driver shutdown.
  2024-01-29 18:19 ` [PATCH 1/5] driver core: Support two-pass driver shutdown Jeremy Allison
@ 2024-01-30 10:00   ` Sagi Grimberg
  2024-02-14  3:36   ` Saravana Kannan
  1 sibling, 0 replies; 20+ messages in thread
From: Sagi Grimberg @ 2024-01-30 10:00 UTC (permalink / raw)
  To: Jeremy Allison, jra, tansuresh, hch, gregkh, rafael, bhelgaas; +Cc: linux-nvme

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] PCI: Support two-pass shutdown
  2024-01-29 18:19 ` [PATCH 2/5] PCI: Support two-pass shutdown Jeremy Allison
@ 2024-01-30 10:00   ` Sagi Grimberg
  2024-01-30 17:54   ` Bjorn Helgaas
  1 sibling, 0 replies; 20+ messages in thread
From: Sagi Grimberg @ 2024-01-30 10:00 UTC (permalink / raw)
  To: Jeremy Allison, jra, tansuresh, hch, gregkh, rafael, bhelgaas; +Cc: linux-nvme

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] nvme: Add a new exported function nvme_request_shutdown().
  2024-01-29 18:19 ` [PATCH 4/5] nvme: Add a new exported function nvme_request_shutdown() Jeremy Allison
@ 2024-01-30 10:58   ` Sagi Grimberg
  2024-01-30 17:31     ` Jeremy Allison
  0 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2024-01-30 10:58 UTC (permalink / raw)
  To: Jeremy Allison, jra, tansuresh, hch, gregkh, rafael, bhelgaas; +Cc: linux-nvme


> Sets the shutdown bit but doesn't wait for ready.
> Use from nvme_disable_ctrl(). Export nvme_wait_ready()
> so we can call it from drivers/nvme/host/pci.c.
> 
> Signed-off-by: Jeremy Allison <jallison@ciq.com>
> ---
>   drivers/nvme/host/core.c | 22 ++++++++++++++++------
>   drivers/nvme/host/nvme.h |  3 +++
>   2 files changed, 19 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index a42c347d08e8..13635ef88569 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -2215,7 +2215,7 @@ const struct block_device_operations nvme_bdev_ops = {
>   	.pr_ops		= &nvme_pr_ops,
>   };
>   
> -static int nvme_wait_ready(struct nvme_ctrl *ctrl, u32 mask, u32 val,
> +int nvme_wait_ready(struct nvme_ctrl *ctrl, u32 mask, u32 val,
>   		u32 timeout, const char *op)
>   {
>   	unsigned long timeout_jiffies = jiffies + timeout * HZ;
> @@ -2241,18 +2241,28 @@ static int nvme_wait_ready(struct nvme_ctrl *ctrl, u32 mask, u32 val,
>   
>   	return ret;
>   }
> +EXPORT_SYMBOL_GPL(nvme_wait_ready);
> +
> +int nvme_request_shutdown(struct nvme_ctrl *ctrl)
> +{
> +	ctrl->ctrl_config &= ~NVME_CC_SHN_MASK;
> +	ctrl->ctrl_config |= NVME_CC_SHN_NORMAL;
> +	return ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
> +}
> +EXPORT_SYMBOL_GPL(nvme_request_shutdown);

I think the name can be better, nvme_request_shutdown makes it seem
that you are actually sending a request to the controller.

Maybe nvme_ctrl_shutdown_start ? or as I suggested before
nvme_shutdown_nowait ?

>   
>   int nvme_disable_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
>   {
>   	int ret;
>   
> -	ctrl->ctrl_config &= ~NVME_CC_SHN_MASK;
> -	if (shutdown)
> -		ctrl->ctrl_config |= NVME_CC_SHN_NORMAL;
> -	else
> +	if (shutdown) {
> +		ret = nvme_request_shutdown(ctrl);
> +	} else {
> +		ctrl->ctrl_config &= ~NVME_CC_SHN_MASK;
>   		ctrl->ctrl_config &= ~NVME_CC_ENABLE;
> +		ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
> +	}
>   
> -	ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
>   	if (ret)
>   		return ret;
>   
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 1700063bc24d..a3409afbebc5 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -758,6 +758,9 @@ blk_status_t nvme_host_path_error(struct request *req);
>   bool nvme_cancel_request(struct request *req, void *data);
>   void nvme_cancel_tagset(struct nvme_ctrl *ctrl);
>   void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl);
> +int nvme_wait_ready(struct nvme_ctrl *ctrl, u32 mask, u32 val,
> +		u32 timeout, const char *op);
> +int nvme_request_shutdown(struct nvme_ctrl *ctrl);
>   bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
>   		enum nvme_ctrl_state new_state);
>   int nvme_disable_ctrl(struct nvme_ctrl *ctrl, bool shutdown);


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] nvme: Change 'bool shutdown' into an enum shutdown_type for nvme_dev_disable().
  2024-01-29 18:19 ` [PATCH 3/5] nvme: Change 'bool shutdown' into an enum shutdown_type for nvme_dev_disable() Jeremy Allison
@ 2024-01-30 11:00   ` Sagi Grimberg
  2024-01-31  6:19   ` Christoph Hellwig
  1 sibling, 0 replies; 20+ messages in thread
From: Sagi Grimberg @ 2024-01-30 11:00 UTC (permalink / raw)
  To: Jeremy Allison, jra, tansuresh, hch, gregkh, rafael, bhelgaas; +Cc: linux-nvme

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] nvme: Add two-pass shutdown support.
  2024-01-29 18:19 ` [PATCH 5/5] nvme: Add two-pass shutdown support Jeremy Allison
@ 2024-01-30 11:15   ` Sagi Grimberg
  2024-01-30 18:54     ` Jeremy Allison
  0 siblings, 1 reply; 20+ messages in thread
From: Sagi Grimberg @ 2024-01-30 11:15 UTC (permalink / raw)
  To: Jeremy Allison, jra, tansuresh, hch, gregkh, rafael, bhelgaas; +Cc: linux-nvme


> This works with the two-pass shutdown mechanism setup for the PCI
> drivers and participates to provide the shutdown_wait
> method at the pci_driver structure level.
> 
> This patch changes the nvme shutdown() method to pass
> down the NVME_PCI_DISABLE_SHUTDOWN_TWOPASS enum value instead
> of NVME_PCI_DISABLE_SHUTDOWN. nvme_dev_disable() is changed
> to call nvme_request_shutdown() instead of nvme_disable_ctrl()
> in this case.
> 
> nvme_request_shutdown() sets the shutdown bit,
> but does not wait for completion.
> 
> The nvme_shutdown_wait() callback is added to synchronously
> wait for the NVME_CSTS_SHST_CMPLT bit meaning the nvme
> device has shutdown.
> 
> This change speeds up the shutdown in a system which hosts
> many controllers.
> 
> Based on work by Tanjore Suresh <tansuresh@google.com>
> 
> Signed-off-by: Jeremy Allison <jallison@ciq.com>
> ---
>   drivers/nvme/host/pci.c | 39 ++++++++++++++++++++++++++++++++++++---
>   1 file changed, 36 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 8ee77f755d9d..bbd89eecf05a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2607,7 +2607,14 @@ static void nvme_dev_disable(struct nvme_dev *dev,
>   
>   	if (!dead && dev->ctrl.queue_count > 0) {
>   		nvme_delete_io_queues(dev);
> -		nvme_disable_ctrl(&dev->ctrl, shutdown);
> +		/*
> +		 * NVME_PCI_DISABLE_SHUTDOWN_TWOPASS requests shutdown
> +		 * but doesn't wait for completion.
> +		 */
> +		if (shutdown_type == NVME_PCI_DISABLE_SHUTDOWN_TWOPASS)
> +			nvme_request_shutdown(&dev->ctrl);
> +		else
> +			nvme_disable_ctrl(&dev->ctrl, shutdown);
>   		nvme_poll_irqdisable(&dev->queues[0]);
>   	}
>   	nvme_suspend_io_queues(dev);
> @@ -2625,7 +2632,7 @@ static void nvme_dev_disable(struct nvme_dev *dev,
>   	 * must flush all entered requests to their failed completion to avoid
>   	 * deadlocking blk-mq hot-cpu notifier.
>   	 */
> -	if (shutdown) {
> +	if (shutdown_type == NVME_PCI_DISABLE_SHUTDOWN) {
>   		nvme_unquiesce_io_queues(&dev->ctrl);
>   		if (dev->ctrl.admin_q && !blk_queue_dying(dev->ctrl.admin_q))
>   			nvme_unquiesce_admin_queue(&dev->ctrl);
> @@ -3128,7 +3135,32 @@ static void nvme_shutdown(struct pci_dev *pdev)
>   {
>   	struct nvme_dev *dev = pci_get_drvdata(pdev);
>   
> -	nvme_disable_prepare_reset(dev, NVME_PCI_DISABLE_SHUTDOWN);
> +	nvme_disable_prepare_reset(dev, NVME_PCI_DISABLE_SHUTDOWN_TWOPASS);
> +}
> +
> +static void nvme_shutdown_wait(struct pci_dev *pdev)
> +{
> +	struct nvme_dev *dev = pci_get_drvdata(pdev);
> +
> +	mutex_lock(&dev->shutdown_lock);

General question that just came up, is there any risk here that the
shutdown_lock is released and re-taken later while the shutdown is still 
in progress? I don't spot anything immediate, but perhaps Keith can
comment?

If this change does expose us to potential issues, you can simply
release the shutdown_lock here (while taking it in ?

> +	/*
> +	 * Finish waiting for the shutdown request
> +	 * initiated in nvme_shutdown() above using
> +	 * NVME_PCI_DISABLE_SHUTDOWN_TWOPASS.
> +	 */
> +	nvme_wait_ready(&dev->ctrl, NVME_CSTS_SHST_MASK,
> +			NVME_CSTS_SHST_CMPLT,
> +			dev->ctrl.shutdown_timeout, "shutdown");
> +	/*
> +	 * The driver will not be starting up queues again if shutting down so
> +	 * must flush all entered requests to their failed completion to avoid
> +	 * deadlocking blk-mq hot-cpu notifier.
> +	 */
> +	nvme_unquiesce_io_queues(&dev->ctrl);
> +	if (dev->ctrl.admin_q && !blk_queue_dying(dev->ctrl.admin_q))
> +		nvme_unquiesce_admin_queue(&dev->ctrl);
> +
> +	mutex_unlock(&dev->shutdown_lock);
>   }
>   
>   /*
> @@ -3522,6 +3554,7 @@ static struct pci_driver nvme_driver = {
>   	.probe		= nvme_probe,
>   	.remove		= nvme_remove,
>   	.shutdown	= nvme_shutdown,
> +	.shutdown_wait	= nvme_shutdown_wait,
>   	.driver		= {
>   		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
>   #ifdef CONFIG_PM_SLEEP


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] nvme: Add a new exported function nvme_request_shutdown().
  2024-01-30 10:58   ` Sagi Grimberg
@ 2024-01-30 17:31     ` Jeremy Allison
  2024-01-31  6:20       ` Christoph Hellwig
  0 siblings, 1 reply; 20+ messages in thread
From: Jeremy Allison @ 2024-01-30 17:31 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jeremy Allison, tansuresh, hch, gregkh, rafael, bhelgaas,
	linux-nvme, jra

On Tue, Jan 30, 2024 at 12:58:31PM +0200, Sagi Grimberg wrote:
>>+	ctrl->ctrl_config &= ~NVME_CC_SHN_MASK;
>>+	ctrl->ctrl_config |= NVME_CC_SHN_NORMAL;
>>+	return ctrl->ops->reg_write32(ctrl, NVME_REG_CC, ctrl->ctrl_config);
>>+}
>>+EXPORT_SYMBOL_GPL(nvme_request_shutdown);
>
>I think the name can be better, nvme_request_shutdown makes it seem
>that you are actually sending a request to the controller.
>
>Maybe nvme_ctrl_shutdown_start ? or as I suggested before
>nvme_shutdown_nowait ?

I think the idea of adding _ctrl_ in the name is a good
one as it makes explicit the level it's operating on.

I think nvme_ctrl_shutdown_start() is the better name,
so I'll use that for the next version.

Thanks !


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] PCI: Support two-pass shutdown
  2024-01-29 18:19 ` [PATCH 2/5] PCI: Support two-pass shutdown Jeremy Allison
  2024-01-30 10:00   ` Sagi Grimberg
@ 2024-01-30 17:54   ` Bjorn Helgaas
  1 sibling, 0 replies; 20+ messages in thread
From: Bjorn Helgaas @ 2024-01-30 17:54 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: jra, tansuresh, hch, gregkh, rafael, bhelgaas, sagi, linux-nvme

On Mon, Jan 29, 2024 at 10:19:50AM -0800, Jeremy Allison wrote:
> From: Tanjore Suresh <tansuresh@google.com>
> 
> Enhance the base PCI driver to add support for two-pass
> shutdown. Add shutdown_wait() method.
> 
> Assume a device takes n secs to shutdown. If a machine has been
> populated with M such devices, the total time spent in shutting down
> all the devices will be M * n secs if the shutdown is done
> synchronously. For example, if NVMe PCI Controllers take 5 secs
> to shutdown and if there are 16 such NVMe controllers in a system,
> system will spend a total of 80 secs to shutdown all
> NVMe devices in that system.
> 
> In order to speed up the shutdown time, a two-pass interface to
> shutdown has been implemented. The caller calls the shutdown method
> for each device in turn to allow a shutdown request to be sent,
> then the caller walks the list of devices and calls shutdown_wait()
> to synchronously wait for the shutdown to complete.
> 
> In the NVMe case above, all 16 devices will process the shutdown
> in parallel taking the total time to shutdown down to the time
> for one NVMe PCI Controller to shut down.
> 
> This will significantly reduce the machine reboot time.
> 
> Signed-off-by: Tanjore Suresh <tansuresh@google.com>
> Signed-off-by: Jeremy Allison <jallison@ciq.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Please rewrap the commit log to fill 75 columns.

> ---
>  drivers/pci/pci-driver.c | 9 +++++++++
>  include/linux/pci.h      | 2 ++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 51ec9e7e784f..257bbb04c806 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -547,6 +547,14 @@ static int pci_restore_standard_config(struct pci_dev *pci_dev)
>  }
>  #endif /* CONFIG_PM_SLEEP */
>  
> +static void pci_device_shutdown_wait(struct device *dev)
> +{
> +	struct pci_dev *pci_dev = to_pci_dev(dev);
> +	struct pci_driver *drv = pci_dev->driver;
> +
> +	if (drv && drv->shutdown_wait)
> +		drv->shutdown_wait(pci_dev);
> +}
>  #ifdef CONFIG_PM
>  
>  /* Auxiliary functions used for system resume and run-time resume */
> @@ -1682,6 +1690,7 @@ struct bus_type pci_bus_type = {
>  	.probe		= pci_device_probe,
>  	.remove		= pci_device_remove,
>  	.shutdown	= pci_device_shutdown,
> +	.shutdown_wait	= pci_device_shutdown_wait,
>  	.dev_groups	= pci_dev_groups,
>  	.bus_groups	= pci_bus_groups,
>  	.drv_groups	= pci_drv_groups,
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index add9368e6314..5ef014ac84f2 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -917,6 +917,7 @@ struct module;
>   *		Useful for enabling wake-on-lan (NIC) or changing
>   *		the power state of a device before reboot.
>   *		e.g. drivers/net/e100.c.
> + * @shutdown_wait: Optional driver callback to allow two-pass shutdown.
>   * @sriov_configure: Optional driver callback to allow configuration of
>   *		number of VFs to enable via sysfs "sriov_numvfs" file.
>   * @sriov_set_msix_vec_count: PF Driver callback to change number of MSI-X
> @@ -947,6 +948,7 @@ struct pci_driver {
>  	int  (*suspend)(struct pci_dev *dev, pm_message_t state);	/* Device suspended */
>  	int  (*resume)(struct pci_dev *dev);	/* Device woken up */
>  	void (*shutdown)(struct pci_dev *dev);
> +	void (*shutdown_wait)(struct pci_dev *dev);
>  	int  (*sriov_configure)(struct pci_dev *dev, int num_vfs); /* On PF */
>  	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
>  	u32  (*sriov_get_vf_total_msix)(struct pci_dev *pf);
> -- 
> 2.39.3
> 
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] nvme: Add two-pass shutdown support.
  2024-01-30 11:15   ` Sagi Grimberg
@ 2024-01-30 18:54     ` Jeremy Allison
  0 siblings, 0 replies; 20+ messages in thread
From: Jeremy Allison @ 2024-01-30 18:54 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jeremy Allison, tansuresh, hch, gregkh, rafael, bhelgaas,
	linux-nvme, jra

On Tue, Jan 30, 2024 at 01:15:43PM +0200, Sagi Grimberg wrote:
>
>>This works with the two-pass shutdown mechanism setup for the PCI
>>drivers and participates to provide the shutdown_wait
>>method at the pci_driver structure level.
>>
>>This patch changes the nvme shutdown() method to pass
>>down the NVME_PCI_DISABLE_SHUTDOWN_TWOPASS enum value instead
>>of NVME_PCI_DISABLE_SHUTDOWN. nvme_dev_disable() is changed
>>to call nvme_request_shutdown() instead of nvme_disable_ctrl()
>>in this case.
>>
>>nvme_request_shutdown() sets the shutdown bit,
>>but does not wait for completion.
>>
>>The nvme_shutdown_wait() callback is added to synchronously
>>wait for the NVME_CSTS_SHST_CMPLT bit meaning the nvme
>>device has shutdown.
>>
>>This change speeds up the shutdown in a system which hosts
>>many controllers.
>>
>>Based on work by Tanjore Suresh <tansuresh@google.com>
>>
>>Signed-off-by: Jeremy Allison <jallison@ciq.com>
>>---
>>  drivers/nvme/host/pci.c | 39 ++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 36 insertions(+), 3 deletions(-)
>>
>>diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>>index 8ee77f755d9d..bbd89eecf05a 100644
>>--- a/drivers/nvme/host/pci.c
>>+++ b/drivers/nvme/host/pci.c
>>@@ -2607,7 +2607,14 @@ static void nvme_dev_disable(struct nvme_dev *dev,
>>  	if (!dead && dev->ctrl.queue_count > 0) {
>>  		nvme_delete_io_queues(dev);
>>-		nvme_disable_ctrl(&dev->ctrl, shutdown);
>>+		/*
>>+		 * NVME_PCI_DISABLE_SHUTDOWN_TWOPASS requests shutdown
>>+		 * but doesn't wait for completion.
>>+		 */
>>+		if (shutdown_type == NVME_PCI_DISABLE_SHUTDOWN_TWOPASS)
>>+			nvme_request_shutdown(&dev->ctrl);
>>+		else
>>+			nvme_disable_ctrl(&dev->ctrl, shutdown);
>>  		nvme_poll_irqdisable(&dev->queues[0]);
>>  	}
>>  	nvme_suspend_io_queues(dev);
>>@@ -2625,7 +2632,7 @@ static void nvme_dev_disable(struct nvme_dev *dev,
>>  	 * must flush all entered requests to their failed completion to avoid
>>  	 * deadlocking blk-mq hot-cpu notifier.
>>  	 */
>>-	if (shutdown) {
>>+	if (shutdown_type == NVME_PCI_DISABLE_SHUTDOWN) {
>>  		nvme_unquiesce_io_queues(&dev->ctrl);
>>  		if (dev->ctrl.admin_q && !blk_queue_dying(dev->ctrl.admin_q))
>>  			nvme_unquiesce_admin_queue(&dev->ctrl);
>>@@ -3128,7 +3135,32 @@ static void nvme_shutdown(struct pci_dev *pdev)
>>  {
>>  	struct nvme_dev *dev = pci_get_drvdata(pdev);
>>-	nvme_disable_prepare_reset(dev, NVME_PCI_DISABLE_SHUTDOWN);
>>+	nvme_disable_prepare_reset(dev, NVME_PCI_DISABLE_SHUTDOWN_TWOPASS);
>>+}
>>+
>>+static void nvme_shutdown_wait(struct pci_dev *pdev)
>>+{
>>+	struct nvme_dev *dev = pci_get_drvdata(pdev);
>>+
>>+	mutex_lock(&dev->shutdown_lock);
>
>General question that just came up, is there any risk here that the
>shutdown_lock is released and re-taken later while the shutdown is 
>still in progress? I don't spot anything immediate, but perhaps Keith 
>can
>comment?

My gut feel (not knowing the kernel nearly as well as Samba
of course :-) is that this isn't a problem.

But that's because this two-stage code path of:

mutex_lock(&dev->shutdown_lock);
..
	set device shutdown bit NVME_CC_SHN_NORMAL
..
mutex_unlock(&dev->shutdown_lock);

followed by:

mutex_lock(&dev->shutdown_lock);
..
	wait for shutdown bit NVME_CSTS_SHST_CMPLT
..
mutex_unlock(&dev->shutdown_lock);

Is only invoked on the shutdown() method of
the nvme device. At this point the system is
going down and user-space processes are already
terminated.

Also the device has been removed from the
devices_kset list under a lock and the operations
are being done under a device_lock(dev).

So I can't see what other things could
go wrong with the two-stage approach.

Happy to learn if I'm wrong though :-).

>If this change does expose us to potential issues, you can simply
>release the shutdown_lock here (while taking it in ?

You appear to have accidently a word here :-).


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] nvme: Change 'bool shutdown' into an enum shutdown_type for nvme_dev_disable().
  2024-01-29 18:19 ` [PATCH 3/5] nvme: Change 'bool shutdown' into an enum shutdown_type for nvme_dev_disable() Jeremy Allison
  2024-01-30 11:00   ` Sagi Grimberg
@ 2024-01-31  6:19   ` Christoph Hellwig
  1 sibling, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2024-01-31  6:19 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: jra, tansuresh, hch, gregkh, rafael, bhelgaas, sagi, linux-nvme

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] nvme: Add a new exported function nvme_request_shutdown().
  2024-01-30 17:31     ` Jeremy Allison
@ 2024-01-31  6:20       ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2024-01-31  6:20 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: Sagi Grimberg, Jeremy Allison, tansuresh, hch, gregkh, rafael,
	bhelgaas, linux-nvme

On Tue, Jan 30, 2024 at 09:31:40AM -0800, Jeremy Allison wrote:
> I think the idea of adding _ctrl_ in the name is a good
> one as it makes explicit the level it's operating on.
>
> I think nvme_ctrl_shutdown_start() is the better name,
> so I'll use that for the next version.

Sounds good to me.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] driver core: Support two-pass driver shutdown.
  2024-01-29 18:19 ` [PATCH 1/5] driver core: Support two-pass driver shutdown Jeremy Allison
  2024-01-30 10:00   ` Sagi Grimberg
@ 2024-02-14  3:36   ` Saravana Kannan
  2024-02-14  8:10     ` Christoph Hellwig
  1 sibling, 1 reply; 20+ messages in thread
From: Saravana Kannan @ 2024-02-14  3:36 UTC (permalink / raw)
  To: Jeremy Allison
  Cc: jra, tansuresh, hch, gregkh, rafael, bhelgaas, sagi, linux-nvme

On Mon, Jan 29, 2024 at 10:19 AM Jeremy Allison <jallison@ciq.com> wrote:
>
> From: Tanjore Suresh <tansuresh@google.com>
>
> This changes the bus driver interface with an additional entry point
> to enable devices to implement two-pass shutdown. The existing
> synchronous interface to shutdown is called, and if a shutdown_wait
> method is defined the device is moved to an alternate list.

This sounds more like you need a parallelized async shutdown than a
two-pass shutdown. Similar to how async probes are done today. Why not
do that so it'll actually be useful in a more general fashion? You can
even add a flag like we have for probes so that drivers don't need to
do anything special to allow this parallelism.

-Saravana

>
> Once the shutdown method is called for all devices, the
> shutdown_wait method is then called synchronously for
> all devices on the alternate list.
>
> Signed-off-by: Tanjore Suresh <tansuresh@google.com>
> Signed-off-by: Jeremy Allison <jallison@ciq.com>
> ---
>  drivers/base/core.c        | 35 +++++++++++++++++++++++++++++++++++
>  include/linux/device/bus.h |  6 +++++-
>  2 files changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 14d46af40f9a..c8b0d4c9abed 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -4725,6 +4725,7 @@ EXPORT_SYMBOL_GPL(device_change_owner);
>  void device_shutdown(void)
>  {
>         struct device *dev, *parent;
> +       LIST_HEAD(shutdown_wait_list);
>
>         wait_for_device_probe();
>         device_block_probing();
> @@ -4769,10 +4770,17 @@ void device_shutdown(void)
>                                 dev_info(dev, "shutdown_pre\n");
>                         dev->class->shutdown_pre(dev);
>                 }
> +
>                 if (dev->bus && dev->bus->shutdown) {
>                         if (initcall_debug)
>                                 dev_info(dev, "shutdown\n");
>                         dev->bus->shutdown(dev);
> +                       /*
> +                        * Only put the device on the shutdown_wait_list
> +                        * if a shutdown_wait() method is also defined.
> +                        */
> +                       if (dev->bus->shutdown_wait)
> +                               list_add(&dev->kobj.entry, &shutdown_wait_list);
>                 } else if (dev->driver && dev->driver->shutdown) {
>                         if (initcall_debug)
>                                 dev_info(dev, "shutdown\n");
> @@ -4789,6 +4797,33 @@ void device_shutdown(void)
>                 spin_lock(&devices_kset->list_lock);
>         }
>         spin_unlock(&devices_kset->list_lock);
> +
> +       /*
> +        * Second pass only for devices that have configured
> +        * a shutdown_wait() method.
> +        */
> +       while (!list_empty(&shutdown_wait_list)) {
> +               dev = list_entry(shutdown_wait_list.next, struct device,
> +                               kobj.entry);
> +               parent = get_device(dev->parent);
> +               get_device(dev);
> +               /*
> +                * Make sure the device is off the  list
> +                */
> +               list_del_init(&dev->kobj.entry);
> +               if (parent)
> +                       device_lock(parent);
> +               device_lock(dev);
> +               if (initcall_debug)
> +                       dev_info(dev,
> +                       "shutdown_wait called\n");
> +               dev->bus->shutdown_wait(dev);
> +               device_unlock(dev);
> +               if (parent)
> +                       device_unlock(parent);
> +               put_device(dev);
> +               put_device(parent);
> +       }
>  }
>
>  /*
> diff --git a/include/linux/device/bus.h b/include/linux/device/bus.h
> index 5ef4ec1c36c3..a4132c44dac3 100644
> --- a/include/linux/device/bus.h
> +++ b/include/linux/device/bus.h
> @@ -48,7 +48,10 @@ struct fwnode_handle;
>   *             will never get called until they do.
>   * @remove:    Called when a device removed from this bus.
>   * @shutdown:  Called at shut-down time to quiesce the device.
> - *
> + * @shutdown_wait:     If this method exists, devices are stored on a separate
> + *                     list after shutdown() has been called and
> + *                     shutdown_wait() is called synchronously on each device
> + *                     in this list in turn.
>   * @online:    Called to put the device back online (after offlining it).
>   * @offline:   Called to put the device offline for hot-removal. May fail.
>   *
> @@ -87,6 +90,7 @@ struct bus_type {
>         void (*sync_state)(struct device *dev);
>         void (*remove)(struct device *dev);
>         void (*shutdown)(struct device *dev);
> +       void (*shutdown_wait)(struct device *dev);
>
>         int (*online)(struct device *dev);
>         int (*offline)(struct device *dev);
> --
> 2.39.3
>
>
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] driver core: Support two-pass driver shutdown.
  2024-02-14  3:36   ` Saravana Kannan
@ 2024-02-14  8:10     ` Christoph Hellwig
  2024-02-14 21:36       ` Saravana Kannan
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2024-02-14  8:10 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Jeremy Allison, jra, tansuresh, hch, gregkh, rafael, bhelgaas,
	sagi, linux-nvme

On Tue, Feb 13, 2024 at 07:36:35PM -0800, Saravana Kannan wrote:
> > This changes the bus driver interface with an additional entry point
> > to enable devices to implement two-pass shutdown. The existing
> > synchronous interface to shutdown is called, and if a shutdown_wait
> > method is defined the device is moved to an alternate list.
> 
> This sounds more like you need a parallelized async shutdown than a
> two-pass shutdown. Similar to how async probes are done today. Why not
> do that so it'll actually be useful in a more general fashion? You can
> even add a flag like we have for probes so that drivers don't need to
> do anything special to allow this parallelism.

Well, we had that discussion before.  The typically time consuming
part in a shutdown is waiting for the device to actually shut down.
An submit and then wait later scheme is more efficient for that,
even if the "simple" async shutdown that just spawns more threads
will also work, while being a lot less efficient.  So unless we
have a strong reason for that the two-phase scheme seems preferable
to me.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] driver core: Support two-pass driver shutdown.
  2024-02-14  8:10     ` Christoph Hellwig
@ 2024-02-14 21:36       ` Saravana Kannan
  2024-02-14 22:10         ` Bjorn Helgaas
  0 siblings, 1 reply; 20+ messages in thread
From: Saravana Kannan @ 2024-02-14 21:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jeremy Allison, jra, tansuresh, gregkh, rafael, bhelgaas, sagi,
	linux-nvme, Android Kernel Team

On Wed, Feb 14, 2024 at 12:10 AM Christoph Hellwig <hch@lst.de> wrote:
>
> On Tue, Feb 13, 2024 at 07:36:35PM -0800, Saravana Kannan wrote:
> > > This changes the bus driver interface with an additional entry point
> > > to enable devices to implement two-pass shutdown. The existing
> > > synchronous interface to shutdown is called, and if a shutdown_wait
> > > method is defined the device is moved to an alternate list.
> >
> > This sounds more like you need a parallelized async shutdown than a
> > two-pass shutdown. Similar to how async probes are done today. Why not
> > do that so it'll actually be useful in a more general fashion? You can
> > even add a flag like we have for probes so that drivers don't need to
> > do anything special to allow this parallelism.
>
> Well, we had that discussion before.

Can you point me to this please? I looked around and wasn't able to find it.

> The typically time consuming
> part in a shutdown is waiting for the device to actually shut down.
> An submit and then wait later scheme is more efficient for that,

This seems like yet another one of those "who needs more than X
levels" issue we run into with initcalls. We are finally getting rid
of that with fw_devlink and device links. This two stage approach
breaks down as soon as there is any kind of dependency order that
needs to be maintained across device shutdown.

> even if the "simple" async shutdown that just spawns more threads
> will also work, while being a lot less efficient.

Lot less efficient how? Shutdown is not a frequent / hot path event. I
don't think the number of threads spawned is an efficiency concern as
long as it's faster and is not fragile to use in a generic fashion for
all devices.

> So unless we
> have a strong reason for that the two-phase scheme seems preferable
> to me.

I think the strong reason is dependency tracking and ordering doesn't
work with a 2 pass system. Dependency levels are often several levels
deep.

-Saravana


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] driver core: Support two-pass driver shutdown.
  2024-02-14 21:36       ` Saravana Kannan
@ 2024-02-14 22:10         ` Bjorn Helgaas
  0 siblings, 0 replies; 20+ messages in thread
From: Bjorn Helgaas @ 2024-02-14 22:10 UTC (permalink / raw)
  To: Saravana Kannan
  Cc: Christoph Hellwig, Jeremy Allison, jra, tansuresh, gregkh,
	rafael, bhelgaas, sagi, linux-nvme, Android Kernel Team

On Wed, Feb 14, 2024 at 01:36:28PM -0800, Saravana Kannan wrote:
> On Wed, Feb 14, 2024 at 12:10 AM Christoph Hellwig <hch@lst.de> wrote:
> > On Tue, Feb 13, 2024 at 07:36:35PM -0800, Saravana Kannan wrote:
> > > > This changes the bus driver interface with an additional entry point
> > > > to enable devices to implement two-pass shutdown. The existing
> > > > synchronous interface to shutdown is called, and if a shutdown_wait
> > > > method is defined the device is moved to an alternate list.
> > >
> > > This sounds more like you need a parallelized async shutdown than a
> > > two-pass shutdown. Similar to how async probes are done today. Why not
> > > do that so it'll actually be useful in a more general fashion? You can
> > > even add a flag like we have for probes so that drivers don't need to
> > > do anything special to allow this parallelism.
> >
> > Well, we had that discussion before.
> 
> Can you point me to this please? I looked around and wasn't able to
> find it.

I'm not sure what Christoph had in mind, but he might be referring to
this:
https://lore.kernel.org/r/20231227203337.GA1509884@bhelgaas

Bjorn


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2024-02-14 22:10 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-29 18:19 Make NVME shutdown two-pass - Version 5 Jeremy Allison
2024-01-29 18:19 ` [PATCH 1/5] driver core: Support two-pass driver shutdown Jeremy Allison
2024-01-30 10:00   ` Sagi Grimberg
2024-02-14  3:36   ` Saravana Kannan
2024-02-14  8:10     ` Christoph Hellwig
2024-02-14 21:36       ` Saravana Kannan
2024-02-14 22:10         ` Bjorn Helgaas
2024-01-29 18:19 ` [PATCH 2/5] PCI: Support two-pass shutdown Jeremy Allison
2024-01-30 10:00   ` Sagi Grimberg
2024-01-30 17:54   ` Bjorn Helgaas
2024-01-29 18:19 ` [PATCH 3/5] nvme: Change 'bool shutdown' into an enum shutdown_type for nvme_dev_disable() Jeremy Allison
2024-01-30 11:00   ` Sagi Grimberg
2024-01-31  6:19   ` Christoph Hellwig
2024-01-29 18:19 ` [PATCH 4/5] nvme: Add a new exported function nvme_request_shutdown() Jeremy Allison
2024-01-30 10:58   ` Sagi Grimberg
2024-01-30 17:31     ` Jeremy Allison
2024-01-31  6:20       ` Christoph Hellwig
2024-01-29 18:19 ` [PATCH 5/5] nvme: Add two-pass shutdown support Jeremy Allison
2024-01-30 11:15   ` Sagi Grimberg
2024-01-30 18:54     ` Jeremy Allison

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.