* [RFC PATCH v2 0/5] vfio/pci: Enable runtime power management support
@ 2022-01-24 18:17 Abhishek Sahu
  2022-01-24 18:17 ` [RFC PATCH v2 1/5] vfio/pci: register vfio-pci driver with runtime PM framework Abhishek Sahu
                   ` (4 more replies)
  0 siblings, 5 replies; 21+ messages in thread
From: Abhishek Sahu @ 2022-01-24 18:17 UTC (permalink / raw)
  To: kvm, Alex Williamson, Cornelia Huck
  Cc: Max Gurtovoy, Yishai Hadas, Zhen Lei, Jason Gunthorpe,
	linux-kernel, Abhishek Sahu

Currently, there is very limited power management support available
in the upstream vfio-pci driver. If there is no user of a vfio-pci
device, then it will be moved into the D3hot state. Similarly, if we
enable runtime power management for the vfio-pci device in the guest
OS, then the device is runtime suspended (for a Linux guest OS) and
the PCI device will be put into the D3hot state (in
vfio_pm_config_write()). If the D3cold state can be used instead of
D3hot, then it will help in saving maximum power. The D3cold state
cannot be reached with native PCI PM alone; it requires interaction
with platform firmware, which is system-specific. To go into low
power states (including D3cold), the runtime PM framework can be
used, which internally interacts with PCI and platform firmware and
puts the device into the lowest possible D-state. This patch series
registers the vfio-pci driver with the runtime PM framework and uses
it to move the physical PCI device into low power states.
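
A user-space sketch of the runtime PM usage-count model this series
relies on (an analogy only, not kernel code; the sim_* names are made
up, standing in for pm_runtime_resume_and_get() and pm_runtime_put()):

```c
#include <assert.h>

/* The device may sit in a low power state only while its usage
 * count is zero; any holder of a reference keeps it active. */
enum dev_power { DEV_ACTIVE, DEV_SUSPENDED };

struct sim_dev {
	int usage_count;
	enum dev_power state;
};

/* Analogous to pm_runtime_resume_and_get(): resume and pin the device */
static void sim_pm_get(struct sim_dev *d)
{
	d->usage_count++;
	d->state = DEV_ACTIVE;
}

/* Analogous to pm_runtime_put(): unpin; an idle device may suspend */
static void sim_pm_put(struct sim_dev *d)
{
	if (--d->usage_count == 0)
		d->state = DEV_SUSPENDED;
}
```

In this model, vfio_pci_core_enable() maps to sim_pm_get() and
vfio_pci_core_disable() to sim_pm_put(); the real framework
additionally lets PCI and platform firmware pick the lowest reachable
D-state.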

The current PM support was added by commit 6eb7018705de ("vfio-pci:
Move idle devices to D3hot power state"), where the following point
was mentioned regarding the D3cold state.

 "It's tempting to try to use D3cold, but we have no reason to inhibit
  hotplug of idle devices and we might get into a loop of having the
  device disappear before we have a chance to try to use it."

With runtime PM, if the user wants to prevent a device from going
into D3cold, then /sys/bus/pci/devices/.../d3cold_allowed can be set
to 0 for that device, instead of disallowing the D3cold state in all
cases.

Since the D3cold state can't be reached by writing the standard PCI
PM config registers, a new IOCTL has been added which moves the PCI
device from D3hot to D3cold state and then from D3cold to D0 state.
Hypervisors can implement virtual ACPI methods. For example, if the
PCI device ACPI node in a Linux guest has _PR3 and _PR0 power
resources with _ON/_OFF methods, then the guest Linux OS makes the
_OFF call during the D3cold transition and the _ON call during the
D0 transition. The hypervisor can trap these virtual ACPI calls and
then issue the D3cold related IOCTL to the vfio driver.

BAR access needs to be disabled if the device is in the D3hot state.
Also, there should not be any config access if the device is in the
D3cold state. This patch series adds that support as well.

Also, during testing, one case was identified where the memory taken
for saving the PCI state was not being freed. That is fixed here as
well.

* Changes in v2

- Rebased patches on v5.17-rc1.
- Included the patch to handle BAR access in D3cold.
- Included the patch to fix memory leak.
- Made a separate IOCTL that can be used to change the power state from
  D3hot to D3cold and D3cold to D0.
- Addressed the review comments given in v1.

Abhishek Sahu (5):
  vfio/pci: register vfio-pci driver with runtime PM framework
  vfio/pci: virtualize PME related register bits and initialize to zero
  vfio/pci: fix memory leak during D3hot to D0 transition
  vfio/pci: Invalidate mmaps and block the access in D3hot power state
  vfio/pci: add the support for PCI D3cold state

 drivers/vfio/pci/vfio_pci.c        |   4 +-
 drivers/vfio/pci/vfio_pci_config.c |  44 +++-
 drivers/vfio/pci/vfio_pci_core.c   | 363 ++++++++++++++++++++++++++---
 drivers/vfio/pci/vfio_pci_rdwr.c   |  20 +-
 include/linux/vfio_pci_core.h      |   6 +
 include/uapi/linux/vfio.h          |  21 ++
 6 files changed, 411 insertions(+), 47 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH v2 1/5] vfio/pci: register vfio-pci driver with runtime PM framework
  2022-01-24 18:17 [RFC PATCH v2 0/5] vfio/pci: Enable runtime power management support Abhishek Sahu
@ 2022-01-24 18:17 ` Abhishek Sahu
  2022-02-16 23:48   ` Alex Williamson
  2022-01-24 18:17 ` [RFC PATCH v2 2/5] vfio/pci: virtualize PME related register bits and initialize to zero Abhishek Sahu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 21+ messages in thread
From: Abhishek Sahu @ 2022-01-24 18:17 UTC (permalink / raw)
  To: kvm, Alex Williamson, Cornelia Huck
  Cc: Max Gurtovoy, Yishai Hadas, Zhen Lei, Jason Gunthorpe,
	linux-kernel, Abhishek Sahu

Currently, there is very limited power management support
available in the upstream vfio-pci driver. If there is no user of a
vfio-pci device, then the PCI device will be moved into the D3hot
state by writing directly into the PCI PM registers. This D3hot state
helps in saving power, but zero power consumption can only be
approached in the D3cold state. The D3cold state cannot be reached
with native PCI PM alone. It requires interaction with platform
firmware, which is system-specific. To go into low power states
(including D3cold), the runtime PM framework can be used, which
internally interacts with PCI and platform firmware and puts the
device into the lowest possible D-state.

This patch registers the vfio-pci driver with the runtime PM framework.

1. The PCI core framework takes care of most of the runtime PM
   related work. To enable runtime PM, the PCI driver needs to
   decrement the usage count and to register the runtime
   suspend/resume callbacks. For the vfio-pci driver, these callback
   routines can be stubbed out in this patch, since the vfio-pci
   driver does not do the PCI device initialization itself. All the
   config state saving and PCI power management related work will be
   done by the PCI core framework itself inside its runtime
   suspend/resume callbacks.

2. Inside pci_reset_bus(), all the devices in the bus/slot will be
   moved into the D0 state. This state change to D0 can happen
   directly without going through the runtime PM framework. So if
   runtime PM is enabled, pm_runtime_resume() first makes the
   runtime state active. Since the PCI device power state is already
   D0, pci_set_power_state() should return early when it tries to
   change the state. Then pm_request_idle() can be used, which will
   internally check the device usage count and move the device into
   the low power state again.

3. Inside vfio_pci_core_disable(), the device usage count that was
   incremented in vfio_pci_core_enable() always needs to be
   decremented.

4. Since the runtime PM framework provides the same functionality,
   directly writing into the PCI PM config register can be replaced
   with the runtime PM routines. Also, the use of runtime PM can
   help us save more power.

   In systems which do not support D3cold,

   With the existing implementation:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3hot
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D0

   With runtime PM:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3hot
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D3hot

   So, with runtime PM, the upstream bridge or root port will also go
   into a lower power state, which is not possible with the existing
   implementation.

   In systems which support D3cold,

   With the existing implementation:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3hot
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D0

   With runtime PM:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3cold
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D3cold

   So, with runtime PM, both the PCI device and the upstream bridge
   will go into the D3cold state.

5. If the 'disable_idle_d3' module parameter is set, then runtime PM
   will still be enabled, but in this case the usage count should
   not be decremented.

6. The vfio_pci_dev_set_try_reset() return value is unused now, so
   this function's return type can be changed to void.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci.c      |  3 +
 drivers/vfio/pci/vfio_pci_core.c | 95 +++++++++++++++++++++++---------
 include/linux/vfio_pci_core.h    |  4 ++
 3 files changed, 75 insertions(+), 27 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index a5ce92beb655..c8695baf3b54 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -193,6 +193,9 @@ static struct pci_driver vfio_pci_driver = {
 	.remove			= vfio_pci_remove,
 	.sriov_configure	= vfio_pci_sriov_configure,
 	.err_handler		= &vfio_pci_core_err_handlers,
+#if defined(CONFIG_PM)
+	.driver.pm              = &vfio_pci_core_pm_ops,
+#endif
 };
 
 static void __init vfio_pci_fill_ids(void)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f948e6cd2993..c6e4fe9088c3 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -152,7 +152,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 }
 
 struct vfio_pci_group_info;
-static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
+static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 				      struct vfio_pci_group_info *groups);
 
@@ -245,7 +245,11 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 	u16 cmd;
 	u8 msix_pos;
 
-	vfio_pci_set_power_state(vdev, PCI_D0);
+	if (!disable_idle_d3) {
+		ret = pm_runtime_resume_and_get(&pdev->dev);
+		if (ret < 0)
+			return ret;
+	}
 
 	/* Don't allow our initial saved state to include busmaster */
 	pci_clear_master(pdev);
@@ -405,8 +409,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 out:
 	pci_disable_device(pdev);
 
-	if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D3hot);
+	vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
+
+	/* Put the pm-runtime usage counter acquired during enable */
+	if (!disable_idle_d3)
+		pm_runtime_put(&pdev->dev);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
 
@@ -1847,19 +1854,20 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 
 	vfio_pci_probe_power_state(vdev);
 
-	if (!disable_idle_d3) {
-		/*
-		 * pci-core sets the device power state to an unknown value at
-		 * bootup and after being removed from a driver.  The only
-		 * transition it allows from this unknown state is to D0, which
-		 * typically happens when a driver calls pci_enable_device().
-		 * We're not ready to enable the device yet, but we do want to
-		 * be able to get to D3.  Therefore first do a D0 transition
-		 * before going to D3.
-		 */
-		vfio_pci_set_power_state(vdev, PCI_D0);
-		vfio_pci_set_power_state(vdev, PCI_D3hot);
-	}
+	/*
+	 * pci-core sets the device power state to an unknown value at
+	 * bootup and after being removed from a driver.  The only
+	 * transition it allows from this unknown state is to D0, which
+	 * typically happens when a driver calls pci_enable_device().
+	 * We're not ready to enable the device yet, but we do want to
+	 * be able to get to D3.  Therefore first do a D0 transition
+	 * before enabling runtime PM.
+	 */
+	vfio_pci_set_power_state(vdev, PCI_D0);
+	pm_runtime_allow(&pdev->dev);
+
+	if (!disable_idle_d3)
+		pm_runtime_put(&pdev->dev);
 
 	ret = vfio_register_group_dev(&vdev->vdev);
 	if (ret)
@@ -1868,7 +1876,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 
 out_power:
 	if (!disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D0);
+		pm_runtime_get_noresume(&pdev->dev);
+
+	pm_runtime_forbid(&pdev->dev);
 out_vf:
 	vfio_pci_vf_uninit(vdev);
 	return ret;
@@ -1887,7 +1897,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
 	vfio_pci_vga_uninit(vdev);
 
 	if (!disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D0);
+		pm_runtime_get_noresume(&pdev->dev);
+
+	pm_runtime_forbid(&pdev->dev);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
 
@@ -2093,33 +2105,62 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
  *  - At least one of the affected devices is marked dirty via
  *    needs_reset (such as by lack of FLR support)
  * Then attempt to perform that bus or slot reset.
- * Returns true if the dev_set was reset.
  */
-static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
+static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
 {
 	struct vfio_pci_core_device *cur;
 	struct pci_dev *pdev;
 	int ret;
 
 	if (!vfio_pci_dev_set_needs_reset(dev_set))
-		return false;
+		return;
 
 	pdev = vfio_pci_dev_set_resettable(dev_set);
 	if (!pdev)
-		return false;
+		return;
 
 	ret = pci_reset_bus(pdev);
 	if (ret)
-		return false;
+		return;
 
 	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
 		cur->needs_reset = false;
-		if (!disable_idle_d3)
-			vfio_pci_set_power_state(cur, PCI_D3hot);
+		if (!disable_idle_d3) {
+			/*
+			 * Inside pci_reset_bus(), all the devices in bus/slot
+			 * will be moved into the D0 state. This state change to
+			 * D0 can happen directly without going through the
+			 * runtime PM framework. pm_runtime_resume() will
+			 * help make the runtime state as active and then
+			 * pm_request_idle() can be used which will
+			 * internally check for device usage count and will
+			 * move the device again into the low power state.
+			 */
+			pm_runtime_resume(&pdev->dev);
+			pm_request_idle(&pdev->dev);
+		}
 	}
-	return true;
 }
 
+#ifdef CONFIG_PM
+static int vfio_pci_core_runtime_suspend(struct device *dev)
+{
+	return 0;
+}
+
+static int vfio_pci_core_runtime_resume(struct device *dev)
+{
+	return 0;
+}
+
+const struct dev_pm_ops vfio_pci_core_pm_ops = {
+	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
+			   vfio_pci_core_runtime_resume,
+			   NULL)
+};
+EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops);
+#endif
+
 void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
 			      bool is_disable_idle_d3)
 {
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index ef9a44b6cf5d..aafe09c9fa64 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -231,6 +231,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
 void vfio_pci_core_disable(struct vfio_pci_core_device *vdev);
 void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev);
 
+#ifdef CONFIG_PM
+extern const struct dev_pm_ops vfio_pci_core_pm_ops;
+#endif
+
 static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
 {
 	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
-- 
2.17.1


* [RFC PATCH v2 2/5] vfio/pci: virtualize PME related register bits and initialize to zero
  2022-01-24 18:17 [RFC PATCH v2 0/5] vfio/pci: Enable runtime power management support Abhishek Sahu
  2022-01-24 18:17 ` [RFC PATCH v2 1/5] vfio/pci: register vfio-pci driver with runtime PM framework Abhishek Sahu
@ 2022-01-24 18:17 ` Abhishek Sahu
  2022-01-24 18:17 ` [RFC PATCH v2 3/5] vfio/pci: fix memory leak during D3hot to D0 transition Abhishek Sahu
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 21+ messages in thread
From: Abhishek Sahu @ 2022-01-24 18:17 UTC (permalink / raw)
  To: kvm, Alex Williamson, Cornelia Huck
  Cc: Max Gurtovoy, Yishai Hadas, Zhen Lei, Jason Gunthorpe,
	linux-kernel, Abhishek Sahu

If a PME event is generated by a PCI device, it will mostly be
handled on the host by the root port PME code. For example, in the
case of PCIe, the PME event will be sent to the root port and then
the PME interrupt will be generated. This is handled in
drivers/pci/pcie/pme.c on the host side, where
pci_check_pme_status() will be called and the PME_Status and PME_En
bits will be cleared. So, the guest OS which is using the vfio-pci
device will never come to know about this PME event.

To handle these PME events inside guests, we need some framework so
that any PME event can be forwarded to the virtual machine monitor.
For now, we can virtualize the PME related register bits and
initialize these bits to zero, so the vfio-pci device user will
assume that the device is not capable of asserting the PME# signal
from any power state.
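
The masking done by this patch can be illustrated in user space (a
sketch; the constant values below mirror the PCI PM capability layout
in include/uapi/linux/pci_regs.h, and clear_pme_bits() is an
illustrative stand-in for the vconfig update this patch adds):

```c
#include <assert.h>
#include <stdint.h>

/* PCI PM capability bits, as in include/uapi/linux/pci_regs.h */
#define PCI_PM_CAP_PME_MASK	0xF800	/* PMC PME_Support, bits 15:11 */
#define PCI_PM_CTRL_PME_ENABLE	0x0100	/* PMCSR PME_En, bit 8 */
#define PCI_PM_CTRL_PME_STATUS	0x8000	/* PMCSR PME_Status, bit 15 */

/* Report "no PME support" in the virtual config space */
static void clear_pme_bits(uint16_t *pmc, uint16_t *ctrl)
{
	*pmc &= (uint16_t)~PCI_PM_CAP_PME_MASK;
	*ctrl &= (uint16_t)~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
}
```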

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 33 +++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 6e58b4bf7a60..dd9ed211ba6f 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -738,12 +738,29 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
 	 */
 	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
 
+	/*
+	 * The guests can't process PME events. If any PME event will be
+	 * generated, then it will be mostly handled in the host and the
+	 * host will clear the PME_STATUS. So virtualize PME_Support bits.
+	 * The vconfig bits will be cleared during device capability
+	 * initialization.
+	 */
+	p_setw(perm, PCI_PM_PMC, PCI_PM_CAP_PME_MASK, NO_WRITE);
+
 	/*
 	 * Power management is defined *per function*, so we can let
 	 * the user change power state, but we trap and initiate the
 	 * change ourselves, so the state bits are read-only.
+	 *
+	 * The guest can't process PME from D3cold so virtualize PME_Status
+	 * and PME_En bits. The vconfig bits will be cleared during device
+	 * capability initialization.
 	 */
-	p_setd(perm, PCI_PM_CTRL, NO_VIRT, ~PCI_PM_CTRL_STATE_MASK);
+	p_setd(perm, PCI_PM_CTRL,
+	       PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
+	       ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
+		 PCI_PM_CTRL_STATE_MASK));
+
 	return 0;
 }
 
@@ -1412,6 +1429,17 @@ static int vfio_ext_cap_len(struct vfio_pci_core_device *vdev, u16 ecap, u16 epo
 	return 0;
 }
 
+static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
+					 int offset)
+{
+	__le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
+	__le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];
+
+	/* Clear vconfig PME_Support, PME_Status, and PME_En bits */
+	*pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
+	*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
+}
+
 static int vfio_fill_vconfig_bytes(struct vfio_pci_core_device *vdev,
 				   int offset, int size)
 {
@@ -1535,6 +1563,9 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
 		if (ret)
 			return ret;
 
+		if (cap == PCI_CAP_ID_PM)
+			vfio_update_pm_vconfig_bytes(vdev, pos);
+
 		prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
 		pos = next;
 		caps++;
-- 
2.17.1


* [RFC PATCH v2 3/5] vfio/pci: fix memory leak during D3hot to D0 transition
  2022-01-24 18:17 [RFC PATCH v2 0/5] vfio/pci: Enable runtime power management support Abhishek Sahu
  2022-01-24 18:17 ` [RFC PATCH v2 1/5] vfio/pci: register vfio-pci driver with runtime PM framework Abhishek Sahu
  2022-01-24 18:17 ` [RFC PATCH v2 2/5] vfio/pci: virtualize PME related register bits and initialize to zero Abhishek Sahu
@ 2022-01-24 18:17 ` Abhishek Sahu
  2022-01-28  0:05   ` Alex Williamson
  2022-01-24 18:17 ` [RFC PATCH v2 4/5] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
  2022-01-24 18:17 ` [RFC PATCH v2 5/5] vfio/pci: add the support for PCI D3cold state Abhishek Sahu
  4 siblings, 1 reply; 21+ messages in thread
From: Abhishek Sahu @ 2022-01-24 18:17 UTC (permalink / raw)
  To: kvm, Alex Williamson, Cornelia Huck
  Cc: Max Gurtovoy, Yishai Hadas, Zhen Lei, Jason Gunthorpe,
	linux-kernel, Abhishek Sahu

If needs_pm_restore is set (the PCI device does not support the
No_Soft_Reset feature), then the current PCI state will be saved
during the D0->D3hot transition and restored during the D3hot->D0
transition. For saving the PCI state locally,
pci_store_saved_state() is used, and pci_load_and_free_saved_state()
frees the allocated memory on restore.

But for the reset related IOCTLs, the vfio driver calls PCI reset
APIs which will internally change the PCI power state back to D0.
So, when the guest resumes, it will see the current state as D0 and
will skip the call to vfio_pci_set_power_state() for changing the
power state to D0 explicitly. In this case, the memory pointed to by
pm_save will never be freed.

Also, in a malicious sequence, changing the state to D3hot followed
by VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be run in a loop,
which can cause an OOM situation. This patch stores the power state
locally and uses it for comparing the current power state. For the
places where a D0 transition can happen, call
vfio_pci_set_power_state() to transition to the D0 state. Since the
vfio power state is still D3hot, this D0 transition will run the
logic required for the D3hot->D0 transition. Also, to prevent this
condition from being missed during future development, this patch
adds a check that frees the memory after printing a warning.

This locally saved power state will also help in subsequent
patches.
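
The leak pattern and its fix can be sketched in user space (an
analogy only; sim_store_saved_state() stands in for
pci_store_saved_state(), and live_allocs is just an allocation
counter for illustration):

```c
#include <stdlib.h>

static int live_allocs;	/* outstanding saved-state allocations */

/* Stand-in for pci_store_saved_state(): allocates a state blob */
static char *sim_store_saved_state(void)
{
	live_allocs++;
	return malloc(64);	/* stands in for the saved config space */
}

/* Buggy D3hot entry: unconditionally overwrites pm_save */
static void enter_d3hot_buggy(char **pm_save)
{
	*pm_save = sim_store_saved_state();
}

/* Fixed D3hot entry: free any stale saved state before overwriting */
static void enter_d3hot_fixed(char **pm_save)
{
	if (*pm_save) {
		free(*pm_save);
		live_allocs--;
	}
	*pm_save = sim_store_saved_state();
}
```

Two back-to-back D3hot entries (the reset having skipped the restore
path) leak one allocation in the buggy variant but not in the fixed
one.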

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 53 ++++++++++++++++++++++++++++++--
 include/linux/vfio_pci_core.h    |  1 +
 2 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index c6e4fe9088c3..ee2fb8af57fa 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -206,6 +206,14 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
  * restore when returned to D0.  Saved separately from pci_saved_state for use
  * by PM capability emulation and separately from pci_dev internal saved state
  * to avoid it being overwritten and consumed around other resets.
+ *
+ * There are few cases where the PCI power state can be changed to D0
+ * without the involvement of this API. So, cache the power state locally
+ * and call this API to update the D0 state. It will help in running the
+ * logic that is needed for transitioning to the D0 state. For example,
+ * if needs_pm_restore is set, then the PCI state will be saved locally.
+ * The memory taken for saving this PCI state needs to be freed to
+ * prevent memory leak.
  */
 int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
 {
@@ -214,20 +222,34 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	int ret;
 
 	if (vdev->needs_pm_restore) {
-		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
+		if (vdev->power_state < PCI_D3hot && state >= PCI_D3hot) {
 			pci_save_state(pdev);
 			needs_save = true;
 		}
 
-		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
+		if (vdev->power_state >= PCI_D3hot && state <= PCI_D0)
 			needs_restore = true;
 	}
 
 	ret = pci_set_power_state(pdev, state);
 
 	if (!ret) {
+		vdev->power_state = pdev->current_state;
+
 		/* D3 might be unsupported via quirk, skip unless in D3 */
-		if (needs_save && pdev->current_state >= PCI_D3hot) {
+		if (needs_save && vdev->power_state >= PCI_D3hot) {
+			/*
+			 * If somehow, the vfio driver was not able to free the
+			 * memory allocated in pm_save, then free the earlier
+			 * memory first before overwriting pm_save to prevent
+			 * memory leak.
+			 */
+			if (vdev->pm_save) {
+				pci_warn(pdev,
+					 "Overwriting saved PCI state pointer so freeing the earlier memory\n");
+				kfree(vdev->pm_save);
+			}
+
 			vdev->pm_save = pci_store_saved_state(pdev);
 		} else if (needs_restore) {
 			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
@@ -326,6 +348,14 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	/* For needs_reset */
 	lockdep_assert_held(&vdev->vdev.dev_set->lock);
 
+	/*
+	 * If disable has been called while the power state is other than D0,
+	 * then set the power state in vfio driver to D0. It will help
+	 * in running the logic needed for D0 power state. The subsequent
+	 * runtime PM API's will put the device into the low power state again.
+	 */
+	vfio_pci_set_power_state(vdev, PCI_D0);
+
 	/* Stop the device from further DMA */
 	pci_clear_master(pdev);
 
@@ -929,6 +959,15 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 
 		vfio_pci_zap_and_down_write_memory_lock(vdev);
 		ret = pci_try_reset_function(vdev->pdev);
+
+		/*
+		 * If pci_try_reset_function() has been called while the power
+		 * state is other than D0, then pci_try_reset_function() will
+		 * internally set the device state to D0 without vfio driver
+		 * interaction. Update the power state in vfio driver to perform
+		 * the logic needed for D0 power state.
+		 */
+		vfio_pci_set_power_state(vdev, PCI_D0);
 		up_write(&vdev->memory_lock);
 
 		return ret;
@@ -2071,6 +2110,14 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 
 err_undo:
 	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
+		/*
+		 * If pci_reset_bus() has been called while the power
+		 * state is other than D0, then pci_reset_bus() will
+		 * internally set the device state to D0 without vfio driver
+		 * interaction. Update the power state in vfio driver to perform
+		 * the logic needed for D0 power state.
+		 */
+		vfio_pci_set_power_state(cur, PCI_D0);
 		if (cur == cur_mem)
 			is_mem = false;
 		if (cur == cur_vma)
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index aafe09c9fa64..05db838e72cc 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -124,6 +124,7 @@ struct vfio_pci_core_device {
 	bool			needs_reset;
 	bool			nointx;
 	bool			needs_pm_restore;
+	pci_power_t		power_state;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
-- 
2.17.1


* [RFC PATCH v2 4/5] vfio/pci: Invalidate mmaps and block the access in D3hot power state
  2022-01-24 18:17 [RFC PATCH v2 0/5] vfio/pci: Enable runtime power management support Abhishek Sahu
                   ` (2 preceding siblings ...)
  2022-01-24 18:17 ` [RFC PATCH v2 3/5] vfio/pci: fix memory leak during D3hot to D0 transition Abhishek Sahu
@ 2022-01-24 18:17 ` Abhishek Sahu
  2022-01-25  2:35   ` kernel test robot
  2022-02-17 23:14   ` Alex Williamson
  2022-01-24 18:17 ` [RFC PATCH v2 5/5] vfio/pci: add the support for PCI D3cold state Abhishek Sahu
  4 siblings, 2 replies; 21+ messages in thread
From: Abhishek Sahu @ 2022-01-24 18:17 UTC (permalink / raw)
  To: kvm, Alex Williamson, Cornelia Huck
  Cc: Max Gurtovoy, Yishai Hadas, Zhen Lei, Jason Gunthorpe,
	linux-kernel, Abhishek Sahu

According to [PCIe v5 5.3.1.4.1], for the D3hot state:

 "Configuration and Message requests are the only TLPs accepted by a
  Function in the D3Hot state. All other received Requests must be
  handled as Unsupported Requests, and all received Completions may
  optionally be handled as Unexpected Completions."

Currently, if the vfio PCI device has been put into the D3hot state
and the user makes a non-config read/write request while in D3hot,
the request will be forwarded to the host, and this access may cause
issues on a few systems.

This patch leverages the memory-disable support added in commit
abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on
disabled memory") to generate a page fault on mmap access and to
return an error for direct read/write. If the device is in the D3hot
state, then an error needs to be returned for all kinds of BAR
related access (memory, I/O and ROM). Also, the power related
structure fields need to be protected, so the same 'memory_lock' is
used to protect these fields as well. In a few cases this
'memory_lock' will already be acquired by callers, so introduce a
separate function vfio_pci_set_power_state_locked(). The original
vfio_pci_set_power_state() now contains the locking related
operations.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 47 +++++++++++++++++++++++++-------
 drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++----
 2 files changed, 51 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index ee2fb8af57fa..38440d48973f 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -201,11 +201,12 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
 }
 
 /*
- * pci_set_power_state() wrapper handling devices which perform a soft reset on
- * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
- * restore when returned to D0.  Saved separately from pci_saved_state for use
- * by PM capability emulation and separately from pci_dev internal saved state
- * to avoid it being overwritten and consumed around other resets.
+ * vfio_pci_set_power_state_locked() wrapper handling devices which perform a
+ * soft reset on D3->D0 transition.  Save state prior to D0/1/2->D3, stash it
+ * on the vdev, restore when returned to D0.  Saved separately from
+ * pci_saved_state for use by PM capability emulation and separately from
+ * pci_dev internal saved state to avoid it being overwritten and consumed
+ * around other resets.
  *
  * There are few cases where the PCI power state can be changed to D0
  * without the involvement of this API. So, cache the power state locally
@@ -215,7 +216,8 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
  * The memory taken for saving this PCI state needs to be freed to
  * prevent memory leak.
  */
-int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
+static int vfio_pci_set_power_state_locked(struct vfio_pci_core_device *vdev,
+					   pci_power_t state)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	bool needs_restore = false, needs_save = false;
@@ -260,6 +262,26 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	return ret;
 }
 
+/*
+ * vfio_pci_set_power_state() takes all the required locks to protect
+ * the access of power related variables and then invokes
+ * vfio_pci_set_power_state_locked().
+ */
+int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev,
+			     pci_power_t state)
+{
+	int ret;
+
+	if (state >= PCI_D3hot)
+		vfio_pci_zap_and_down_write_memory_lock(vdev);
+	else
+		down_write(&vdev->memory_lock);
+
+	ret = vfio_pci_set_power_state_locked(vdev, state);
+	up_write(&vdev->memory_lock);
+	return ret;
+}
+
 int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -354,7 +376,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	 * in running the logic needed for D0 power state. The subsequent
 	 * runtime PM API's will put the device into the low power state again.
 	 */
-	vfio_pci_set_power_state(vdev, PCI_D0);
+	vfio_pci_set_power_state_locked(vdev, PCI_D0);
 
 	/* Stop the device from further DMA */
 	pci_clear_master(pdev);
@@ -967,7 +989,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		 * interaction. Update the power state in vfio driver to perform
 		 * the logic needed for D0 power state.
 		 */
-		vfio_pci_set_power_state(vdev, PCI_D0);
+		vfio_pci_set_power_state_locked(vdev, PCI_D0);
 		up_write(&vdev->memory_lock);
 
 		return ret;
@@ -1453,6 +1475,11 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 		goto up_out;
 	}
 
+	if (vdev->power_state >= PCI_D3hot) {
+		ret = VM_FAULT_SIGBUS;
+		goto up_out;
+	}
+
 	/*
 	 * We populate the whole vma on fault, so we need to test whether
 	 * the vma has already been mapped, such as for concurrent faults
@@ -1902,7 +1929,7 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 	 * be able to get to D3.  Therefore first do a D0 transition
 	 * before enabling runtime PM.
 	 */
-	vfio_pci_set_power_state(vdev, PCI_D0);
+	vfio_pci_set_power_state_locked(vdev, PCI_D0);
 	pm_runtime_allow(&pdev->dev);
 
 	if (!disable_idle_d3)
@@ -2117,7 +2144,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		 * interaction. Update the power state in vfio driver to perform
 		 * the logic needed for D0 power state.
 		 */
-		vfio_pci_set_power_state(cur, PCI_D0);
+		vfio_pci_set_power_state_locked(cur, PCI_D0);
 		if (cur == cur_mem)
 			is_mem = false;
 		if (cur == cur_vma)
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 57d3b2cbbd8e..e97ba14c4aa0 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -41,8 +41,13 @@
 static int vfio_pci_iowrite##size(struct vfio_pci_core_device *vdev,		\
 			bool test_mem, u##size val, void __iomem *io)	\
 {									\
+	down_read(&vdev->memory_lock);					\
+	if (vdev->power_state >= PCI_D3hot) {				\
+		up_read(&vdev->memory_lock);				\
+		return -EIO;						\
+	}								\
+									\
 	if (test_mem) {							\
-		down_read(&vdev->memory_lock);				\
 		if (!__vfio_pci_memory_enabled(vdev)) {			\
 			up_read(&vdev->memory_lock);			\
 			return -EIO;					\
@@ -51,8 +56,7 @@ static int vfio_pci_iowrite##size(struct vfio_pci_core_device *vdev,		\
 									\
 	vfio_iowrite##size(val, io);					\
 									\
-	if (test_mem)							\
-		up_read(&vdev->memory_lock);				\
+	up_read(&vdev->memory_lock);					\
 									\
 	return 0;							\
 }
@@ -68,8 +72,13 @@ VFIO_IOWRITE(64)
 static int vfio_pci_ioread##size(struct vfio_pci_core_device *vdev,		\
 			bool test_mem, u##size *val, void __iomem *io)	\
 {									\
+	down_read(&vdev->memory_lock);					\
+	if (vdev->power_state >= PCI_D3hot) {				\
+		up_read(&vdev->memory_lock);				\
+		return -EIO;						\
+	}								\
+									\
 	if (test_mem) {							\
-		down_read(&vdev->memory_lock);				\
 		if (!__vfio_pci_memory_enabled(vdev)) {			\
 			up_read(&vdev->memory_lock);			\
 			return -EIO;					\
@@ -78,8 +87,7 @@ static int vfio_pci_ioread##size(struct vfio_pci_core_device *vdev,		\
 									\
 	*val = vfio_ioread##size(io);					\
 									\
-	if (test_mem)							\
-		up_read(&vdev->memory_lock);				\
+	up_read(&vdev->memory_lock);					\
 									\
 	return 0;							\
 }
-- 
2.17.1
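
[Editor's note: the read/write gating pattern in the VFIO_IOWRITE/VFIO_IOREAD macros above — take the lock, bail out with -EIO if the device is in D3hot or lower, otherwise do the access under the lock — can be modeled in plain userspace C. This is an illustrative sketch only: an integer depth counter stands in for the kernel's memory_lock rwsem, and the power states are plain ints.]

```c
/* Values mirror the ordering of pci_power_t: D0 < D3hot. */
enum { PCI_D0 = 0, PCI_D3hot = 3 };

struct fake_vdev {
	int lock_depth;		/* models the memory_lock read hold */
	int power_state;
	unsigned int reg;	/* stand-in for a BAR register */
};

static int fake_iowrite(struct fake_vdev *v, unsigned int val)
{
	v->lock_depth++;			/* down_read(&memory_lock) */
	if (v->power_state >= PCI_D3hot) {
		v->lock_depth--;		/* up_read() before bailing out */
		return -1;			/* -EIO in the kernel version */
	}
	v->reg = val;				/* vfio_iowrite##size() */
	v->lock_depth--;			/* up_read() */
	return 0;
}
```

The key property the patch establishes is that the power-state check happens under the same lock as the access itself, so a concurrent D3hot transition cannot race past the check.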


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH v2 5/5] vfio/pci: add the support for PCI D3cold state
  2022-01-24 18:17 [RFC PATCH v2 0/5] vfio/pci: Enable runtime power management support Abhishek Sahu
                   ` (3 preceding siblings ...)
  2022-01-24 18:17 ` [RFC PATCH v2 4/5] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
@ 2022-01-24 18:17 ` Abhishek Sahu
  2022-03-09 17:26   ` Alex Williamson
  4 siblings, 1 reply; 21+ messages in thread
From: Abhishek Sahu @ 2022-01-24 18:17 UTC (permalink / raw)
  To: kvm, Alex Williamson, Cornelia Huck
  Cc: Max Gurtovoy, Yishai Hadas, Zhen Lei, Jason Gunthorpe,
	linux-kernel, Abhishek Sahu

Currently, if runtime power management is enabled for a vfio-pci
device in the guest OS, the guest OS writes the PCI_PM_CTRL register.
This write request is handled in vfio_pm_config_write(), which
performs the actual register write of PCI_PM_CTRL. With this, D3hot
is the lowest power state that can be reached. If we can use the
runtime PM framework instead, then we can reach the D3cold state,
which will help in saving maximum power.

1. Since the D3cold state can't be reached by writing the standard
   PCI PM config registers, this patch adds a new IOCTL which changes
   the PCI device state from D3hot to D3cold and from D3cold back to D0.

2. The hypervisors can implement virtual ACPI methods. For example,
   in a Linux guest, if the PCI device's ACPI node has _PR3 and _PR0
   power resources with _ON/_OFF methods, then the guest calls _OFF
   during the D3cold transition and _ON during the D0 transition. The
   hypervisor can trap these virtual ACPI calls and then issue the
   D3cold-related IOCTL to the vfio driver.

3. The vfio driver uses the runtime PM framework to reach the
   D3cold state: decrement the usage count for the D3cold transition
   and increment it for the D0 transition.

4. For D3cold, the device's current power state should be D3hot.
   Then, during runtime suspend, pci_platform_power_transition() is
   required to enter the D3cold state. If D3cold is not supported, the
   device will remain in D3hot, but with runtime PM the root port can
   now also go into the suspended state.

5. On most systems, D3cold is supported at the root port level.
   So, when the root port transitions to D3cold, the vfio PCI device
   will go from D3hot to D3cold during its runtime suspend. If the
   root port does not support D3cold, it will go into D3hot instead.

6. The runtime suspend callback can now be invoked in two cases:
   there is no user of the vfio device, or the user has initiated
   D3cold. The 'runtime_suspend_pending' flag distinguishes these
   cases.

7. There are cases where the guest has put the PCI device into
   D3cold and then, on the host side, the user runs lspci or any other
   command which accesses the PCI config registers. In this case, the
   kernel runtime PM framework resumes the PCI device internally,
   reads the config space, and puts the device into D3cold again. Some
   PCI devices need SW involvement before going into D3cold. For the
   first D3cold transition, the driver running in the guest performs
   the SW-side steps, but the second D3cold transition happens without
   guest driver involvement. So, prevent this second D3cold transition
   by incrementing the device usage count. This keeps the device
   unnecessarily in D0, but that is better than failure. In the
   future, we can add a mechanism to forward these wake-up requests to
   the guest so that this case can be handled as well.

8. In D3cold, all BAR-related access needs to be disabled, as in
   D3hot. Additionally, the config space is also inaccessible in the
   D3cold state. To prevent config space access in D3cold, increment
   the runtime PM usage count before doing any config space access.
   Also, since most of the IOCTLs access the config space, maintain a
   safe list and skip the resume only for those safe IOCTLs; for all
   other IOCTLs, increment the runtime PM usage count first.

9. The runtime suspend/resume callbacks need the vdev reference,
   which can be obtained with dev_get_drvdata(). Currently,
   dev_set_drvdata() is called after vfio_pci_core_register_device()
   returns, but the runtime callbacks can fire any time after runtime
   PM is enabled, so dev_set_drvdata() must happen before that. Move
   dev_set_drvdata() inside vfio_pci_core_register_device() itself.

10. The vfio device user can close the device while it is in the
    runtime suspended state, so increment the runtime PM usage count
    inside vfio_pci_core_disable().

11. Runtime PM is possible only if CONFIG_PM is enabled on the
    host, so the IOCTL-related code is placed under CONFIG_PM.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci.c        |   1 -
 drivers/vfio/pci/vfio_pci_config.c |  11 +-
 drivers/vfio/pci/vfio_pci_core.c   | 186 +++++++++++++++++++++++++++--
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  21 ++++
 5 files changed, 211 insertions(+), 9 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index c8695baf3b54..4ac3338c8fc7 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -153,7 +153,6 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	ret = vfio_pci_core_register_device(vdev);
 	if (ret)
 		goto out_free;
-	dev_set_drvdata(&pdev->dev, vdev);
 	return 0;
 
 out_free:
diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index dd9ed211ba6f..d20420657959 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -25,6 +25,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/slab.h>
+#include <linux/pm_runtime.h>
 
 #include <linux/vfio_pci_core.h>
 
@@ -1919,16 +1920,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
 ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite)
 {
+	struct device *dev = &vdev->pdev->dev;
 	size_t done = 0;
 	int ret = 0;
 	loff_t pos = *ppos;
 
 	pos &= VFIO_PCI_OFFSET_MASK;
 
+	ret = pm_runtime_resume_and_get(dev);
+	if (ret < 0)
+		return ret;
+
 	while (count) {
 		ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
-		if (ret < 0)
+		if (ret < 0) {
+			pm_runtime_put(dev);
 			return ret;
+		}
 
 		count -= ret;
 		done += ret;
@@ -1936,6 +1944,7 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 		pos += ret;
 	}
 
+	pm_runtime_put(dev);
 	*ppos += done;
 
 	return done;
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 38440d48973f..b70bb4fd940d 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -371,12 +371,23 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	lockdep_assert_held(&vdev->vdev.dev_set->lock);
 
 	/*
-	 * If disable has been called while the power state is other than D0,
-	 * then set the power state in vfio driver to D0. It will help
-	 * in running the logic needed for D0 power state. The subsequent
-	 * runtime PM API's will put the device into the low power state again.
+	 * The vfio device user can close the device after putting the device
+	 * into the runtime suspended state, so wake up the device first in
+	 * this case.
 	 */
-	vfio_pci_set_power_state_locked(vdev, PCI_D0);
+	if (vdev->runtime_suspend_pending) {
+		vdev->runtime_suspend_pending = false;
+		pm_runtime_resume_and_get(&pdev->dev);
+	} else {
+		/*
+		 * If disable has been called while the power state is other
+		 * than D0, then set the power state in vfio driver to D0. It
+		 * will help in running the logic needed for D0 power state.
+		 * The subsequent runtime PM API's will put the device into
+		 * the low power state again.
+		 */
+		vfio_pci_set_power_state_locked(vdev, PCI_D0);
+	}
 
 	/* Stop the device from further DMA */
 	pci_clear_master(pdev);
@@ -693,8 +704,8 @@ int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_register_dev_region);
 
-long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
-		unsigned long arg)
+static long vfio_pci_core_ioctl_internal(struct vfio_device *core_vdev,
+					 unsigned int cmd, unsigned long arg)
 {
 	struct vfio_pci_core_device *vdev =
 		container_of(core_vdev, struct vfio_pci_core_device, vdev);
@@ -1241,10 +1252,119 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		default:
 			return -ENOTTY;
 		}
+#ifdef CONFIG_PM
+	} else if (cmd == VFIO_DEVICE_POWER_MANAGEMENT) {
+		struct vfio_power_management vfio_pm;
+		struct pci_dev *pdev = vdev->pdev;
+		bool request_idle = false, request_resume = false;
+		int ret = 0;
+
+		if (copy_from_user(&vfio_pm, (void __user *)arg, sizeof(vfio_pm)))
+			return -EFAULT;
+
+		/*
+		 * The vdev power related fields are protected with memory_lock
+		 * semaphore.
+		 */
+		down_write(&vdev->memory_lock);
+		switch (vfio_pm.d3cold_state) {
+		case VFIO_DEVICE_D3COLD_STATE_ENTER:
+			/*
+			 * For D3cold, the device should already be in
+			 * the D3hot state.
+			 */
+			if (vdev->power_state < PCI_D3hot) {
+				ret = -EINVAL;
+				break;
+			}
+
+			if (!vdev->runtime_suspend_pending) {
+				vdev->runtime_suspend_pending = true;
+				pm_runtime_put_noidle(&pdev->dev);
+				request_idle = true;
+			}
+
+			break;
+
+		case VFIO_DEVICE_D3COLD_STATE_EXIT:
+			/*
+			 * If the runtime resume has already been run, then
+			 * the device will be already in D0 state.
+			 */
+			if (vdev->runtime_suspend_pending) {
+				vdev->runtime_suspend_pending = false;
+				pm_runtime_get_noresume(&pdev->dev);
+				request_resume = true;
+			}
+
+			break;
+
+		default:
+			ret = -EINVAL;
+			break;
+		}
+
+		up_write(&vdev->memory_lock);
+
+		/*
+		 * Call the runtime PM APIs without any lock held. Inside the
+		 * vfio driver runtime suspend/resume, the locks can be acquired again.
+		 */
+		if (request_idle)
+			pm_request_idle(&pdev->dev);
+
+		if (request_resume)
+			pm_runtime_resume(&pdev->dev);
+
+		return ret;
+#endif
 	}
 
 	return -ENOTTY;
 }
+
+long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
+			 unsigned long arg)
+{
+#ifdef CONFIG_PM
+	struct vfio_pci_core_device *vdev =
+		container_of(core_vdev, struct vfio_pci_core_device, vdev);
+	struct device *dev = &vdev->pdev->dev;
+	bool skip_runtime_resume = false;
+	long ret;
+
+	/*
+	 * The list of commands which are safe to execute when the PCI device
+	 * is in D3cold state. In D3cold state, the PCI config or any other IO
+	 * access won't work.
+	 */
+	switch (cmd) {
+	case VFIO_DEVICE_POWER_MANAGEMENT:
+	case VFIO_DEVICE_GET_INFO:
+	case VFIO_DEVICE_FEATURE:
+		skip_runtime_resume = true;
+		break;
+
+	default:
+		break;
+	}
+
+	if (!skip_runtime_resume) {
+		ret = pm_runtime_resume_and_get(dev);
+		if (ret < 0)
+			return ret;
+	}
+
+	ret = vfio_pci_core_ioctl_internal(core_vdev, cmd, arg);
+
+	if (!skip_runtime_resume)
+		pm_runtime_put(dev);
+
+	return ret;
+#else
+	return vfio_pci_core_ioctl_internal(core_vdev, cmd, arg);
+#endif
+}
 EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
 
 static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
@@ -1897,6 +2017,7 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 		return -EBUSY;
 	}
 
+	dev_set_drvdata(&pdev->dev, vdev);
 	if (pci_is_root_bus(pdev->bus)) {
 		ret = vfio_assign_device_set(&vdev->vdev, vdev);
 	} else if (!pci_probe_reset_slot(pdev->slot)) {
@@ -1966,6 +2087,7 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
 		pm_runtime_get_noresume(&pdev->dev);
 
 	pm_runtime_forbid(&pdev->dev);
+	dev_set_drvdata(&pdev->dev, NULL);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
 
@@ -2219,11 +2341,61 @@ static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
 #ifdef CONFIG_PM
 static int vfio_pci_core_runtime_suspend(struct device *dev)
 {
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
+
+	down_read(&vdev->memory_lock);
+
+	/*
+	 * runtime_suspend_pending won't be set if there is no user of vfio pci
+	 * device. In that case, return early and PCI core will take care of
+	 * putting the device in the low power state.
+	 */
+	if (!vdev->runtime_suspend_pending) {
+		up_read(&vdev->memory_lock);
+		return 0;
+	}
+
+	/*
+	 * The runtime suspend will be called only if the device is already
+	 * in D3hot state. Now, change the device state from D3hot to D3cold by
+	 * using platform power management. If setting of D3cold is not
+	 * supported for the PCI device, then the device state will still be
+	 * in D3hot state. The PCI core expects to save the PCI state, if
+	 * driver runtime routine handles the power state management.
+	 */
+	pci_save_state(pdev);
+	pci_platform_power_transition(pdev, PCI_D3cold);
+	up_read(&vdev->memory_lock);
+
 	return 0;
 }
 
 static int vfio_pci_core_runtime_resume(struct device *dev)
 {
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
+
+	down_write(&vdev->memory_lock);
+
+	/*
+	 * The PCI core will move the device to D0 state before calling the
+	 * driver runtime resume.
+	 */
+	vfio_pci_set_power_state_locked(vdev, PCI_D0);
+
+	/*
+	 * Some PCI devices need SW involvement before going to D3cold
+	 * state again. So if there is any wake-up which is not triggered
+	 * by the guest, then increase the usage count to prevent a
+	 * second runtime suspend.
+	 */
+	if (vdev->runtime_suspend_pending) {
+		vdev->runtime_suspend_pending = false;
+		pm_runtime_get_noresume(&pdev->dev);
+	}
+
+	up_write(&vdev->memory_lock);
 	return 0;
 }
 
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 05db838e72cc..8bbfd028115a 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -124,6 +124,7 @@ struct vfio_pci_core_device {
 	bool			needs_reset;
 	bool			nointx;
 	bool			needs_pm_restore;
+	bool			runtime_suspend_pending;
 	pci_power_t		power_state;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index ef33ea002b0b..7b7dadc6df71 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -1002,6 +1002,27 @@ struct vfio_device_feature {
  */
 #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN	(0)
 
+/**
+ * VFIO_DEVICE_POWER_MANAGEMENT - _IOW(VFIO_TYPE, VFIO_BASE + 18,
+ *			       struct vfio_power_management)
+ *
+ * Provide support for device power management.  The native PCI power
+ * management does not support the D3cold power state.  To move the device
+ * into D3cold state, change the PCI state to D3hot with the standard
+ * configuration registers and then call this IOCTL to set the D3cold
+ * state.  Similarly, if the device is in D3cold state, call this IOCTL
+ * to exit from D3cold state.
+ *
+ * Return 0 on success, -errno on failure.
+ */
+#define VFIO_DEVICE_POWER_MANAGEMENT		_IO(VFIO_TYPE, VFIO_BASE + 18)
+struct vfio_power_management {
+	__u32	argsz;
+#define VFIO_DEVICE_D3COLD_STATE_EXIT		0x0
+#define VFIO_DEVICE_D3COLD_STATE_ENTER		0x1
+	__u32	d3cold_state;
+};
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.17.1
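
[Editor's note: the safe-IOCTL filtering performed by the vfio_pci_core_ioctl()
wrapper above can be exercised in isolation. Illustrative sketch only — the
command values below are arbitrary stand-ins, not the real VFIO ioctl numbers
from <linux/vfio.h>:]

```c
#include <stdbool.h>

/* Stand-in command codes; the real values come from <linux/vfio.h>. */
enum fake_cmd {
	FAKE_DEVICE_GET_INFO,
	FAKE_DEVICE_FEATURE,
	FAKE_DEVICE_POWER_MANAGEMENT,
	FAKE_DEVICE_RESET,
	FAKE_DEVICE_GET_REGION_INFO,
};

/*
 * Returns true when the wrapper must runtime-resume the device before
 * dispatching the command.  Only commands that touch neither config
 * space nor a BAR are safe to run while the device sits in D3cold.
 */
static bool needs_runtime_resume(enum fake_cmd cmd)
{
	switch (cmd) {
	case FAKE_DEVICE_POWER_MANAGEMENT:
	case FAKE_DEVICE_GET_INFO:
	case FAKE_DEVICE_FEATURE:
		return false;
	default:
		return true;
	}
}
```

Keeping the safe list short and explicit (default: resume) errs on the side of
waking the device for any command whose access pattern is not known.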


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 4/5] vfio/pci: Invalidate mmaps and block the access in D3hot power state
  2022-01-24 18:17 ` [RFC PATCH v2 4/5] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
@ 2022-01-25  2:35   ` kernel test robot
  2022-02-17 23:14   ` Alex Williamson
  1 sibling, 0 replies; 21+ messages in thread
From: kernel test robot @ 2022-01-25  2:35 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 3398 bytes --]

Hi Abhishek,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on awilliam-vfio/next]
[also build test WARNING on v5.17-rc1 next-20220124]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Abhishek-Sahu/vfio-pci-Enable-runtime-power-management-support/20220125-021827
base:   https://github.com/awilliam/linux-vfio.git next
config: x86_64-randconfig-s022-20220124 (https://download.01.org/0day-ci/archive/20220125/202201251024.DoqLcJkP-lkp@intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/0day-ci/linux/commit/84afeb78e3ae55e0f1c1f84d42e28c60585fdc48
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Abhishek-Sahu/vfio-pci-Enable-runtime-power-management-support/20220125-021827
        git checkout 84afeb78e3ae55e0f1c1f84d42e28c60585fdc48
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/vfio/pci/ mm/kasan/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> drivers/vfio/pci/vfio_pci_rdwr.c:64:1: sparse: sparse: restricted pci_power_t degrades to integer
>> drivers/vfio/pci/vfio_pci_rdwr.c:64:1: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_rdwr.c:65:1: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_rdwr.c:65:1: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_rdwr.c:66:1: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_rdwr.c:66:1: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_rdwr.c:95:1: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_rdwr.c:95:1: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_rdwr.c:96:1: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_rdwr.c:96:1: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_rdwr.c:97:1: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_rdwr.c:97:1: sparse: sparse: restricted pci_power_t degrades to integer

vim +64 drivers/vfio/pci/vfio_pci_rdwr.c

bc93b9ae0151ae Alex Williamson 2020-08-17  63  
bc93b9ae0151ae Alex Williamson 2020-08-17 @64  VFIO_IOWRITE(8)
bc93b9ae0151ae Alex Williamson 2020-08-17  65  VFIO_IOWRITE(16)
bc93b9ae0151ae Alex Williamson 2020-08-17  66  VFIO_IOWRITE(32)
bc93b9ae0151ae Alex Williamson 2020-08-17  67  #ifdef iowrite64
bc93b9ae0151ae Alex Williamson 2020-08-17  68  VFIO_IOWRITE(64)
bc93b9ae0151ae Alex Williamson 2020-08-17  69  #endif
bc93b9ae0151ae Alex Williamson 2020-08-17  70  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 3/5] vfio/pci: fix memory leak during D3hot to D0 transition
  2022-01-24 18:17 ` [RFC PATCH v2 3/5] vfio/pci: fix memory leak during D3hot to D0 transition Abhishek Sahu
@ 2022-01-28  0:05   ` Alex Williamson
  2022-01-31 11:34     ` Abhishek Sahu
  0 siblings, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2022-01-28  0:05 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: kvm, Cornelia Huck, Max Gurtovoy, Yishai Hadas, Zhen Lei,
	Jason Gunthorpe, linux-kernel

On Mon, 24 Jan 2022 23:47:24 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> If needs_pm_restore is set (PCI device does not have support for no
> soft reset), then the current PCI state will be saved during D0->D3hot
> transition and same will be restored back during D3hot->D0 transition.
> For saving the PCI state locally, pci_store_saved_state() is being
> used and the pci_load_and_free_saved_state() will free the allocated
> memory.
> 
> But for reset related IOCTLs, vfio driver calls PCI reset related
> API's which will internally change the PCI power state back to D0. So,
> when the guest resumes, then it will get the current state as D0 and it
> will skip the call to vfio_pci_set_power_state() for changing the
> power state to D0 explicitly. In this case, the memory pointed by
> pm_save will never be freed.
> 
> Also, in malicious sequence, the state changing to D3hot followed by
> VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be run in loop and
> it can cause an OOM situation. This patch stores the power state locally
> and uses the same for comparing the current power state. For the
> places where D0 transition can happen, call vfio_pci_set_power_state()
> to transition to D0 state. Since the vfio power state is still D3hot,
> so this D0 transition will help in running the logic required
> from D3hot->D0 transition. Also, to prevent any miss during
> future development to detect this condition, this patch puts a
> check and frees the memory after printing warning.
> 
> This locally saved power state will help in subsequent patches
> also.

Ideally let's put fixes patches at the start of the series, or better
yet send them separately, and don't include changes that only make
sense in the context of a subsequent patch.

Fixes: 51ef3a004b1e ("vfio/pci: Restore device state on PM transition")

> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 53 ++++++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h    |  1 +
>  2 files changed, 51 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index c6e4fe9088c3..ee2fb8af57fa 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -206,6 +206,14 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
>   * restore when returned to D0.  Saved separately from pci_saved_state for use
>   * by PM capability emulation and separately from pci_dev internal saved state
>   * to avoid it being overwritten and consumed around other resets.
> + *
> + * There are few cases where the PCI power state can be changed to D0
> + * without the involvement of this API. So, cache the power state locally
> + * and call this API to update the D0 state. It will help in running the
> + * logic that is needed for transitioning to the D0 state. For example,
> + * if needs_pm_restore is set, then the PCI state will be saved locally.
> + * The memory taken for saving this PCI state needs to be freed to
> + * prevent memory leak.
>   */
>  int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
>  {
> @@ -214,20 +222,34 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	int ret;
>  
>  	if (vdev->needs_pm_restore) {
> -		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
> +		if (vdev->power_state < PCI_D3hot && state >= PCI_D3hot) {
>  			pci_save_state(pdev);
>  			needs_save = true;
>  		}
>  
> -		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
> +		if (vdev->power_state >= PCI_D3hot && state <= PCI_D0)
>  			needs_restore = true;
>  	}
>  
>  	ret = pci_set_power_state(pdev, state);
>  
>  	if (!ret) {
> +		vdev->power_state = pdev->current_state;
> +
>  		/* D3 might be unsupported via quirk, skip unless in D3 */
> -		if (needs_save && pdev->current_state >= PCI_D3hot) {
> +		if (needs_save && vdev->power_state >= PCI_D3hot) {
> +			/*
> +			 * If somehow, the vfio driver was not able to free the
> +			 * memory allocated in pm_save, then free the earlier
> +			 * memory first before overwriting pm_save to prevent
> +			 * memory leak.
> +			 */
> +			if (vdev->pm_save) {
> +				pci_warn(pdev,
> +					 "Overwriting saved PCI state pointer so freeing the earlier memory\n");
> +				kfree(vdev->pm_save);
> +			}

The minimal fix for the described issue would simply be:

			kfree(vdev->pm_save);

It seems like the only purpose of the warning is to try to make sure
we haven't missed any wake-up calls, where this would be a pretty small
breadcrumb to actually debug such an issue.

> +
>  			vdev->pm_save = pci_store_saved_state(pdev);
>  		} else if (needs_restore) {
>  			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
> @@ -326,6 +348,14 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	/* For needs_reset */
>  	lockdep_assert_held(&vdev->vdev.dev_set->lock);
>  
> +	/*
> +	 * If disable has been called while the power state is other than D0,
> +	 * then set the power state in vfio driver to D0. It will help
> +	 * in running the logic needed for D0 power state. The subsequent
> +	 * runtime PM API's will put the device into the low power state again.
> +	 */
> +	vfio_pci_set_power_state(vdev, PCI_D0);
> +

I do think we have an issue here, but the reason is that pci_pm_reset()
returns -EINVAL if we try to reset a device that isn't currently in D0.
Therefore any path where we're triggering a function reset that could
use a PM reset and we don't know if the device is in D0, should wake up
the device before we try that reset.

We're about to load the initial state of the device that was saved when
it was opened, so I don't think pdev->current_state vs
vdev->power_state matters here, we only care that the device is in D0.

>  	/* Stop the device from further DMA */
>  	pci_clear_master(pdev);
>  
> @@ -929,6 +959,15 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  
>  		vfio_pci_zap_and_down_write_memory_lock(vdev);
>  		ret = pci_try_reset_function(vdev->pdev);
> +
> +		/*
> +		 * If pci_try_reset_function() has been called while the power
> +		 * state is other than D0, then pci_try_reset_function() will
> +		 * internally set the device state to D0 without vfio driver
> +		 * interaction. Update the power state in vfio driver to perform
> +		 * the logic needed for D0 power state.
> +		 */
> +		vfio_pci_set_power_state(vdev, PCI_D0);

For the case where pci_try_reset_function() might use a PM reset, we
should set D0 before that call.  In doing so, the pdev->current_state
should match the actual device power state, so we still don't need to
stash power state on the vdev.

>  		up_write(&vdev->memory_lock);
>  
>  		return ret;
> @@ -2071,6 +2110,14 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  
>  err_undo:
>  	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
> +		/*
> +		 * If pci_reset_bus() has been called while the power
> +		 * state is other than D0, then pci_reset_bus() will
> +		 * internally set the device state to D0 without vfio driver
> +		 * interaction. Update the power state in vfio driver to perform
> +		 * the logic needed for D0 power state.
> +		 */
> +		vfio_pci_set_power_state(cur, PCI_D0);

Here pci_reset_bus() will wakeup the device and I think the concern is
that around that bus reset we'll save and restore the device state, but
that's potentially bogus device state if waking it triggers a soft
reset.  We could again wake devices before the reset to make the state
correct, or we could test pm_save and perform the load and restore if
it exists.  Either of those would avoid needing to cache the power
state on the vdev.  Thanks,

Alex

>  		if (cur == cur_mem)
>  			is_mem = false;
>  		if (cur == cur_vma)
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index aafe09c9fa64..05db838e72cc 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -124,6 +124,7 @@ struct vfio_pci_core_device {
>  	bool			needs_reset;
>  	bool			nointx;
>  	bool			needs_pm_restore;
> +	pci_power_t		power_state;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 3/5] vfio/pci: fix memory leak during D3hot to D0 transition
  2022-01-28  0:05   ` Alex Williamson
@ 2022-01-31 11:34     ` Abhishek Sahu
  2022-01-31 15:33       ` Alex Williamson
  0 siblings, 1 reply; 21+ messages in thread
From: Abhishek Sahu @ 2022-01-31 11:34 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Cornelia Huck, Max Gurtovoy, Yishai Hadas, Zhen Lei,
	Jason Gunthorpe, linux-kernel

On 1/28/2022 5:35 AM, Alex Williamson wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Mon, 24 Jan 2022 23:47:24 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> If needs_pm_restore is set (PCI device does not have support for no
>> soft reset), then the current PCI state will be saved during D0->D3hot
>> transition and same will be restored back during D3hot->D0 transition.
>> For saving the PCI state locally, pci_store_saved_state() is being
>> used and the pci_load_and_free_saved_state() will free the allocated
>> memory.
>>
>> But for reset related IOCTLs, vfio driver calls PCI reset related
>> API's which will internally change the PCI power state back to D0. So,
>> when the guest resumes, then it will get the current state as D0 and it
>> will skip the call to vfio_pci_set_power_state() for changing the
>> power state to D0 explicitly. In this case, the memory pointed by
>> pm_save will never be freed.
>>
>> Also, in malicious sequence, the state changing to D3hot followed by
>> VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be run in loop and
>> it can cause an OOM situation. This patch stores the power state locally
>> and uses the same for comparing the current power state. For the
>> places where D0 transition can happen, call vfio_pci_set_power_state()
>> to transition to D0 state. Since the vfio power state is still D3hot,
>> so this D0 transition will help in running the logic required
>> from D3hot->D0 transition. Also, to prevent any miss during
>> future development to detect this condition, this patch puts a
>> check and frees the memory after printing warning.
>>
>> This locally saved power state will help in subsequent patches
>> also.
> 
> Ideally let's put fixes patches at the start of the series, or better
> yet send them separately, and don't include changes that only make
> sense in the context of a subsequent patch.
> 
> Fixes: 51ef3a004b1e ("vfio/pci: Restore device state on PM transition")
> 

 Thanks, Alex, for reviewing this patch.
 I have added the Fixes tag and sent this patch separately.

 Should I update this patch series now, or are you planning to review
 the other patches in the series first?

>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_core.c | 53 ++++++++++++++++++++++++++++++--
>>  include/linux/vfio_pci_core.h    |  1 +
>>  2 files changed, 51 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index c6e4fe9088c3..ee2fb8af57fa 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -206,6 +206,14 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
>>   * restore when returned to D0.  Saved separately from pci_saved_state for use
>>   * by PM capability emulation and separately from pci_dev internal saved state
>>   * to avoid it being overwritten and consumed around other resets.
>> + *
>> + * There are few cases where the PCI power state can be changed to D0
>> + * without the involvement of this API. So, cache the power state locally
>> + * and call this API to update the D0 state. It will help in running the
>> + * logic that is needed for transitioning to the D0 state. For example,
>> + * if needs_pm_restore is set, then the PCI state will be saved locally.
>> + * The memory taken for saving this PCI state needs to be freed to
>> + * prevent memory leak.
>>   */
>>  int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
>>  {
>> @@ -214,20 +222,34 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>       int ret;
>>
>>       if (vdev->needs_pm_restore) {
>> -             if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
>> +             if (vdev->power_state < PCI_D3hot && state >= PCI_D3hot) {
>>                       pci_save_state(pdev);
>>                       needs_save = true;
>>               }
>>
>> -             if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
>> +             if (vdev->power_state >= PCI_D3hot && state <= PCI_D0)
>>                       needs_restore = true;
>>       }
>>
>>       ret = pci_set_power_state(pdev, state);
>>
>>       if (!ret) {
>> +             vdev->power_state = pdev->current_state;
>> +
>>               /* D3 might be unsupported via quirk, skip unless in D3 */
>> -             if (needs_save && pdev->current_state >= PCI_D3hot) {
>> +             if (needs_save && vdev->power_state >= PCI_D3hot) {
>> +                     /*
>> +                      * If somehow, the vfio driver was not able to free the
>> +                      * memory allocated in pm_save, then free the earlier
>> +                      * memory first before overwriting pm_save to prevent
>> +                      * memory leak.
>> +                      */
>> +                     if (vdev->pm_save) {
>> +                             pci_warn(pdev,
>> +                                      "Overwriting saved PCI state pointer so freeing the earlier memory\n");
>> +                             kfree(vdev->pm_save);
>> +                     }
> 
> The minimal fix for the described issue would simply be:
> 
>                         kfree(vdev->pm_save);
> 
> It seems like the only purpose of the warning is try to make sure we
> haven't missed any wake-up calls, where this would be a pretty small
> breadcrumb to actually debug such an issue.
> 

 I have removed the warning in the updated patch.

>> +
>>                       vdev->pm_save = pci_store_saved_state(pdev);
>>               } else if (needs_restore) {
>>                       pci_load_and_free_saved_state(pdev, &vdev->pm_save);
>> @@ -326,6 +348,14 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>       /* For needs_reset */
>>       lockdep_assert_held(&vdev->vdev.dev_set->lock);
>>
>> +     /*
>> +      * If disable has been called while the power state is other than D0,
>> +      * then set the power state in vfio driver to D0. It will help
>> +      * in running the logic needed for D0 power state. The subsequent
>> +      * runtime PM API's will put the device into the low power state again.
>> +      */
>> +     vfio_pci_set_power_state(vdev, PCI_D0);
>> +
> 
> I do think we have an issue here, but the reason is that pci_pm_reset()
> returns -EINVAL if we try to reset a device that isn't currently in D0.
> Therefore any path where we're triggering a function reset that could
> use a PM reset and we don't know if the device is in D0, should wake up
> the device before we try that reset.
> 
> We're about to load the initial state of the device that was saved when
> it was opened, so I don't think pdev->current_state vs
> vdev->power_state matters here, we only care that the device is in D0.
> 

 I have added this point in the commit message of the updated patch.

>>       /* Stop the device from further DMA */
>>       pci_clear_master(pdev);
>>
>> @@ -929,6 +959,15 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>
>>               vfio_pci_zap_and_down_write_memory_lock(vdev);
>>               ret = pci_try_reset_function(vdev->pdev);
>> +
>> +             /*
>> +              * If pci_try_reset_function() has been called while the power
>> +              * state is other than D0, then pci_try_reset_function() will
>> +              * internally set the device state to D0 without vfio driver
>> +              * interaction. Update the power state in vfio driver to perform
>> +              * the logic needed for D0 power state.
>> +              */
>> +             vfio_pci_set_power_state(vdev, PCI_D0);
> 
> For the case where pci_try_reset_function() might use a PM reset, we
> should set D0 before that call.  In doing so, the pdev->current_state
> should match the actual device power state, so we still don't need to
> stash power state on the vdev.
> 

 I have set D0 state before calling pci_try_reset_function() in
 the updated patch.

>>               up_write(&vdev->memory_lock);
>>
>>               return ret;
>> @@ -2071,6 +2110,14 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>
>>  err_undo:
>>       list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
>> +             /*
>> +              * If pci_reset_bus() has been called while the power
>> +              * state is other than D0, then pci_reset_bus() will
>> +              * internally set the device state to D0 without vfio driver
>> +              * interaction. Update the power state in vfio driver to perform
>> +              * the logic needed for D0 power state.
>> +              */
>> +             vfio_pci_set_power_state(cur, PCI_D0);
> 
> Here pci_reset_bus() will wakeup the device and I think the concern is
> that around that bus reset we'll save and restore the device state, but
> that's potentially bogus device state if waking it triggers a soft
> reset.  We could again wake devices before the reset to make the state
> correct, or we could test pm_save and perform the load and restore if
> it exists.  Either of those would avoid needing to cache the power
> state on the vdev.  Thanks,
> 

 I have made the changes to wake-up the devices.

 Thanks
 Abhishek

> Alex
> 
>>               if (cur == cur_mem)
>>                       is_mem = false;
>>               if (cur == cur_vma)
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index aafe09c9fa64..05db838e72cc 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -124,6 +124,7 @@ struct vfio_pci_core_device {
>>       bool                    needs_reset;
>>       bool                    nointx;
>>       bool                    needs_pm_restore;
>> +     pci_power_t             power_state;
>>       struct pci_saved_state  *pci_saved_state;
>>       struct pci_saved_state  *pm_save;
>>       int                     ioeventfds_nr;
> 



* Re: [RFC PATCH v2 3/5] vfio/pci: fix memory leak during D3hot to D0 transition
  2022-01-31 11:34     ` Abhishek Sahu
@ 2022-01-31 15:33       ` Alex Williamson
  0 siblings, 0 replies; 21+ messages in thread
From: Alex Williamson @ 2022-01-31 15:33 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: kvm, Cornelia Huck, Max Gurtovoy, Yishai Hadas, Zhen Lei,
	Jason Gunthorpe, linux-kernel

On Mon, 31 Jan 2022 17:04:12 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 1/28/2022 5:35 AM, Alex Williamson wrote:
> > On Mon, 24 Jan 2022 23:47:24 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> If needs_pm_restore is set (PCI device does not have support for no
> >> soft reset), then the current PCI state will be saved during D0->D3hot
> >> transition and same will be restored back during D3hot->D0 transition.
> >> For saving the PCI state locally, pci_store_saved_state() is being
> >> used and the pci_load_and_free_saved_state() will free the allocated
> >> memory.
> >>
> >> But for reset related IOCTLs, vfio driver calls PCI reset related
> >> API's which will internally change the PCI power state back to D0. So,
> >> when the guest resumes, then it will get the current state as D0 and it
> >> will skip the call to vfio_pci_set_power_state() for changing the
> >> power state to D0 explicitly. In this case, the memory pointed by
> >> pm_save will never be freed.
> >>
> >> Also, in malicious sequence, the state changing to D3hot followed by
> >> VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be run in loop and
> >> it can cause an OOM situation. This patch stores the power state locally
> >> and uses the same for comparing the current power state. For the
> >> places where D0 transition can happen, call vfio_pci_set_power_state()
> >> to transition to D0 state. Since the vfio power state is still D3hot,
> >> so this D0 transition will help in running the logic required
> >> from D3hot->D0 transition. Also, to prevent any miss during
> >> future development to detect this condition, this patch puts a
> >> check and frees the memory after printing warning.
> >>
> >> This locally saved power state will help in subsequent patches
> >> also.  
> > 
> > Ideally let's put fixes patches at the start of the series, or better
> > yet send them separately, and don't include changes that only make
> > sense in the context of a subsequent patch.
> > 
> > Fixes: 51ef3a004b1e ("vfio/pci: Restore device state on PM transition")
> >   
> 
>  Thanks, Alex, for reviewing this patch.
>  I have added the Fixes tag and sent this patch separately.
> 
>  Should I update this patch series now, or are you planning to review
>  the other patches in the series first?

Thanks for splitting this out.  I'll keep the remainder of the series
on the review queue, I expect I'll have some comments and it will be
easy enough to imagine vfio_pci_core_device.power_state being declared
in another patch if there's still a worthwhile use for it.  Thanks,

Alex



* Re: [RFC PATCH v2 1/5] vfio/pci: register vfio-pci driver with runtime PM framework
  2022-01-24 18:17 ` [RFC PATCH v2 1/5] vfio/pci: register vfio-pci driver with runtime PM framework Abhishek Sahu
@ 2022-02-16 23:48   ` Alex Williamson
  2022-02-21  6:35     ` Abhishek Sahu
  0 siblings, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2022-02-16 23:48 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: kvm, Cornelia Huck, Max Gurtovoy, Yishai Hadas, Zhen Lei,
	Jason Gunthorpe, linux-kernel

On Mon, 24 Jan 2022 23:47:22 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> Currently, there is very limited power management support
> available in the upstream vfio-pci driver. If there is no user of vfio-pci
> device, then the PCI device will be moved into D3Hot state by writing
> directly into PCI PM registers. This D3hot state helps in saving power,
> but we can achieve zero power consumption only in the D3cold state.
> The D3cold state is not possible with native PCI PM alone. It requires
> interaction with platform firmware which is system-specific.
> To go into low power states (including D3cold), the runtime PM framework
> can be used which internally interacts with PCI and platform firmware and
> puts the device into the lowest possible D-States.
> 
> This patch registers vfio-pci driver with the runtime PM framework.
> 
> 1. The PCI core framework takes care of most of the runtime PM
>    related things. For enabling the runtime PM, the PCI driver needs to
>    decrement the usage count and needs to register the runtime
>    suspend/resume callbacks. For vfio-pci based driver, these callback
>    routines can be stubbed in this patch since the vfio-pci driver
>    is not doing the PCI device initialization. All the config state
>    saving, and PCI power management related things will be done by
>    PCI core framework itself inside its runtime suspend/resume callbacks.
> 
> 2. Inside pci_reset_bus(), all the devices in bus/slot will be moved
>    out of D0 state. This state change to D0 can happen directly without
>    going through the runtime PM framework. So if runtime PM is enabled,
>    then pm_runtime_resume() makes the runtime state active. Since the PCI
>    device power state is already D0, so it should return early when it
>    tries to change the state with pci_set_power_state(). Then
>    pm_request_idle() can be used which will internally check for
>    device usage count and will move the device again into the low power
>    state.
> 
> 3. Inside vfio_pci_core_disable(), the device usage count always needs
>    to be decremented which was incremented in vfio_pci_core_enable().
> 
> 4. Since the runtime PM framework will provide the same functionality,
>    so directly writing into PCI PM config register can be replaced with
>    the use of runtime PM routines. Also, the use of runtime PM can help
>    us in more power saving.
> 
>    In the systems which do not support D3Cold,
> 
>    With the existing implementation:
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3hot
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D0
> 
>    With runtime PM:
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3hot
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D3hot
> 
>    So, with runtime PM, the upstream bridge or root port will also go
>    into lower power state which is not possible with existing
>    implementation.
> 
>    In the systems which support D3Cold,
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3hot
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D0
> 
>    With runtime PM:
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3cold
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D3cold
> 
>    So, with runtime PM, both the PCI device and upstream bridge will
>    go into D3cold state.
> 
> 5. If 'disable_idle_d3' module parameter is set, then also the runtime
>    PM will be enabled, but in this case, the usage count should not be
>    decremented.
> 
> 6. vfio_pci_dev_set_try_reset() return value is unused now, so this
>    function return type can be changed to void.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci.c      |  3 +
>  drivers/vfio/pci/vfio_pci_core.c | 95 +++++++++++++++++++++++---------
>  include/linux/vfio_pci_core.h    |  4 ++
>  3 files changed, 75 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index a5ce92beb655..c8695baf3b54 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -193,6 +193,9 @@ static struct pci_driver vfio_pci_driver = {
>  	.remove			= vfio_pci_remove,
>  	.sriov_configure	= vfio_pci_sriov_configure,
>  	.err_handler		= &vfio_pci_core_err_handlers,
> +#if defined(CONFIG_PM)
> +	.driver.pm              = &vfio_pci_core_pm_ops,
> +#endif
>  };
>  
>  static void __init vfio_pci_fill_ids(void)
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index f948e6cd2993..c6e4fe9088c3 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -152,7 +152,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>  }
>  
>  struct vfio_pci_group_info;
> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  				      struct vfio_pci_group_info *groups);
>  
> @@ -245,7 +245,11 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  	u16 cmd;
>  	u8 msix_pos;
>  
> -	vfio_pci_set_power_state(vdev, PCI_D0);
> +	if (!disable_idle_d3) {
> +		ret = pm_runtime_resume_and_get(&pdev->dev);
> +		if (ret < 0)
> +			return ret;
> +	}

Sorry for the delay in review, I'm a novice in pm runtime, but I
haven't forgotten about the remainder of this series.

I think we're removing the unconditional wake here because we now wake
the device in the core registration function below, but I think there
might be a subtle dependency here on the fix to always wake devices in
the disable function as well, otherwise I'm afraid the power state of a
device released in D3hot could leak to the next user here.

>  
>  	/* Don't allow our initial saved state to include busmaster */
>  	pci_clear_master(pdev);
> @@ -405,8 +409,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  out:
>  	pci_disable_device(pdev);
>  
> -	if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
> -		vfio_pci_set_power_state(vdev, PCI_D3hot);
> +	vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
> +
> +	/* Put the pm-runtime usage counter acquired during enable */
> +	if (!disable_idle_d3)
> +		pm_runtime_put(&pdev->dev);
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
>  
> @@ -1847,19 +1854,20 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>  
>  	vfio_pci_probe_power_state(vdev);
>  
> -	if (!disable_idle_d3) {
> -		/*
> -		 * pci-core sets the device power state to an unknown value at
> -		 * bootup and after being removed from a driver.  The only
> -		 * transition it allows from this unknown state is to D0, which
> -		 * typically happens when a driver calls pci_enable_device().
> -		 * We're not ready to enable the device yet, but we do want to
> -		 * be able to get to D3.  Therefore first do a D0 transition
> -		 * before going to D3.
> -		 */
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> -		vfio_pci_set_power_state(vdev, PCI_D3hot);
> -	}
> +	/*
> +	 * pci-core sets the device power state to an unknown value at
> +	 * bootup and after being removed from a driver.  The only
> +	 * transition it allows from this unknown state is to D0, which
> +	 * typically happens when a driver calls pci_enable_device().
> +	 * We're not ready to enable the device yet, but we do want to
> +	 * be able to get to D3.  Therefore first do a D0 transition
> +	 * before enabling runtime PM.
> +	 */
> +	vfio_pci_set_power_state(vdev, PCI_D0);
> +	pm_runtime_allow(&pdev->dev);
> +
> +	if (!disable_idle_d3)
> +		pm_runtime_put(&pdev->dev);

I could use some enlightenment here.  pm_runtime_allow() only does
something if power.runtime_allow is false, in which case it sets that
value to true and decrements power.usage_count.  runtime_allow is
enabled by default in pm_runtime_init(), but pci_pm_init() calls
pm_runtime_forbid() which does the reverse of pm_runtime_allow().  So
do I understand correctly that PCI devices are probed with
runtime_allow = false and a usage_count of 2?

>  
>  	ret = vfio_register_group_dev(&vdev->vdev);
>  	if (ret)
> @@ -1868,7 +1876,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>  
>  out_power:
>  	if (!disable_idle_d3)
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> +		pm_runtime_get_noresume(&pdev->dev);
> +
> +	pm_runtime_forbid(&pdev->dev);
>  out_vf:
>  	vfio_pci_vf_uninit(vdev);
>  	return ret;
> @@ -1887,7 +1897,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>  	vfio_pci_vga_uninit(vdev);
>  
>  	if (!disable_idle_d3)
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> +		pm_runtime_get_noresume(&pdev->dev);
> +
> +	pm_runtime_forbid(&pdev->dev);
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
>  
> @@ -2093,33 +2105,62 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
>   *  - At least one of the affected devices is marked dirty via
>   *    needs_reset (such as by lack of FLR support)
>   * Then attempt to perform that bus or slot reset.
> - * Returns true if the dev_set was reset.
>   */
> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>  {
>  	struct vfio_pci_core_device *cur;
>  	struct pci_dev *pdev;
>  	int ret;
>  
>  	if (!vfio_pci_dev_set_needs_reset(dev_set))
> -		return false;
> +		return;
>  
>  	pdev = vfio_pci_dev_set_resettable(dev_set);
>  	if (!pdev)
> -		return false;
> +		return;
>  
>  	ret = pci_reset_bus(pdev);
>  	if (ret)
> -		return false;
> +		return;
>  
>  	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
>  		cur->needs_reset = false;
> -		if (!disable_idle_d3)
> -			vfio_pci_set_power_state(cur, PCI_D3hot);
> +		if (!disable_idle_d3) {
> +			/*
> +			 * Inside pci_reset_bus(), all the devices in bus/slot
> +			 * will be moved out of D0 state. This state change to

s/out of/into/?

> +			 * D0 can happen directly without going through the
> +			 * runtime PM framework. pm_runtime_resume() will
> +			 * help make the runtime state as active and then
> +			 * pm_request_idle() can be used which will
> +			 * internally check for device usage count and will
> +			 * move the device again into the low power state.
> +			 */
> +			pm_runtime_resume(&pdev->dev);
> +			pm_request_idle(&pdev->dev);
> +		}
>  	}
> -	return true;
>  }
>  
> +#ifdef CONFIG_PM
> +static int vfio_pci_core_runtime_suspend(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static int vfio_pci_core_runtime_resume(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +const struct dev_pm_ops vfio_pci_core_pm_ops = {
> +	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
> +			   vfio_pci_core_runtime_resume,
> +			   NULL)
> +};
> +EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops);
> +#endif

It looks like the vfio_pci_core_pm_ops implementation should all be
moved to where we implement D3cold support, it's not necessary to
implement stubs for any of the functionality of this patch.  Thanks,

Alex

> +
>  void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
>  			      bool is_disable_idle_d3)
>  {
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index ef9a44b6cf5d..aafe09c9fa64 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -231,6 +231,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
>  void vfio_pci_core_disable(struct vfio_pci_core_device *vdev);
>  void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev);
>  
> +#ifdef CONFIG_PM
> +extern const struct dev_pm_ops vfio_pci_core_pm_ops;
> +#endif
> +
>  static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>  {
>  	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;



* Re: [RFC PATCH v2 4/5] vfio/pci: Invalidate mmaps and block the access in D3hot power state
  2022-01-24 18:17 ` [RFC PATCH v2 4/5] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
  2022-01-25  2:35   ` kernel test robot
@ 2022-02-17 23:14   ` Alex Williamson
  2022-02-21  8:12     ` Abhishek Sahu
  1 sibling, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2022-02-17 23:14 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: kvm, Cornelia Huck, Max Gurtovoy, Yishai Hadas, Zhen Lei,
	Jason Gunthorpe, linux-kernel

On Mon, 24 Jan 2022 23:47:25 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> According to [PCIe v5 5.3.1.4.1] for D3hot state
> 
>  "Configuration and Message requests are the only TLPs accepted by a
>   Function in the D3Hot state. All other received Requests must be
>   handled as Unsupported Requests, and all received Completions may
>   optionally be handled as Unexpected Completions."
> 
> Currently, if the vfio PCI device has been put into D3hot state and if
> user makes non-config related read/write request in D3hot state, these
> requests will be forwarded to the host and this access may cause
> issues on a few systems.
> 
> This patch leverages the memory-disable support added in commit
> 'abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on
> disabled memory")' to generate page fault on mmap access and
> return an error for direct read/write. If the device is in the D3hot state,
> then the error needs to be returned for all kinds of BAR
> related access (memory, IO and ROM). Also, the power related structure
> fields need to be protected so we can use the same 'memory_lock' to
> protect these fields also. For the few cases, this 'memory_lock' will be
> already acquired by callers so introduce a separate function
> vfio_pci_set_power_state_locked(). The original
> vfio_pci_set_power_state() now contains the code to do the locking
> related operations.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 47 +++++++++++++++++++++++++-------
>  drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++----
>  2 files changed, 51 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index ee2fb8af57fa..38440d48973f 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -201,11 +201,12 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
>  }
>  
>  /*
> - * pci_set_power_state() wrapper handling devices which perform a soft reset on
> - * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
> - * restore when returned to D0.  Saved separately from pci_saved_state for use
> - * by PM capability emulation and separately from pci_dev internal saved state
> - * to avoid it being overwritten and consumed around other resets.
> + * vfio_pci_set_power_state_locked() wrapper handling devices which perform a
> + * soft reset on D3->D0 transition.  Save state prior to D0/1/2->D3, stash it
> + * on the vdev, restore when returned to D0.  Saved separately from
> + * pci_saved_state for use by PM capability emulation and separately from
> + * pci_dev internal saved state to avoid it being overwritten and consumed
> + * around other resets.
>   *
>   * There are few cases where the PCI power state can be changed to D0
>   * without the involvement of this API. So, cache the power state locally
> @@ -215,7 +216,8 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
>   * The memory taken for saving this PCI state needs to be freed to
>   * prevent memory leak.
>   */
> -int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
> +static int vfio_pci_set_power_state_locked(struct vfio_pci_core_device *vdev,
> +					   pci_power_t state)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
>  	bool needs_restore = false, needs_save = false;
> @@ -260,6 +262,26 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	return ret;
>  }
>  
> +/*
> + * vfio_pci_set_power_state() takes all the required locks to protect
> + * the access of power related variables and then invokes
> + * vfio_pci_set_power_state_locked().
> + */
> +int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev,
> +			     pci_power_t state)
> +{
> +	int ret;
> +
> +	if (state >= PCI_D3hot)
> +		vfio_pci_zap_and_down_write_memory_lock(vdev);
> +	else
> +		down_write(&vdev->memory_lock);
> +
> +	ret = vfio_pci_set_power_state_locked(vdev, state);
> +	up_write(&vdev->memory_lock);
> +	return ret;
> +}
> +
>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
> @@ -354,7 +376,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	 * in running the logic needed for D0 power state. The subsequent
>  	 * runtime PM API's will put the device into the low power state again.
>  	 */
> -	vfio_pci_set_power_state(vdev, PCI_D0);
> +	vfio_pci_set_power_state_locked(vdev, PCI_D0);
>  
>  	/* Stop the device from further DMA */
>  	pci_clear_master(pdev);
> @@ -967,7 +989,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  		 * interaction. Update the power state in vfio driver to perform
>  		 * the logic needed for D0 power state.
>  		 */
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> +		vfio_pci_set_power_state_locked(vdev, PCI_D0);
>  		up_write(&vdev->memory_lock);
>  
>  		return ret;
> @@ -1453,6 +1475,11 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  		goto up_out;
>  	}
>  
> +	if (vdev->power_state >= PCI_D3hot) {
> +		ret = VM_FAULT_SIGBUS;
> +		goto up_out;
> +	}
> +
>  	/*
>  	 * We populate the whole vma on fault, so we need to test whether
>  	 * the vma has already been mapped, such as for concurrent faults
> @@ -1902,7 +1929,7 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>  	 * be able to get to D3.  Therefore first do a D0 transition
>  	 * before enabling runtime PM.
>  	 */
> -	vfio_pci_set_power_state(vdev, PCI_D0);
> +	vfio_pci_set_power_state_locked(vdev, PCI_D0);
>  	pm_runtime_allow(&pdev->dev);
>  
>  	if (!disable_idle_d3)
> @@ -2117,7 +2144,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		 * interaction. Update the power state in vfio driver to perform
>  		 * the logic needed for D0 power state.
>  		 */
> -		vfio_pci_set_power_state(cur, PCI_D0);
> +		vfio_pci_set_power_state_locked(cur, PCI_D0);
>  		if (cur == cur_mem)
>  			is_mem = false;
>  		if (cur == cur_vma)
> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> index 57d3b2cbbd8e..e97ba14c4aa0 100644
> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> @@ -41,8 +41,13 @@
>  static int vfio_pci_iowrite##size(struct vfio_pci_core_device *vdev,		\
>  			bool test_mem, u##size val, void __iomem *io)	\
>  {									\
> +	down_read(&vdev->memory_lock);					\
> +	if (vdev->power_state >= PCI_D3hot) {				\
> +		up_read(&vdev->memory_lock);				\
> +		return -EIO;						\
> +	}								\
> +									\

The reason that we only set test_mem for MMIO BARs is that systems are
generally more lenient about probing unresponsive I/O port space to
support legacy use cases.  Have you found cases where access to an I/O
port BAR when the device is either in D3hot+ or I/O port is disabled in
the command register triggers a system fault?  If not it seems we could
roll the power_state check into __vfio_pci_memory_enabled(), if so then
we probably need to improve our coverage of access to disabled I/O port
BARs beyond only the power_state check.  Thanks,

Alex

>  	if (test_mem) {							\
> -		down_read(&vdev->memory_lock);				\
>  		if (!__vfio_pci_memory_enabled(vdev)) {			\
>  			up_read(&vdev->memory_lock);			\
>  			return -EIO;					\
> @@ -51,8 +56,7 @@ static int vfio_pci_iowrite##size(struct vfio_pci_core_device *vdev,		\
>  									\
>  	vfio_iowrite##size(val, io);					\
>  									\
> -	if (test_mem)							\
> -		up_read(&vdev->memory_lock);				\
> +	up_read(&vdev->memory_lock);					\
>  									\
>  	return 0;							\
>  }
> @@ -68,8 +72,13 @@ VFIO_IOWRITE(64)
>  static int vfio_pci_ioread##size(struct vfio_pci_core_device *vdev,		\
>  			bool test_mem, u##size *val, void __iomem *io)	\
>  {									\
> +	down_read(&vdev->memory_lock);					\
> +	if (vdev->power_state >= PCI_D3hot) {				\
> +		up_read(&vdev->memory_lock);				\
> +		return -EIO;						\
> +	}								\
> +									\
>  	if (test_mem) {							\
> -		down_read(&vdev->memory_lock);				\
>  		if (!__vfio_pci_memory_enabled(vdev)) {			\
>  			up_read(&vdev->memory_lock);			\
>  			return -EIO;					\
> @@ -78,8 +87,7 @@ static int vfio_pci_ioread##size(struct vfio_pci_core_device *vdev,		\
>  									\
>  	*val = vfio_ioread##size(io);					\
>  									\
> -	if (test_mem)							\
> -		up_read(&vdev->memory_lock);				\
> +	up_read(&vdev->memory_lock);					\
>  									\
>  	return 0;							\
>  }


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 1/5] vfio/pci: register vfio-pci driver with runtime PM framework
  2022-02-16 23:48   ` Alex Williamson
@ 2022-02-21  6:35     ` Abhishek Sahu
  0 siblings, 0 replies; 21+ messages in thread
From: Abhishek Sahu @ 2022-02-21  6:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Cornelia Huck, Max Gurtovoy, Yishai Hadas, Zhen Lei,
	Jason Gunthorpe, linux-kernel

On 2/17/2022 5:18 AM, Alex Williamson wrote:
> On Mon, 24 Jan 2022 23:47:22 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> Currently, there is very limited power management support
>> available in the upstream vfio-pci driver. If there is no user of vfio-pci
>> device, then the PCI device will be moved into D3Hot state by writing
>> directly into PCI PM registers. This D3Hot state help in saving power
>> but we can achieve zero power consumption if we go into the D3cold state.
>> The D3cold state is not possible with native PCI PM. It requires
>> interaction with platform firmware which is system-specific.
>> To go into low power states (including D3cold), the runtime PM framework
>> can be used which internally interacts with PCI and platform firmware and
>> puts the device into the lowest possible D-States.
>>
>> This patch registers vfio-pci driver with the runtime PM framework.
>>
>> 1. The PCI core framework takes care of most of the runtime PM
>>    related things. For enabling the runtime PM, the PCI driver needs to
>>    decrement the usage count and needs to register the runtime
>>    suspend/resume callbacks. For vfio-pci based driver, these callback
>>    routines can be stubbed in this patch since the vfio-pci driver
>>    is not doing the PCI device initialization. All the config state
>>    saving, and PCI power management related things will be done by
>>    PCI core framework itself inside its runtime suspend/resume callbacks.
>>
>> 2. Inside pci_reset_bus(), all the devices in bus/slot will be moved
>>    out of D0 state. This state change to D0 can happen directly without
>>    going through the runtime PM framework. So if runtime PM is enabled,
>>    then pm_runtime_resume() makes the runtime state active. Since the PCI
>>    device power state is already D0, so it should return early when it
>>    tries to change the state with pci_set_power_state(). Then
>>    pm_request_idle() can be used which will internally check for
>>    device usage count and will move the device again into the low power
>>    state.
>>
>> 3. Inside vfio_pci_core_disable(), the device usage count always needs
>>    to be decremented which was incremented in vfio_pci_core_enable().
>>
>> 4. Since the runtime PM framework will provide the same functionality,
>>    so directly writing into PCI PM config register can be replaced with
>>    the use of runtime PM routines. Also, the use of runtime PM can help
>>    us in more power saving.
>>
>>    In the systems which do not support D3Cold,
>>
>>    With the existing implementation:
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3hot
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D0
>>
>>    With runtime PM:
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3hot
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D3hot
>>
>>    So, with runtime PM, the upstream bridge or root port will also go
>>    into lower power state which is not possible with existing
>>    implementation.
>>
>>    In the systems which support D3Cold,
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3hot
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D0
>>
>>    With runtime PM:
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3cold
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D3cold
>>
>>    So, with runtime PM, both the PCI device and upstream bridge will
>>    go into D3cold state.
>>
>> 5. If 'disable_idle_d3' module parameter is set, then also the runtime
>>    PM will be enabled, but in this case, the usage count should not be
>>    decremented.
>>
>> 6. vfio_pci_dev_set_try_reset() return value is unused now, so this
>>    function return type can be changed to void.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci.c      |  3 +
>>  drivers/vfio/pci/vfio_pci_core.c | 95 +++++++++++++++++++++++---------
>>  include/linux/vfio_pci_core.h    |  4 ++
>>  3 files changed, 75 insertions(+), 27 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index a5ce92beb655..c8695baf3b54 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -193,6 +193,9 @@ static struct pci_driver vfio_pci_driver = {
>>       .remove                 = vfio_pci_remove,
>>       .sriov_configure        = vfio_pci_sriov_configure,
>>       .err_handler            = &vfio_pci_core_err_handlers,
>> +#if defined(CONFIG_PM)
>> +     .driver.pm              = &vfio_pci_core_pm_ops,
>> +#endif
>>  };
>>
>>  static void __init vfio_pci_fill_ids(void)
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index f948e6cd2993..c6e4fe9088c3 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -152,7 +152,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>>  }
>>
>>  struct vfio_pci_group_info;
>> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>                                     struct vfio_pci_group_info *groups);
>>
>> @@ -245,7 +245,11 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>>       u16 cmd;
>>       u8 msix_pos;
>>
>> -     vfio_pci_set_power_state(vdev, PCI_D0);
>> +     if (!disable_idle_d3) {
>> +             ret = pm_runtime_resume_and_get(&pdev->dev);
>> +             if (ret < 0)
>> +                     return ret;
>> +     }
> 
> Sorry for the delay in review, I'm a novice in pm runtime, but I
> haven't forgotten about the remainder of this series.
> 

 Thanks Alex.
 Should I CC linux-pm@vger.kernel.org when sending the updated
 version? I got the following comment on a different patch of mine
 related to PCI PM
 (https://lore.kernel.org/lkml/20220204233219.GA228585@bhelgaas/T/#me17cb6e1aa3848cfd4ea577a3c93ebbbfdbf7c73):

 "generally PM patches should be CCed to linux-pm anyway"


> I think we're removing the unconditional wake here because we now wake
> the device in the core registration function below, but I think there
> might be a subtle dependency here on the fix to always wake devices in
> the disable function as well, otherwise I'm afraid the power state of a
> device released in D3hot could leak to the next user here.
> 

 Yes, we need to consider the fix.
 We can either add the state-restore handling logic inside
 vfio_pci_core_runtime_resume() or keep an explicit restore of the
 state here. For runtime PM, pm_runtime_resume_and_get() needs to be
 called first, since the root port should be moved to D0 before the
 device state is restored.

>>
>>       /* Don't allow our initial saved state to include busmaster */
>>       pci_clear_master(pdev);
>> @@ -405,8 +409,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>  out:
>>       pci_disable_device(pdev);
>>
>> -     if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
>> -             vfio_pci_set_power_state(vdev, PCI_D3hot);
>> +     vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
>> +
>> +     /* Put the pm-runtime usage counter acquired during enable */
>> +     if (!disable_idle_d3)
>> +             pm_runtime_put(&pdev->dev);
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
>>
>> @@ -1847,19 +1854,20 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>
>>       vfio_pci_probe_power_state(vdev);
>>
>> -     if (!disable_idle_d3) {
>> -             /*
>> -              * pci-core sets the device power state to an unknown value at
>> -              * bootup and after being removed from a driver.  The only
>> -              * transition it allows from this unknown state is to D0, which
>> -              * typically happens when a driver calls pci_enable_device().
>> -              * We're not ready to enable the device yet, but we do want to
>> -              * be able to get to D3.  Therefore first do a D0 transition
>> -              * before going to D3.
>> -              */
>> -             vfio_pci_set_power_state(vdev, PCI_D0);
>> -             vfio_pci_set_power_state(vdev, PCI_D3hot);
>> -     }
>> +     /*
>> +      * pci-core sets the device power state to an unknown value at
>> +      * bootup and after being removed from a driver.  The only
>> +      * transition it allows from this unknown state is to D0, which
>> +      * typically happens when a driver calls pci_enable_device().
>> +      * We're not ready to enable the device yet, but we do want to
>> +      * be able to get to D3.  Therefore first do a D0 transition
>> +      * before enabling runtime PM.
>> +      */
>> +     vfio_pci_set_power_state(vdev, PCI_D0);
>> +     pm_runtime_allow(&pdev->dev);
>> +
>> +     if (!disable_idle_d3)
>> +             pm_runtime_put(&pdev->dev);
> 
> I could use some enlightenment here.  pm_runtime_allow() only does
> something if power.runtime_allow is false, in which case it sets that
> value to true and decrements power.usage_count.  runtime_allow is
> enabled by default in pm_runtime_init(), but pci_pm_init() calls
> pm_runtime_forbid() which does the reverse of pm_runtime_allow().  So
> do I understand correctly that PCI devices are probed with
> runtime_allow = false and a usage_count of 2?
> 

 Following is the flow w.r.t. usage_count and runtime_allow.

 In pci_pm_init(), the default usage_count=0 and runtime_allow=true initially.
 pm_runtime_forbid() in pci_pm_init() makes usage_count=1 and runtime_allow=false

 Then, inside local_pci_probe(), pm_runtime_get_sync() is called,
 After this, the usage_count=2 and runtime_allow=false

 So, you are correct that the PCI devices are probed with
 runtime_allow=false and usage_count=2.

 In the driver,

 pm_runtime_allow() undoes pm_runtime_forbid(), and
 pm_runtime_put() undoes pm_runtime_get_sync().
 
>>
>>       ret = vfio_register_group_dev(&vdev->vdev);
>>       if (ret)
>> @@ -1868,7 +1876,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>
>>  out_power:
>>       if (!disable_idle_d3)
>> -             vfio_pci_set_power_state(vdev, PCI_D0);
>> +             pm_runtime_get_noresume(&pdev->dev);
>> +
>> +     pm_runtime_forbid(&pdev->dev);
>>  out_vf:
>>       vfio_pci_vf_uninit(vdev);
>>       return ret;
>> @@ -1887,7 +1897,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>>       vfio_pci_vga_uninit(vdev);
>>
>>       if (!disable_idle_d3)
>> -             vfio_pci_set_power_state(vdev, PCI_D0);
>> +             pm_runtime_get_noresume(&pdev->dev);
>> +
>> +     pm_runtime_forbid(&pdev->dev);
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
>>
>> @@ -2093,33 +2105,62 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
>>   *  - At least one of the affected devices is marked dirty via
>>   *    needs_reset (such as by lack of FLR support)
>>   * Then attempt to perform that bus or slot reset.
>> - * Returns true if the dev_set was reset.
>>   */
>> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>>  {
>>       struct vfio_pci_core_device *cur;
>>       struct pci_dev *pdev;
>>       int ret;
>>
>>       if (!vfio_pci_dev_set_needs_reset(dev_set))
>> -             return false;
>> +             return;
>>
>>       pdev = vfio_pci_dev_set_resettable(dev_set);
>>       if (!pdev)
>> -             return false;
>> +             return;
>>
>>       ret = pci_reset_bus(pdev);
>>       if (ret)
>> -             return false;
>> +             return;
>>
>>       list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
>>               cur->needs_reset = false;
>> -             if (!disable_idle_d3)
>> -                     vfio_pci_set_power_state(cur, PCI_D3hot);
>> +             if (!disable_idle_d3) {
>> +                     /*
>> +                      * Inside pci_reset_bus(), all the devices in bus/slot
>> +                      * will be moved out of D0 state. This state change to
> 
> s/out of/into/?
> 

 Yes. I will fix this. 

>> +                      * D0 can happen directly without going through the
>> +                      * runtime PM framework. pm_runtime_resume() will
>> +                      * help make the runtime state as active and then
>> +                      * pm_request_idle() can be used which will
>> +                      * internally check for device usage count and will
>> +                      * move the device again into the low power state.
>> +                      */
>> +                     pm_runtime_resume(&pdev->dev);
>> +                     pm_request_idle(&pdev->dev);
>> +             }
>>       }
>> -     return true;
>>  }
>>
>> +#ifdef CONFIG_PM
>> +static int vfio_pci_core_runtime_suspend(struct device *dev)
>> +{
>> +     return 0;
>> +}
>> +
>> +static int vfio_pci_core_runtime_resume(struct device *dev)
>> +{
>> +     return 0;
>> +}
>> +
>> +const struct dev_pm_ops vfio_pci_core_pm_ops = {
>> +     SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
>> +                        vfio_pci_core_runtime_resume,
>> +                        NULL)
>> +};
>> +EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops);
>> +#endif
> 
> It looks like the vfio_pci_core_pm_ops implementation should all be
> moved to where we implement D3cold support, it's not necessary to
> implement stubs for any of the functionality of this patch.  Thanks,
> 

 We need to provide dev_pm_ops at least to make runtime PM work.
 In pci_pm_runtime_idle() generic function:

 const struct dev_pm_ops *pm = dev->driver ? dev->driver->pm : NULL;
 if (!pm)
    return -ENOSYS;

 Without dev_pm_ops, the idle routine will return ENOSYS error.

 The vfio_pci_core_runtime_{suspend/resume}() stub implementations can be
 removed, but we need to provide at least a stub vfio_pci_core_pm_ops.

 const struct dev_pm_ops vfio_pci_core_pm_ops = { };
 EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops);

 Thanks,
 Abhishek 
 
> Alex
> 
>> +
>>  void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
>>                             bool is_disable_idle_d3)
>>  {
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index ef9a44b6cf5d..aafe09c9fa64 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -231,6 +231,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
>>  void vfio_pci_core_disable(struct vfio_pci_core_device *vdev);
>>  void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev);
>>
>> +#ifdef CONFIG_PM
>> +extern const struct dev_pm_ops vfio_pci_core_pm_ops;
>> +#endif
>> +
>>  static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>>  {
>>       return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
> 



* Re: [RFC PATCH v2 4/5] vfio/pci: Invalidate mmaps and block the access in D3hot power state
  2022-02-17 23:14   ` Alex Williamson
@ 2022-02-21  8:12     ` Abhishek Sahu
  0 siblings, 0 replies; 21+ messages in thread
From: Abhishek Sahu @ 2022-02-21  8:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Cornelia Huck, Max Gurtovoy, Yishai Hadas, Zhen Lei,
	Jason Gunthorpe, linux-kernel

On 2/18/2022 4:44 AM, Alex Williamson wrote:
> On Mon, 24 Jan 2022 23:47:25 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> According to [PCIe v5 5.3.1.4.1] for D3hot state
>>
>>  "Configuration and Message requests are the only TLPs accepted by a
>>   Function in the D3Hot state. All other received Requests must be
>>   handled as Unsupported Requests, and all received Completions may
>>   optionally be handled as Unexpected Completions."
>>
>> Currently, if the vfio PCI device has been put into D3hot state and if
>> user makes non-config related read/write request in D3hot state, these
>> requests will be forwarded to the host and this access may cause
>> issues on a few systems.
>>
>> This patch leverages the memory-disable support added in commit
>> 'abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on
>> disabled memory")' to generate page fault on mmap access and
>> return error for the direct read/write. If the device is D3hot state,
>> then the error needs to be returned for all kinds of BAR
>> related access (memory, IO and ROM). Also, the power related structure
>> fields need to be protected so we can use the same 'memory_lock' to
>> protect these fields also. For the few cases, this 'memory_lock' will be
>> already acquired by callers so introduce a separate function
>> vfio_pci_set_power_state_locked(). The original
>> vfio_pci_set_power_state() now contains the code to do the locking
>> related operations.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_core.c | 47 +++++++++++++++++++++++++-------
>>  drivers/vfio/pci/vfio_pci_rdwr.c | 20 ++++++++++----
>>  2 files changed, 51 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index ee2fb8af57fa..38440d48973f 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -201,11 +201,12 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
>>  }
>>
>>  /*
>> - * pci_set_power_state() wrapper handling devices which perform a soft reset on
>> - * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
>> - * restore when returned to D0.  Saved separately from pci_saved_state for use
>> - * by PM capability emulation and separately from pci_dev internal saved state
>> - * to avoid it being overwritten and consumed around other resets.
>> + * vfio_pci_set_power_state_locked() wrapper handling devices which perform a
>> + * soft reset on D3->D0 transition.  Save state prior to D0/1/2->D3, stash it
>> + * on the vdev, restore when returned to D0.  Saved separately from
>> + * pci_saved_state for use by PM capability emulation and separately from
>> + * pci_dev internal saved state to avoid it being overwritten and consumed
>> + * around other resets.
>>   *
>>   * There are few cases where the PCI power state can be changed to D0
>>   * without the involvement of this API. So, cache the power state locally
>> @@ -215,7 +216,8 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
>>   * The memory taken for saving this PCI state needs to be freed to
>>   * prevent memory leak.
>>   */
>> -int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
>> +static int vfio_pci_set_power_state_locked(struct vfio_pci_core_device *vdev,
>> +                                        pci_power_t state)
>>  {
>>       struct pci_dev *pdev = vdev->pdev;
>>       bool needs_restore = false, needs_save = false;
>> @@ -260,6 +262,26 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>       return ret;
>>  }
>>
>> +/*
>> + * vfio_pci_set_power_state() takes all the required locks to protect
>> + * the access of power related variables and then invokes
>> + * vfio_pci_set_power_state_locked().
>> + */
>> +int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev,
>> +                          pci_power_t state)
>> +{
>> +     int ret;
>> +
>> +     if (state >= PCI_D3hot)
>> +             vfio_pci_zap_and_down_write_memory_lock(vdev);
>> +     else
>> +             down_write(&vdev->memory_lock);
>> +
>> +     ret = vfio_pci_set_power_state_locked(vdev, state);
>> +     up_write(&vdev->memory_lock);
>> +     return ret;
>> +}
>> +
>>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>>  {
>>       struct pci_dev *pdev = vdev->pdev;
>> @@ -354,7 +376,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>        * in running the logic needed for D0 power state. The subsequent
>>        * runtime PM API's will put the device into the low power state again.
>>        */
>> -     vfio_pci_set_power_state(vdev, PCI_D0);
>> +     vfio_pci_set_power_state_locked(vdev, PCI_D0);
>>
>>       /* Stop the device from further DMA */
>>       pci_clear_master(pdev);
>> @@ -967,7 +989,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>                * interaction. Update the power state in vfio driver to perform
>>                * the logic needed for D0 power state.
>>                */
>> -             vfio_pci_set_power_state(vdev, PCI_D0);
>> +             vfio_pci_set_power_state_locked(vdev, PCI_D0);
>>               up_write(&vdev->memory_lock);
>>
>>               return ret;
>> @@ -1453,6 +1475,11 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>>               goto up_out;
>>       }
>>
>> +     if (vdev->power_state >= PCI_D3hot) {
>> +             ret = VM_FAULT_SIGBUS;
>> +             goto up_out;
>> +     }
>> +
>>       /*
>>        * We populate the whole vma on fault, so we need to test whether
>>        * the vma has already been mapped, such as for concurrent faults
>> @@ -1902,7 +1929,7 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>        * be able to get to D3.  Therefore first do a D0 transition
>>        * before enabling runtime PM.
>>        */
>> -     vfio_pci_set_power_state(vdev, PCI_D0);
>> +     vfio_pci_set_power_state_locked(vdev, PCI_D0);
>>       pm_runtime_allow(&pdev->dev);
>>
>>       if (!disable_idle_d3)
>> @@ -2117,7 +2144,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>                * interaction. Update the power state in vfio driver to perform
>>                * the logic needed for D0 power state.
>>                */
>> -             vfio_pci_set_power_state(cur, PCI_D0);
>> +             vfio_pci_set_power_state_locked(cur, PCI_D0);
>>               if (cur == cur_mem)
>>                       is_mem = false;
>>               if (cur == cur_vma)
>> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
>> index 57d3b2cbbd8e..e97ba14c4aa0 100644
>> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
>> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
>> @@ -41,8 +41,13 @@
>>  static int vfio_pci_iowrite##size(struct vfio_pci_core_device *vdev,         \
>>                       bool test_mem, u##size val, void __iomem *io)   \
>>  {                                                                    \
>> +     down_read(&vdev->memory_lock);                                  \
>> +     if (vdev->power_state >= PCI_D3hot) {                           \
>> +             up_read(&vdev->memory_lock);                            \
>> +             return -EIO;                                            \
>> +     }                                                               \
>> +                                                                     \
> 
> The reason that we only set test_mem for MMIO BARs is that systems are
> generally more lenient about probing unresponsive I/O port space to
> support legacy use cases.  Have you found cases where access to an I/O
> port BAR when the device is either in D3hot+ or I/O port is disabled in
> the command register triggers a system fault?  If not it seems we could
> roll the power_state check into __vfio_pci_memory_enabled(), if so then
> we probably need to improve our coverage of access to disabled I/O port
> BARs beyond only the power_state check.  Thanks,
> 
> Alex
> 

 I have not seen any unresponsiveness on the systems which I am using
 for testing these patches. If I try to access an MMIO BAR or I/O port
 while the device is in D3hot+, then I get all 0xff. Since I was not
 sure about the behaviour on other systems while the device is in
 D3hot+, I did the power_state check outside.

 We can start with the power_state check under __vfio_pci_memory_enabled()
 and improve coverage later on if any issue arises.

 Thanks,
 Abhishek

>>       if (test_mem) {                                                 \
>> -             down_read(&vdev->memory_lock);                          \
>>               if (!__vfio_pci_memory_enabled(vdev)) {                 \
>>                       up_read(&vdev->memory_lock);                    \
>>                       return -EIO;                                    \
>> @@ -51,8 +56,7 @@ static int vfio_pci_iowrite##size(struct vfio_pci_core_device *vdev,                \
>>                                                                       \
>>       vfio_iowrite##size(val, io);                                    \
>>                                                                       \
>> -     if (test_mem)                                                   \
>> -             up_read(&vdev->memory_lock);                            \
>> +     up_read(&vdev->memory_lock);                                    \
>>                                                                       \
>>       return 0;                                                       \
>>  }
>> @@ -68,8 +72,13 @@ VFIO_IOWRITE(64)
>>  static int vfio_pci_ioread##size(struct vfio_pci_core_device *vdev,          \
>>                       bool test_mem, u##size *val, void __iomem *io)  \
>>  {                                                                    \
>> +     down_read(&vdev->memory_lock);                                  \
>> +     if (vdev->power_state >= PCI_D3hot) {                           \
>> +             up_read(&vdev->memory_lock);                            \
>> +             return -EIO;                                            \
>> +     }                                                               \
>> +                                                                     \
>>       if (test_mem) {                                                 \
>> -             down_read(&vdev->memory_lock);                          \
>>               if (!__vfio_pci_memory_enabled(vdev)) {                 \
>>                       up_read(&vdev->memory_lock);                    \
>>                       return -EIO;                                    \
>> @@ -78,8 +87,7 @@ static int vfio_pci_ioread##size(struct vfio_pci_core_device *vdev,         \
>>                                                                       \
>>       *val = vfio_ioread##size(io);                                   \
>>                                                                       \
>> -     if (test_mem)                                                   \
>> -             up_read(&vdev->memory_lock);                            \
>> +     up_read(&vdev->memory_lock);                                    \
>>                                                                       \
>>       return 0;                                                       \
>>  }
> 



* Re: [RFC PATCH v2 5/5] vfio/pci: add the support for PCI D3cold state
  2022-01-24 18:17 ` [RFC PATCH v2 5/5] vfio/pci: add the support for PCI D3cold state Abhishek Sahu
@ 2022-03-09 17:26   ` Alex Williamson
  2022-03-11 15:45     ` Abhishek Sahu
  2022-03-11 16:17     ` Jason Gunthorpe
  0 siblings, 2 replies; 21+ messages in thread
From: Alex Williamson @ 2022-03-09 17:26 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: kvm, Cornelia Huck, Max Gurtovoy, Yishai Hadas, Zhen Lei,
	Jason Gunthorpe, linux-kernel

On Mon, 24 Jan 2022 23:47:26 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> Currently, if the runtime power management is enabled for vfio-pci
> device in the guest OS, then guest OS will do the register write for
> PCI_PM_CTRL register. This write request will be handled in
> vfio_pm_config_write() where it will do the actual register write
> of PCI_PM_CTRL register. With this, the maximum D3hot state can be
> achieved for low power. If we can use the runtime PM framework,
> then we can achieve the D3cold state which will help in saving
> maximum power.
> 
> 1. Since D3cold state can't be achieved by writing PCI standard
>    PM config registers, so this patch adds a new IOCTL which change the
>    PCI device from D3hot to D3cold state and then D3cold to D0 state.
> 
> 2. The hypervisors can implement virtual ACPI methods. For
>    example, in guest linux OS if PCI device ACPI node has _PR3 and _PR0
>    power resources with _ON/_OFF method, then guest linux OS makes the
>    _OFF call during D3cold transition and then _ON during D0 transition.
>    The hypervisor can tap these virtual ACPI calls and then do the D3cold
>    related IOCTL in the vfio driver.
> 
> 3. The vfio driver uses runtime PM framework to achieve the
>    D3cold state. For the D3cold transition, decrement the usage count and
>    during D0 transition increment the usage count.
> 
> 4. For D3cold, the device current power state should be D3hot.
>    Then during runtime suspend, the pci_platform_power_transition() is
>    required for D3cold state. If the D3cold state is not supported, then
>    the device will still be in D3hot state. But with the runtime PM, the
>    root port can now also go into suspended state.
> 
> 5. For most of the systems, the D3cold is supported at the root
>    port level. So, when root port will transition to D3cold state, then
>    the vfio PCI device will go from D3hot to D3cold state during its
>    runtime suspend. If the root port does not support D3cold, then the
>    root port will go into D3hot state.
> 
> 6. The runtime suspend callback can now happen for 2 cases: there
>    is no user of vfio device and the case where user has initiated
>    D3cold. The 'runtime_suspend_pending' flag can help to distinguish
>    this case.
> 
> 7. There are cases where guest has put PCI device into D3cold
>    state and then on the host side, user has run lspci or any other
>    command which requires access of the PCI config register. In this case,
>    the kernel runtime PM framework will resume the PCI device internally,
>    read the config space and put the device into D3cold state again. Some
>    PCI devices need SW involvement before going into D3cold state.
>    For the first D3cold transition, the driver running on the guest side
>    does the SW side steps. But the second D3cold transition will happen
>    without guest driver involvement. So, prevent this second D3cold
>    transition by incrementing the device usage count. This will keep the
>    device unnecessarily in D0, but it's better than failure. In future, we
>    can add some mechanism by which we can forward these wake-up requests
>    to the guest, and then the mentioned case can be handled as well.
> 
> 8. In D3cold, all kind of BAR related access needs to be disabled
>    like D3hot. Additionally, the config space will also be disabled in
>    D3cold state. To prevent access of config space in the D3cold state,
>    increment the runtime PM usage count before doing any config space
>    access. Also, most of the IOCTLs do the config space access, so
>    maintain one safe list and skip the resume only for these safe IOCTLs
>    alone. For other IOCTLs, the runtime PM usage count will be
>    incremented first.
> 
> 9. Now, runtime suspend/resume callbacks need to get the vdev
>    reference which can be obtained by dev_get_drvdata(). Currently, the
>    dev_set_drvdata() is being set after returning from
>    vfio_pci_core_register_device(). The runtime callbacks can come
>    anytime after enabling runtime PM so dev_set_drvdata() must happen
>    before that. We can move dev_set_drvdata() inside
>    vfio_pci_core_register_device() itself.
> 
> 10. The vfio device user can close the device after putting
>     the device into runtime suspended state so inside
>     vfio_pci_core_disable(), increment the runtime PM usage count.
> 
> 11. Runtime PM will be possible only if CONFIG_PM is enabled on
>     the host. So, the IOCTL related code can be put under CONFIG_PM
>     Kconfig.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci.c        |   1 -
>  drivers/vfio/pci/vfio_pci_config.c |  11 +-
>  drivers/vfio/pci/vfio_pci_core.c   | 186 +++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h      |   1 +
>  include/uapi/linux/vfio.h          |  21 ++++
>  5 files changed, 211 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index c8695baf3b54..4ac3338c8fc7 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -153,7 +153,6 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	ret = vfio_pci_core_register_device(vdev);
>  	if (ret)
>  		goto out_free;
> -	dev_set_drvdata(&pdev->dev, vdev);

Relocating the setting of drvdata should be proposed separately rather
than buried in this patch.  The driver owns drvdata, the driver is the
only consumer of drvdata, so pushing this into the core to impose a
standard for drvdata across all vfio-pci variants doesn't seem like a
good idea to me.

>  	return 0;
>  
>  out_free:
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index dd9ed211ba6f..d20420657959 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -25,6 +25,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/vfio.h>
>  #include <linux/slab.h>
> +#include <linux/pm_runtime.h>
>  
>  #include <linux/vfio_pci_core.h>
>  
> @@ -1919,16 +1920,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>  ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>  			   size_t count, loff_t *ppos, bool iswrite)
>  {
> +	struct device *dev = &vdev->pdev->dev;
>  	size_t done = 0;
>  	int ret = 0;
>  	loff_t pos = *ppos;
>  
>  	pos &= VFIO_PCI_OFFSET_MASK;
>  
> +	ret = pm_runtime_resume_and_get(dev);
> +	if (ret < 0)
> +		return ret;
> +
>  	while (count) {
>  		ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
> -		if (ret < 0)
> +		if (ret < 0) {
> +			pm_runtime_put(dev);
>  			return ret;
> +		}
>  
>  		count -= ret;
>  		done += ret;
> @@ -1936,6 +1944,7 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>  		pos += ret;
>  	}
>  
> +	pm_runtime_put(dev);

What about other config accesses, ex. shared INTx?  We need to
interact with the device command and status register on an incoming
interrupt to test if our device sent an interrupt and to mask it.  The
unmask eventfd can also trigger config space accesses.  Seems
incomplete relative to config space.

>  	*ppos += done;
>  
>  	return done;
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 38440d48973f..b70bb4fd940d 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -371,12 +371,23 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	lockdep_assert_held(&vdev->vdev.dev_set->lock);
>  
>  	/*
> -	 * If disable has been called while the power state is other than D0,
> -	 * then set the power state in vfio driver to D0. It will help
> -	 * in running the logic needed for D0 power state. The subsequent
> -	 * runtime PM API's will put the device into the low power state again.
> +	 * The vfio device user can close the device after putting the device
> +	 * into runtime suspended state so wake up the device first in
> +	 * this case.
>  	 */
> -	vfio_pci_set_power_state_locked(vdev, PCI_D0);
> +	if (vdev->runtime_suspend_pending) {
> +		vdev->runtime_suspend_pending = false;
> +		pm_runtime_resume_and_get(&pdev->dev);

Doesn't vdev->power_state become unsynchronized from the actual device
state here and maybe elsewhere in this patch?  (I see below that maybe
the resume handler accounts for this)

> +	} else {
> +		/*
> +		 * If disable has been called while the power state is other
> +		 * than D0, then set the power state in vfio driver to D0. It
> +		 * will help in running the logic needed for D0 power state.
> +		 * The subsequent runtime PM API's will put the device into
> +		 * the low power state again.
> +		 */
> +		vfio_pci_set_power_state_locked(vdev, PCI_D0);
> +	}
>  
>  	/* Stop the device from further DMA */
>  	pci_clear_master(pdev);
> @@ -693,8 +704,8 @@ int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev,
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_register_dev_region);
>  
> -long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> -		unsigned long arg)
> +static long vfio_pci_core_ioctl_internal(struct vfio_device *core_vdev,
> +					 unsigned int cmd, unsigned long arg)
>  {
>  	struct vfio_pci_core_device *vdev =
>  		container_of(core_vdev, struct vfio_pci_core_device, vdev);
> @@ -1241,10 +1252,119 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  		default:
>  			return -ENOTTY;
>  		}
> +#ifdef CONFIG_PM
> +	} else if (cmd == VFIO_DEVICE_POWER_MANAGEMENT) {

I'd suggest using a DEVICE_FEATURE ioctl for this.  This ioctl doesn't
follow the vfio standard of argsz/flags and doesn't seem to do anything
special that we couldn't achieve with a DEVICE_FEATURE ioctl.

> +		struct vfio_power_management vfio_pm;
> +		struct pci_dev *pdev = vdev->pdev;
> +		bool request_idle = false, request_resume = false;
> +		int ret = 0;
> +
> +		if (copy_from_user(&vfio_pm, (void __user *)arg, sizeof(vfio_pm)))
> +			return -EFAULT;
> +
> +		/*
> +		 * The vdev power related fields are protected with memory_lock
> +		 * semaphore.
> +		 */
> +		down_write(&vdev->memory_lock);
> +		switch (vfio_pm.d3cold_state) {
> +		case VFIO_DEVICE_D3COLD_STATE_ENTER:
> +			/*
> +			 * For D3cold, the device should already in D3hot
> +			 * state.
> +			 */
> +			if (vdev->power_state < PCI_D3hot) {
> +				ret = -EINVAL;
> +				break;
> +			}
> +
> +			if (!vdev->runtime_suspend_pending) {
> +				vdev->runtime_suspend_pending = true;
> +				pm_runtime_put_noidle(&pdev->dev);
> +				request_idle = true;
> +			}

If I call this multiple times, runtime_suspend_pending prevents it from
doing anything, but what should the return value be in that case?  Same
question for exit.

> +
> +			break;
> +
> +		case VFIO_DEVICE_D3COLD_STATE_EXIT:
> +			/*
> +			 * If the runtime resume has already been run, then
> +			 * the device will be already in D0 state.
> +			 */
> +			if (vdev->runtime_suspend_pending) {
> +				vdev->runtime_suspend_pending = false;
> +				pm_runtime_get_noresume(&pdev->dev);
> +				request_resume = true;
> +			}
> +
> +			break;
> +
> +		default:
> +			ret = -EINVAL;
> +			break;
> +		}
> +
> +		up_write(&vdev->memory_lock);
> +
> +		/*
> +		 * Call the runtime PM API's without any lock. Inside vfio driver
> +		 * runtime suspend/resume, the locks can be acquired again.
> +		 */
> +		if (request_idle)
> +			pm_request_idle(&pdev->dev);
> +
> +		if (request_resume)
> +			pm_runtime_resume(&pdev->dev);
> +
> +		return ret;
> +#endif
>  	}
>  
>  	return -ENOTTY;
>  }
> +
> +long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> +			 unsigned long arg)
> +{
> +#ifdef CONFIG_PM
> +	struct vfio_pci_core_device *vdev =
> +		container_of(core_vdev, struct vfio_pci_core_device, vdev);
> +	struct device *dev = &vdev->pdev->dev;
> +	bool skip_runtime_resume = false;
> +	long ret;
> +
> +	/*
> +	 * The list of commands which are safe to execute when the PCI device
> +	 * is in D3cold state. In D3cold state, the PCI config or any other IO
> +	 * access won't work.
> +	 */
> +	switch (cmd) {
> +	case VFIO_DEVICE_POWER_MANAGEMENT:
> +	case VFIO_DEVICE_GET_INFO:
> +	case VFIO_DEVICE_FEATURE:
> +		skip_runtime_resume = true;
> +		break;

How can we know that there won't be DEVICE_FEATURE calls that touch the
device, the recently added migration via DEVICE_FEATURE does already.
DEVICE_GET_INFO seems equally as prone to breaking via capabilities
that could touch the device.  It seems easier to maintain and more
consistent to the user interface if we simply define that any device
access will resume the device.  We need to do something about
interrupts though.  Maybe we could error the user ioctl to set d3cold
for devices running in INTx mode, but we also have numerous ways that
the device could be resumed under the user, which might start
triggering MSI/X interrupts?

> +
> +	default:
> +		break;
> +	}
> +
> +	if (!skip_runtime_resume) {
> +		ret = pm_runtime_resume_and_get(dev);
> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	ret = vfio_pci_core_ioctl_internal(core_vdev, cmd, arg);
> +

I'm not a fan of wrapping the main ioctl interface for power management
like this.

> +	if (!skip_runtime_resume)
> +		pm_runtime_put(dev);
> +
> +	return ret;
> +#else
> +	return vfio_pci_core_ioctl_internal(core_vdev, cmd, arg);
> +#endif
> +}
>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>  
>  static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
> @@ -1897,6 +2017,7 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>  		return -EBUSY;
>  	}
>  
> +	dev_set_drvdata(&pdev->dev, vdev);
>  	if (pci_is_root_bus(pdev->bus)) {
>  		ret = vfio_assign_device_set(&vdev->vdev, vdev);
>  	} else if (!pci_probe_reset_slot(pdev->slot)) {
> @@ -1966,6 +2087,7 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>  		pm_runtime_get_noresume(&pdev->dev);
>  
>  	pm_runtime_forbid(&pdev->dev);
> +	dev_set_drvdata(&pdev->dev, NULL);
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
>  
> @@ -2219,11 +2341,61 @@ static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>  #ifdef CONFIG_PM
>  static int vfio_pci_core_runtime_suspend(struct device *dev)
>  {
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> +
> +	down_read(&vdev->memory_lock);
> +
> +	/*
> +	 * runtime_suspend_pending won't be set if there is no user of vfio pci
> +	 * device. In that case, return early and PCI core will take care of
> +	 * putting the device in the low power state.
> +	 */
> +	if (!vdev->runtime_suspend_pending) {
> +		up_read(&vdev->memory_lock);
> +		return 0;
> +	}

Doesn't this also mean that idle, unused devices can at best sit in
d3hot rather than d3cold?

> +
> +	/*
> +	 * The runtime suspend will be called only if device is already at
> +	 * D3hot state. Now, change the device state from D3hot to D3cold by
> +	 * using platform power management. If setting of D3cold is not
> +	 * supported for the PCI device, then the device state will still be
> +	 * in D3hot state. The PCI core expects to save the PCI state, if
> +	 * driver runtime routine handles the power state management.
> +	 */
> +	pci_save_state(pdev);
> +	pci_platform_power_transition(pdev, PCI_D3cold);
> +	up_read(&vdev->memory_lock);
> +
>  	return 0;
>  }
>  
>  static int vfio_pci_core_runtime_resume(struct device *dev)
>  {
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> +
> +	down_write(&vdev->memory_lock);
> +
> +	/*
> +	 * The PCI core will move the device to D0 state before calling the
> +	 * driver runtime resume.
> +	 */
> +	vfio_pci_set_power_state_locked(vdev, PCI_D0);

Maybe this is where vdev->power_state is kept synchronized?

> +
> +	/*
> +	 * Some PCI device needs the SW involvement before going to D3cold
> +	 * state again. So if there is any wake-up which is not triggered
> +	 * by the guest, then increase the usage count to prevent the
> +	 * second runtime suspend.
> +	 */

Can you give examples of devices that need this and the reason they
need this?  The interface is not terribly deterministic if a random
unprivileged lspci on the host can move devices back to d3hot.  How
useful is this implementation if a notice to the guest of a resumed
device is TBD?  Thanks,

Alex

> +	if (vdev->runtime_suspend_pending) {
> +		vdev->runtime_suspend_pending = false;
> +		pm_runtime_get_noresume(&pdev->dev);
> +	}
> +
> +	up_write(&vdev->memory_lock);
>  	return 0;
>  }
>  
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 05db838e72cc..8bbfd028115a 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -124,6 +124,7 @@ struct vfio_pci_core_device {
>  	bool			needs_reset;
>  	bool			nointx;
>  	bool			needs_pm_restore;
> +	bool			runtime_suspend_pending;
>  	pci_power_t		power_state;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index ef33ea002b0b..7b7dadc6df71 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1002,6 +1002,27 @@ struct vfio_device_feature {
>   */
>  #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN	(0)
>  
> +/**
> + * VFIO_DEVICE_POWER_MANAGEMENT - _IOW(VFIO_TYPE, VFIO_BASE + 18,
> + *			       struct vfio_power_management)
> + *
> + * Provide the support for device power management.  The native PCI power
> + * management does not support the D3cold power state.  For moving the device
> + * into D3cold state, change the PCI state to D3hot with standard
> + * configuration registers and then call this IOCTL to setting the D3cold
> + * state.  Similarly, if the device in D3cold state, then call this IOCTL
> + * to exit from D3cold state.
> + *
> + * Return 0 on success, -errno on failure.
> + */
> +#define VFIO_DEVICE_POWER_MANAGEMENT		_IO(VFIO_TYPE, VFIO_BASE + 18)
> +struct vfio_power_management {
> +	__u32	argsz;
> +#define VFIO_DEVICE_D3COLD_STATE_EXIT		0x0
> +#define VFIO_DEVICE_D3COLD_STATE_ENTER		0x1
> +	__u32	d3cold_state;
> +};
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 5/5] vfio/pci: add the support for PCI D3cold state
  2022-03-09 17:26   ` Alex Williamson
@ 2022-03-11 15:45     ` Abhishek Sahu
  2022-03-11 23:06       ` Alex Williamson
  2022-03-11 16:17     ` Jason Gunthorpe
  1 sibling, 1 reply; 21+ messages in thread
From: Abhishek Sahu @ 2022-03-11 15:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Cornelia Huck, Max Gurtovoy, Yishai Hadas, Zhen Lei,
	Jason Gunthorpe, linux-kernel

On 3/9/2022 10:56 PM, Alex Williamson wrote:
> On Mon, 24 Jan 2022 23:47:26 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> Currently, if the runtime power management is enabled for vfio-pci
>> device in the guest OS, then guest OS will do the register write for
>> PCI_PM_CTRL register. This write request will be handled in
>> vfio_pm_config_write() where it will do the actual register write
>> of PCI_PM_CTRL register. With this, the maximum D3hot state can be
>> achieved for low power. If we can use the runtime PM framework,
>> then we can achieve the D3cold state which will help in saving
>> maximum power.
>>
>> 1. Since the D3cold state can't be achieved by writing PCI standard
>>    PM config registers, this patch adds a new IOCTL which changes the
>>    PCI device from D3hot to D3cold state and then D3cold to D0 state.
>>
>> 2. The hypervisors can implement virtual ACPI methods. For
>>    example, in guest linux OS if PCI device ACPI node has _PR3 and _PR0
>>    power resources with _ON/_OFF method, then guest linux OS makes the
>>    _OFF call during D3cold transition and then _ON during D0 transition.
>>    The hypervisor can tap these virtual ACPI calls and then do the D3cold
>>    related IOCTL in the vfio driver.
>>
>> 3. The vfio driver uses runtime PM framework to achieve the
>>    D3cold state. For the D3cold transition, decrement the usage count and
>>    during D0 transition increment the usage count.
>>
>> 4. For D3cold, the device current power state should be D3hot.
>>    Then during runtime suspend, the pci_platform_power_transition() is
>>    required for D3cold state. If the D3cold state is not supported, then
>>    the device will still be in D3hot state. But with the runtime PM, the
>>    root port can now also go into suspended state.
>>
>> 5. For most of the systems, the D3cold is supported at the root
>>    port level. So, when root port will transition to D3cold state, then
>>    the vfio PCI device will go from D3hot to D3cold state during its
>>    runtime suspend. If the root port does not support D3cold, then the
>>    root port will go into D3hot state.
>>
>> 6. The runtime suspend callback can now happen for 2 cases: there
>>    is no user of vfio device and the case where user has initiated
>>    D3cold. The 'runtime_suspend_pending' flag can help to distinguish
>>    this case.
>>
>> 7. There are cases where guest has put PCI device into D3cold
>>    state and then on the host side, user has run lspci or any other
>>    command which requires access of the PCI config register. In this case,
>>    the kernel runtime PM framework will resume the PCI device internally,
>>    read the config space and put the device into D3cold state again. Some
>>    PCI devices need SW involvement before going into D3cold state.
>>    For the first D3cold transition, the driver running on the guest side
>>    does the SW side steps. But the second D3cold transition will happen
>>    without guest driver involvement. So, prevent this second D3cold
>>    transition by incrementing the device usage count. This will keep the
>>    device unnecessarily in D0, but it's better than failure. In future, we
>>    can add some mechanism by which we can forward these wake-up requests
>>    to the guest, and then the mentioned case can be handled as well.
>>
>> 8. In D3cold, all kind of BAR related access needs to be disabled
>>    like D3hot. Additionally, the config space will also be disabled in
>>    D3cold state. To prevent access of config space in the D3cold state,
>>    increment the runtime PM usage count before doing any config space
>>    access. Also, most of the IOCTLs do the config space access, so
>>    maintain one safe list and skip the resume only for these safe IOCTLs
>>    alone. For other IOCTLs, the runtime PM usage count will be
>>    incremented first.
>>
>> 9. Now, runtime suspend/resume callbacks need to get the vdev
>>    reference which can be obtained by dev_get_drvdata(). Currently, the
>>    dev_set_drvdata() is being set after returning from
>>    vfio_pci_core_register_device(). The runtime callbacks can come
>>    anytime after enabling runtime PM so dev_set_drvdata() must happen
>>    before that. We can move dev_set_drvdata() inside
>>    vfio_pci_core_register_device() itself.
>>
>> 10. The vfio device user can close the device after putting
>>     the device into runtime suspended state so inside
>>     vfio_pci_core_disable(), increment the runtime PM usage count.
>>
>> 11. Runtime PM will be possible only if CONFIG_PM is enabled on
>>     the host. So, the IOCTL related code can be put under CONFIG_PM
>>     Kconfig.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci.c        |   1 -
>>  drivers/vfio/pci/vfio_pci_config.c |  11 +-
>>  drivers/vfio/pci/vfio_pci_core.c   | 186 +++++++++++++++++++++++++++--
>>  include/linux/vfio_pci_core.h      |   1 +
>>  include/uapi/linux/vfio.h          |  21 ++++
>>  5 files changed, 211 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index c8695baf3b54..4ac3338c8fc7 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -153,7 +153,6 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>>       ret = vfio_pci_core_register_device(vdev);
>>       if (ret)
>>               goto out_free;
>> -     dev_set_drvdata(&pdev->dev, vdev);
> 
> Relocating the setting of drvdata should be proposed separately rather
> than buried in this patch.  The driver owns drvdata, the driver is the
> only consumer of drvdata, so pushing this into the core to impose a
> standard for drvdata across all vfio-pci variants doesn't seem like a
> good idea to me.
> 
 
 I will look into this part.
 Mainly, drvdata is needed for the runtime PM callbacks, which are added
 inside the core layer and need to get vdev from the struct device.

>>       return 0;
>>
>>  out_free:
>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>> index dd9ed211ba6f..d20420657959 100644
>> --- a/drivers/vfio/pci/vfio_pci_config.c
>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>> @@ -25,6 +25,7 @@
>>  #include <linux/uaccess.h>
>>  #include <linux/vfio.h>
>>  #include <linux/slab.h>
>> +#include <linux/pm_runtime.h>
>>
>>  #include <linux/vfio_pci_core.h>
>>
>> @@ -1919,16 +1920,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>>  ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>                          size_t count, loff_t *ppos, bool iswrite)
>>  {
>> +     struct device *dev = &vdev->pdev->dev;
>>       size_t done = 0;
>>       int ret = 0;
>>       loff_t pos = *ppos;
>>
>>       pos &= VFIO_PCI_OFFSET_MASK;
>>
>> +     ret = pm_runtime_resume_and_get(dev);
>> +     if (ret < 0)
>> +             return ret;
>> +
>>       while (count) {
>>               ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
>> -             if (ret < 0)
>> +             if (ret < 0) {
>> +                     pm_runtime_put(dev);
>>                       return ret;
>> +             }
>>
>>               count -= ret;
>>               done += ret;
>> @@ -1936,6 +1944,7 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>               pos += ret;
>>       }
>>
>> +     pm_runtime_put(dev);
> 
> What about other config accesses, ex. shared INTx?  We need to
> interact with the device command and status register on an incoming
> interrupt to test if our device sent an interrupt and to mask it.  The
> unmask eventfd can also trigger config space accesses.  Seems
> incomplete relative to config space.
> 

 I will check this path thoroughly.
 But from initial analysis, it seems we have 2 paths here:

 Most of the mentioned functions are called from
 vfio_pci_set_irqs_ioctl(), so pm_runtime_resume_and_get()
 should be called for this ioctl as well in this patch.

 The second path is when we are inside the IRQ handler. For that, we need
 some other mechanism, which I explained below.
 
>>       *ppos += done;
>>
>>       return done;
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 38440d48973f..b70bb4fd940d 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -371,12 +371,23 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>       lockdep_assert_held(&vdev->vdev.dev_set->lock);
>>
>>       /*
>> -      * If disable has been called while the power state is other than D0,
>> -      * then set the power state in vfio driver to D0. It will help
>> -      * in running the logic needed for D0 power state. The subsequent
>> -      * runtime PM API's will put the device into the low power state again.
>> +      * The vfio device user can close the device after putting the device
>> +      * into runtime suspended state so wake up the device first in
>> +      * this case.
>>        */
>> -     vfio_pci_set_power_state_locked(vdev, PCI_D0);
>> +     if (vdev->runtime_suspend_pending) {
>> +             vdev->runtime_suspend_pending = false;
>> +             pm_runtime_resume_and_get(&pdev->dev);
> 
> Doesn't vdev->power_state become unsynchronized from the actual device
> state here and maybe elsewhere in this patch?  (I see below that maybe
> the resume handler accounts for this)
> 

 Yes. Inside runtime resume handler, it is being changed back to D0.

>> +     } else {
>> +             /*
>> +              * If disable has been called while the power state is other
>> +              * than D0, then set the power state in vfio driver to D0. It
>> +              * will help in running the logic needed for D0 power state.
>> +              * The subsequent runtime PM API's will put the device into
>> +              * the low power state again.
>> +              */
>> +             vfio_pci_set_power_state_locked(vdev, PCI_D0);
>> +     }
>>
>>       /* Stop the device from further DMA */
>>       pci_clear_master(pdev);
>> @@ -693,8 +704,8 @@ int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev,
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_pci_register_dev_region);
>>
>> -long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>> -             unsigned long arg)
>> +static long vfio_pci_core_ioctl_internal(struct vfio_device *core_vdev,
>> +                                      unsigned int cmd, unsigned long arg)
>>  {
>>       struct vfio_pci_core_device *vdev =
>>               container_of(core_vdev, struct vfio_pci_core_device, vdev);
>> @@ -1241,10 +1252,119 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>               default:
>>                       return -ENOTTY;
>>               }
>> +#ifdef CONFIG_PM
>> +     } else if (cmd == VFIO_DEVICE_POWER_MANAGEMENT) {
> 
> I'd suggest using a DEVICE_FEATURE ioctl for this.  This ioctl doesn't
> follow the vfio standard of argsz/flags and doesn't seem to do anything
> special that we couldn't achieve with a DEVICE_FEATURE ioctl.
> 

 Sure. DEVICE_FEATURE can help for this.

>> +             struct vfio_power_management vfio_pm;
>> +             struct pci_dev *pdev = vdev->pdev;
>> +             bool request_idle = false, request_resume = false;
>> +             int ret = 0;
>> +
>> +             if (copy_from_user(&vfio_pm, (void __user *)arg, sizeof(vfio_pm)))
>> +                     return -EFAULT;
>> +
>> +             /*
>> +              * The vdev power related fields are protected with memory_lock
>> +              * semaphore.
>> +              */
>> +             down_write(&vdev->memory_lock);
>> +             switch (vfio_pm.d3cold_state) {
>> +             case VFIO_DEVICE_D3COLD_STATE_ENTER:
>> +                     /*
>> +                      * For D3cold, the device should already in D3hot
>> +                      * state.
>> +                      */
>> +                     if (vdev->power_state < PCI_D3hot) {
>> +                             ret = -EINVAL;
>> +                             break;
>> +                     }
>> +
>> +                     if (!vdev->runtime_suspend_pending) {
>> +                             vdev->runtime_suspend_pending = true;
>> +                             pm_runtime_put_noidle(&pdev->dev);
>> +                             request_idle = true;
>> +                     }
> 
> If I call this multiple times, runtime_suspend_pending prevents it from
> doing anything, but what should the return value be in that case?  Same
> question for exit.
> 

 For entry, the user should not request moving the device to D3cold if it
 has already requested it, so we can return an error in this case. For
 exit, currently in this patch, I am clearing runtime_suspend_pending if
 the wake-up is triggered from the host side (with lspci or some other
 command). In that case, the exit should not return an error. Should we
 add code to detect multiple calls of these and ensure that only one
 VFIO_DEVICE_D3COLD_STATE_ENTER/VFIO_DEVICE_D3COLD_STATE_EXIT can be
 called?

>> +
>> +                     break;
>> +
>> +             case VFIO_DEVICE_D3COLD_STATE_EXIT:
>> +                     /*
>> +                      * If the runtime resume has already been run, then
>> +                      * the device will be already in D0 state.
>> +                      */
>> +                     if (vdev->runtime_suspend_pending) {
>> +                             vdev->runtime_suspend_pending = false;
>> +                             pm_runtime_get_noresume(&pdev->dev);
>> +                             request_resume = true;
>> +                     }
>> +
>> +                     break;
>> +
>> +             default:
>> +                     ret = -EINVAL;
>> +                     break;
>> +             }
>> +
>> +             up_write(&vdev->memory_lock);
>> +
>> +             /*
>> +              * Call the runtime PM API's without any lock. Inside vfio driver
>> +              * runtime suspend/resume, the locks can be acquired again.
>> +              */
>> +             if (request_idle)
>> +                     pm_request_idle(&pdev->dev);
>> +
>> +             if (request_resume)
>> +                     pm_runtime_resume(&pdev->dev);
>> +
>> +             return ret;
>> +#endif
>>       }
>>
>>       return -ENOTTY;
>>  }
>> +
>> +long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>> +                      unsigned long arg)
>> +{
>> +#ifdef CONFIG_PM
>> +     struct vfio_pci_core_device *vdev =
>> +             container_of(core_vdev, struct vfio_pci_core_device, vdev);
>> +     struct device *dev = &vdev->pdev->dev;
>> +     bool skip_runtime_resume = false;
>> +     long ret;
>> +
>> +     /*
>> +      * The list of commands which are safe to execute when the PCI device
>> +      * is in D3cold state. In D3cold state, the PCI config or any other IO
>> +      * access won't work.
>> +      */
>> +     switch (cmd) {
>> +     case VFIO_DEVICE_POWER_MANAGEMENT:
>> +     case VFIO_DEVICE_GET_INFO:
>> +     case VFIO_DEVICE_FEATURE:
>> +             skip_runtime_resume = true;
>> +             break;
> 
> How can we know that there won't be DEVICE_FEATURE calls that touch the
> device, the recently added migration via DEVICE_FEATURE does already.
> DEVICE_GET_INFO seems equally as prone to breaking via capabilities
> that could touch the device.  It seems easier to maintain and more
> consistent to the user interface if we simply define that any device
> access will resume the device.

 In that case, we can resume the device for all ioctls without
 maintaining the safe list.

> We need to do something about interrupts though.  Maybe we could error
> the user ioctl to set d3cold
> for devices running in INTx mode, but we also have numerous ways that
> the device could be resumed under the user, which might start
> triggering MSI/X interrupts?
> 

 All this resuming is mainly to prevent any malicious sequence.
 From the normal OS side, once the guest kernel has moved
 the device into D3cold, it should not do any config space
 access. Similarly, the hypervisor should not invoke any
 ioctl other than moving the device back into D0 while the device
 is in D3cold. But preventing the device from going into D3cold
 while any other ioctl or config space access is in progress is not
 easy, so incrementing the usage count before these accesses will
 ensure that the device won't go into D3cold.

 For interrupts, can an interrupt (either INTx or MSI/X) happen
 while the device is in D3cold? In D3cold, PME events are possible,
 and those events will resume the device first anyway. If
 interrupts are not possible, then can we somehow disable all interrupts
 before calling the runtime PM APIs to move the device into D3cold,
 and re-enable them during runtime resume? We could wait for any
 in-flight interrupts to finish first. I am not sure if this is possible.
 
 Returning an error from the user ioctl that sets D3cold while
 interrupts are occurring needs some synchronization between the
 interrupt handler and the ioctl code, and calling runtime resume
 inside the interrupt handler may not be safe.
   
>> +
>> +     default:
>> +             break;
>> +     }
>> +
>> +     if (!skip_runtime_resume) {
>> +             ret = pm_runtime_resume_and_get(dev);
>> +             if (ret < 0)
>> +                     return ret;
>> +     }
>> +
>> +     ret = vfio_pci_core_ioctl_internal(core_vdev, cmd, arg);
>> +
> 
> I'm not a fan of wrapping the main ioctl interface for power management
> like this.
> 

 We need to increment the usage count at entry and decrement it
 again at exit. Currently, many places return directly instead of
 falling through to the end of the function. If we need to
 get rid of the wrapper function, I would have to replace those returns
 with a 'goto' to the end of the function and return after decrementing
 the usage count. Will this be fine?

>> +     if (!skip_runtime_resume)
>> +             pm_runtime_put(dev);
>> +
>> +     return ret;
>> +#else
>> +     return vfio_pci_core_ioctl_internal(core_vdev, cmd, arg);
>> +#endif
>> +}
>>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>>
>>  static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>> @@ -1897,6 +2017,7 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>               return -EBUSY;
>>       }
>>
>> +     dev_set_drvdata(&pdev->dev, vdev);
>>       if (pci_is_root_bus(pdev->bus)) {
>>               ret = vfio_assign_device_set(&vdev->vdev, vdev);
>>       } else if (!pci_probe_reset_slot(pdev->slot)) {
>> @@ -1966,6 +2087,7 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>>               pm_runtime_get_noresume(&pdev->dev);
>>
>>       pm_runtime_forbid(&pdev->dev);
>> +     dev_set_drvdata(&pdev->dev, NULL);
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
>>
>> @@ -2219,11 +2341,61 @@ static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>>  #ifdef CONFIG_PM
>>  static int vfio_pci_core_runtime_suspend(struct device *dev)
>>  {
>> +     struct pci_dev *pdev = to_pci_dev(dev);
>> +     struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>> +
>> +     down_read(&vdev->memory_lock);
>> +
>> +     /*
>> +      * runtime_suspend_pending won't be set if there is no user of vfio pci
>> +      * device. In that case, return early and PCI core will take care of
>> +      * putting the device in the low power state.
>> +      */
>> +     if (!vdev->runtime_suspend_pending) {
>> +             up_read(&vdev->memory_lock);
>> +             return 0;
>> +     }
> 
> Doesn't this also mean that idle, unused devices can at best sit in
> d3hot rather than d3cold?
> 

 Sorry, I didn't get this point.

 For unused devices, the PCI core will move the device into D3cold
 directly. For devices in use, a config space write happens before this
 ioctl is called, and that write moves the device into D3hot, so we
 need some manual handling here.

>> +
>> +     /*
>> +      * The runtime suspend will be called only if device is already at
>> +      * D3hot state. Now, change the device state from D3hot to D3cold by
>> +      * using platform power management. If setting of D3cold is not
>> +      * supported for the PCI device, then the device state will still be
>> +      * in D3hot state. The PCI core expects to save the PCI state, if
>> +      * driver runtime routine handles the power state management.
>> +      */
>> +     pci_save_state(pdev);
>> +     pci_platform_power_transition(pdev, PCI_D3cold);
>> +     up_read(&vdev->memory_lock);
>> +
>>       return 0;
>>  }
>>
>>  static int vfio_pci_core_runtime_resume(struct device *dev)
>>  {
>> +     struct pci_dev *pdev = to_pci_dev(dev);
>> +     struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>> +
>> +     down_write(&vdev->memory_lock);
>> +
>> +     /*
>> +      * The PCI core will move the device to D0 state before calling the
>> +      * driver runtime resume.
>> +      */
>> +     vfio_pci_set_power_state_locked(vdev, PCI_D0);
> 
> Maybe this is where vdev->power_state is kept synchronized?
> 
 
 Yes. vdev->power_state will be changed here.

>> +
>> +     /*
>> +      * Some PCI devices need SW involvement before going to D3cold
>> +      * state again. So if there is any wake-up which is not triggered
>> +      * by the guest, then increase the usage count to prevent a
>> +      * second runtime suspend.
>> +      */
> 
> Can you give examples of devices that need this and the reason they
> need this?  The interface is not terribly deterministic if a random
> unprivileged lspci on the host can move devices back to d3hot. 

 I am not sure about other devices, but this is happening for
 the NVIDIA GPU itself.

 For the NVIDIA GPU, during runtime suspend we keep the GPU video memory
 in self-refresh mode when video memory usage is high. Each video memory
 self-refresh entry before D3cold requires NVIDIA SW involvement;
 without the SW self-refresh sequence, it won't work.

 Details regarding runtime suspend with self-refresh can be found at

 https://download.nvidia.com/XFree86/Linux-x86_64/495.46/README/dynamicpowermanagement.html#VidMemThreshold

 But if GPU video memory usage is low, then we turn off the video memory
 and save all the allocations in system memory. In this case, SW
 involvement is not required.

> How useful is this implementation if a notice to the guest of a resumed
> device is TBD?  Thanks,
> 
> Alex
> 

 I have prototyped this earlier using an eventfd_ctx for PME: whenever
 a resume is triggered by the host, it is forwarded to the hypervisor.
 The hypervisor can then write the virtual root port's PME-related
 registers and send a PME event, which will wake up the PCI device on
 the guest side. This would also help in handling PME-related wake-ups,
 which are currently disabled in PATCH 2 of this patch series.
 
 Thanks,
 Abhishek

>> +     if (vdev->runtime_suspend_pending) {
>> +             vdev->runtime_suspend_pending = false;
>> +             pm_runtime_get_noresume(&pdev->dev);
>> +     }
>> +
>> +     up_write(&vdev->memory_lock);
>>       return 0;
>>  }
>>
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index 05db838e72cc..8bbfd028115a 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -124,6 +124,7 @@ struct vfio_pci_core_device {
>>       bool                    needs_reset;
>>       bool                    nointx;
>>       bool                    needs_pm_restore;
>> +     bool                    runtime_suspend_pending;
>>       pci_power_t             power_state;
>>       struct pci_saved_state  *pci_saved_state;
>>       struct pci_saved_state  *pm_save;
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index ef33ea002b0b..7b7dadc6df71 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -1002,6 +1002,27 @@ struct vfio_device_feature {
>>   */
>>  #define VFIO_DEVICE_FEATURE_PCI_VF_TOKEN     (0)
>>
>> +/**
>> + * VFIO_DEVICE_POWER_MANAGEMENT - _IOW(VFIO_TYPE, VFIO_BASE + 18,
>> + *                          struct vfio_power_management)
>> + *
>> + * Provide the support for device power management.  The native PCI power
>> + * management does not support the D3cold power state.  For moving the device
>> + * into D3cold state, change the PCI state to D3hot with standard
>> + * configuration registers and then call this IOCTL to set the D3cold
>> + * state.  Similarly, if the device is in D3cold state, then call this IOCTL
>> + * to exit from D3cold state.
>> + *
>> + * Return 0 on success, -errno on failure.
>> + */
>> +#define VFIO_DEVICE_POWER_MANAGEMENT         _IO(VFIO_TYPE, VFIO_BASE + 18)
>> +struct vfio_power_management {
>> +     __u32   argsz;
>> +#define VFIO_DEVICE_D3COLD_STATE_EXIT                0x0
>> +#define VFIO_DEVICE_D3COLD_STATE_ENTER               0x1
>> +     __u32   d3cold_state;
>> +};
>> +
>>  /* -------- API for Type1 VFIO IOMMU -------- */
>>
>>  /**
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 5/5] vfio/pci: add the support for PCI D3cold state
  2022-03-09 17:26   ` Alex Williamson
  2022-03-11 15:45     ` Abhishek Sahu
@ 2022-03-11 16:17     ` Jason Gunthorpe
  1 sibling, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2022-03-11 16:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Abhishek Sahu, kvm, Cornelia Huck, Max Gurtovoy, Yishai Hadas,
	Zhen Lei, linux-kernel

On Wed, Mar 09, 2022 at 10:26:42AM -0700, Alex Williamson wrote:

> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index c8695baf3b54..4ac3338c8fc7 100644
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -153,7 +153,6 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  	ret = vfio_pci_core_register_device(vdev);
> >  	if (ret)
> >  		goto out_free;
> > -	dev_set_drvdata(&pdev->dev, vdev);
> 
> Relocating the setting of drvdata should be proposed separately rather
> than buried in this patch.  The driver owns drvdata, the driver is the
> only consumer of drvdata, so pushing this into the core to impose a
> standard for drvdata across all vfio-pci variants doesn't seem like a
> good idea to me.

I've been wanting to do this for another reason - there are a few
places in the core vfio-pci that convert a struct device to a
vfio_device the slow way, when the drvdata is the right way to do it.

So either having the core code set it or requiring drivers to set it to
the vfio_pci_core_device pointer seems necessary.

But yes, it should be a separate patch

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 5/5] vfio/pci: add the support for PCI D3cold state
  2022-03-11 15:45     ` Abhishek Sahu
@ 2022-03-11 23:06       ` Alex Williamson
  2022-03-16  5:41         ` Abhishek Sahu
  0 siblings, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2022-03-11 23:06 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: kvm, Cornelia Huck, Max Gurtovoy, Yishai Hadas, Zhen Lei,
	Jason Gunthorpe, linux-kernel

On Fri, 11 Mar 2022 21:15:38 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 3/9/2022 10:56 PM, Alex Williamson wrote:
> > On Mon, 24 Jan 2022 23:47:26 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> Currently, if the runtime power management is enabled for vfio-pci
> >> device in the guest OS, then guest OS will do the register write for
> >> PCI_PM_CTRL register. This write request will be handled in
> >> vfio_pm_config_write() where it will do the actual register write
> >> of PCI_PM_CTRL register. With this, the maximum D3hot state can be
> >> achieved for low power. If we can use the runtime PM framework,
> >> then we can achieve the D3cold state which will help in saving
> >> maximum power.
> >>
> >> 1. Since D3cold state can't be achieved by writing PCI standard
> >>    PM config registers, so this patch adds a new IOCTL which change the
> >>    PCI device from D3hot to D3cold state and then D3cold to D0 state.
> >>
> >> 2. The hypervisors can implement virtual ACPI methods. For
> >>    example, in guest linux OS if PCI device ACPI node has _PR3 and _PR0
> >>    power resources with _ON/_OFF method, then guest linux OS makes the
> >>    _OFF call during D3cold transition and then _ON during D0 transition.
> >>    The hypervisor can tap these virtual ACPI calls and then do the D3cold
> >>    related IOCTL in the vfio driver.
> >>
> >> 3. The vfio driver uses runtime PM framework to achieve the
> >>    D3cold state. For the D3cold transition, decrement the usage count and
> >>    during D0 transition increment the usage count.
> >>
> >> 4. For D3cold, the device current power state should be D3hot.
> >>    Then during runtime suspend, the pci_platform_power_transition() is
> >>    required for D3cold state. If the D3cold state is not supported, then
> >>    the device will still be in D3hot state. But with the runtime PM, the
> >>    root port can now also go into suspended state.
> >>
> >> 5. For most of the systems, the D3cold is supported at the root
> >>    port level. So, when root port will transition to D3cold state, then
> >>    the vfio PCI device will go from D3hot to D3cold state during its
> >>    runtime suspend. If root port does not support D3cold, then the root
> >>    will go into D3hot state.
> >>
> >> 6. The runtime suspend callback can now happen for 2 cases: there
> >>    is no user of vfio device and the case where user has initiated
> >>    D3cold. The 'runtime_suspend_pending' flag can help to distinguish
> >>    this case.
> >>
> >> 7. There are cases where guest has put PCI device into D3cold
> >>    state and then on the host side, user has run lspci or any other
> >>    command which requires access of the PCI config register. In this case,
> >>    the kernel runtime PM framework will resume the PCI device internally,
> >>    read the config space and put the device into D3cold state again. Some
> >>    PCI devices need SW involvement before going into D3cold state.
> >>    For the first D3cold transition, the driver running on the guest side
> >>    does the SW-side steps. But the second D3cold transition will happen
> >>    without guest driver involvement. So, prevent this second D3cold
> >>    transition by incrementing the device usage count. This will keep the
> >>    device unnecessarily in D0, but it's better than failure. In the
> >>    future, we can add some mechanism to forward these wake-up requests to
> >>    the guest, and then the mentioned case can be handled as well.
> >>
> >> 8. In D3cold, all kind of BAR related access needs to be disabled
> >>    like D3hot. Additionally, the config space will also be disabled in
> >>    D3cold state. To prevent access of config space in the D3cold state,
> >>    increment the runtime PM usage count before doing any config space
> >>    access. Also, most of the IOCTLs do the config space access, so
> >>    maintain one safe list and skip the resume only for these safe IOCTLs
> >>    alone. For other IOCTLs, the runtime PM usage count will be
> >>    incremented first.
> >>
> >> 9. Now, runtime suspend/resume callbacks need to get the vdev
> >>    reference which can be obtained by dev_get_drvdata(). Currently, the
> >>    dev_set_drvdata() is being set after returning from
> >>    vfio_pci_core_register_device(). The runtime callbacks can come
> >>    anytime after enabling runtime PM so dev_set_drvdata() must happen
> >>    before that. We can move dev_set_drvdata() inside
> >>    vfio_pci_core_register_device() itself.
> >>
> >> 10. The vfio device user can close the device after putting
> >>     the device into runtime suspended state so inside
> >>     vfio_pci_core_disable(), increment the runtime PM usage count.
> >>
> >> 11. Runtime PM will be possible only if CONFIG_PM is enabled on
> >>     the host. So, the IOCTL related code can be put under CONFIG_PM
> >>     Kconfig.
> >>
> >> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >> ---
> >>  drivers/vfio/pci/vfio_pci.c        |   1 -
> >>  drivers/vfio/pci/vfio_pci_config.c |  11 +-
> >>  drivers/vfio/pci/vfio_pci_core.c   | 186 +++++++++++++++++++++++++++--
> >>  include/linux/vfio_pci_core.h      |   1 +
> >>  include/uapi/linux/vfio.h          |  21 ++++
> >>  5 files changed, 211 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> >> index c8695baf3b54..4ac3338c8fc7 100644
> >> --- a/drivers/vfio/pci/vfio_pci.c
> >> +++ b/drivers/vfio/pci/vfio_pci.c
> >> @@ -153,7 +153,6 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >>       ret = vfio_pci_core_register_device(vdev);
> >>       if (ret)
> >>               goto out_free;
> >> -     dev_set_drvdata(&pdev->dev, vdev);  
> > 
> > Relocating the setting of drvdata should be proposed separately rather
> > than buried in this patch.  The driver owns drvdata, the driver is the
> > only consumer of drvdata, so pushing this into the core to impose a
> > standard for drvdata across all vfio-pci variants doesn't seem like a
> > good idea to me.
> >   
>  
>  I will check regarding this part.
>  Mainly, drvdata is needed for the runtime PM callbacks, which are added
>  inside the core layer and need to get the vdev from a struct device.
> 
> >>       return 0;
> >>
> >>  out_free:
> >> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> >> index dd9ed211ba6f..d20420657959 100644
> >> --- a/drivers/vfio/pci/vfio_pci_config.c
> >> +++ b/drivers/vfio/pci/vfio_pci_config.c
> >> @@ -25,6 +25,7 @@
> >>  #include <linux/uaccess.h>
> >>  #include <linux/vfio.h>
> >>  #include <linux/slab.h>
> >> +#include <linux/pm_runtime.h>
> >>
> >>  #include <linux/vfio_pci_core.h>
> >>
> >> @@ -1919,16 +1920,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
> >>  ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
> >>                          size_t count, loff_t *ppos, bool iswrite)
> >>  {
> >> +     struct device *dev = &vdev->pdev->dev;
> >>       size_t done = 0;
> >>       int ret = 0;
> >>       loff_t pos = *ppos;
> >>
> >>       pos &= VFIO_PCI_OFFSET_MASK;
> >>
> >> +     ret = pm_runtime_resume_and_get(dev);
> >> +     if (ret < 0)
> >> +             return ret;
> >> +
> >>       while (count) {
> >>               ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
> >> -             if (ret < 0)
> >> +             if (ret < 0) {
> >> +                     pm_runtime_put(dev);
> >>                       return ret;
> >> +             }
> >>
> >>               count -= ret;
> >>               done += ret;
> >> @@ -1936,6 +1944,7 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
> >>               pos += ret;
> >>       }
> >>
> >> +     pm_runtime_put(dev);  
> > 
> > What about other config accesses, ex. shared INTx?  We need to
> > interact with the device command and status register on an incoming
> > interrupt to test if our device sent an interrupt and to mask it.  The
> > unmask eventfd can also trigger config space accesses.  Seems
> > incomplete relative to config space.
> >   
> 
>  I will check this path thoroughly.
>  But from an initial analysis, it seems we have 2 paths here:
> 
>  Most of the mentioned functions are called from
>  vfio_pci_set_irqs_ioctl(), so pm_runtime_resume_and_get()
>  should be called for this ioctl as well in this patch.
> 
>  The second path is when we are inside the IRQ handler. For that, we
>  need some other mechanism, which I explained below.
>  
> >>       *ppos += done;
> >>
> >>       return done;
> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >> index 38440d48973f..b70bb4fd940d 100644
> >> --- a/drivers/vfio/pci/vfio_pci_core.c
> >> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >> @@ -371,12 +371,23 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
> >>       lockdep_assert_held(&vdev->vdev.dev_set->lock);
> >>
> >>       /*
> >> -      * If disable has been called while the power state is other than D0,
> >> -      * then set the power state in vfio driver to D0. It will help
> >> -      * in running the logic needed for D0 power state. The subsequent
> >> -      * runtime PM API's will put the device into the low power state again.
> >> +      * The vfio device user can close the device after putting the device
> >> +      * into runtime suspended state so wake up the device first in
> >> +      * this case.
> >>        */
> >> -     vfio_pci_set_power_state_locked(vdev, PCI_D0);
> >> +     if (vdev->runtime_suspend_pending) {
> >> +             vdev->runtime_suspend_pending = false;
> >> +             pm_runtime_resume_and_get(&pdev->dev);  
> > 
> > Doesn't vdev->power_state become unsynchronized from the actual device
> > state here and maybe elsewhere in this patch?  (I see below that maybe
> > the resume handler accounts for this)
> >   
> 
>  Yes. Inside the runtime resume handler, it is changed back to D0.
> 
> >> +     } else {
> >> +             /*
> >> +              * If disable has been called while the power state is other
> >> +              * than D0, then set the power state in vfio driver to D0. It
> >> +              * will help in running the logic needed for D0 power state.
> >> +              * The subsequent runtime PM API's will put the device into
> >> +              * the low power state again.
> >> +              */
> >> +             vfio_pci_set_power_state_locked(vdev, PCI_D0);
> >> +     }
> >>
> >>       /* Stop the device from further DMA */
> >>       pci_clear_master(pdev);
> >> @@ -693,8 +704,8 @@ int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev,
> >>  }
> >>  EXPORT_SYMBOL_GPL(vfio_pci_register_dev_region);
> >>
> >> -long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> >> -             unsigned long arg)
> >> +static long vfio_pci_core_ioctl_internal(struct vfio_device *core_vdev,
> >> +                                      unsigned int cmd, unsigned long arg)
> >>  {
> >>       struct vfio_pci_core_device *vdev =
> >>               container_of(core_vdev, struct vfio_pci_core_device, vdev);
> >> @@ -1241,10 +1252,119 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> >>               default:
> >>                       return -ENOTTY;
> >>               }
> >> +#ifdef CONFIG_PM
> >> +     } else if (cmd == VFIO_DEVICE_POWER_MANAGEMENT) {  
> > 
> > I'd suggest using a DEVICE_FEATURE ioctl for this.  This ioctl doesn't
> > follow the vfio standard of argsz/flags and doesn't seem to do anything
> > special that we couldn't achieve with a DEVICE_FEATURE ioctl.
> >   
> 
>  Sure. DEVICE_FEATURE can help for this.
> 
> >> +             struct vfio_power_management vfio_pm;
> >> +             struct pci_dev *pdev = vdev->pdev;
> >> +             bool request_idle = false, request_resume = false;
> >> +             int ret = 0;
> >> +
> >> +             if (copy_from_user(&vfio_pm, (void __user *)arg, sizeof(vfio_pm)))
> >> +                     return -EFAULT;
> >> +
> >> +             /*
> >> +              * The vdev power related fields are protected with memory_lock
> >> +              * semaphore.
> >> +              */
> >> +             down_write(&vdev->memory_lock);
> >> +             switch (vfio_pm.d3cold_state) {
> >> +             case VFIO_DEVICE_D3COLD_STATE_ENTER:
> >> +                     /*
> >> +                      * For D3cold, the device should already in D3hot
> >> +                      * state.
> >> +                      */
> >> +                     if (vdev->power_state < PCI_D3hot) {
> >> +                             ret = -EINVAL;
> >> +                             break;
> >> +                     }
> >> +
> >> +                     if (!vdev->runtime_suspend_pending) {
> >> +                             vdev->runtime_suspend_pending = true;
> >> +                             pm_runtime_put_noidle(&pdev->dev);
> >> +                             request_idle = true;
> >> +                     }  
> > 
> > If I call this multiple times, runtime_suspend_pending prevents it from
> > doing anything, but what should the return value be in that case?  Same
> > question for exit.
> >   
> 
>  For entry, the user should not request D3cold again if it has already
>  been requested, so we can return an error in that case. For exit,
>  currently, this patch clears runtime_suspend_pending if the wake-up is
>  triggered from the host side (with lspci or some other command), and in
>  that case the exit should not return an error. Should we add code to
>  detect repeated calls and ensure that VFIO_DEVICE_D3COLD_STATE_ENTER and
>  VFIO_DEVICE_D3COLD_STATE_EXIT can each only be called once?

AIUI, the argument is that we can't re-enter d3cold w/o guest driver
support, so if an lspci unknown to the device user were to wake the
device, it seems the user would see arbitrarily different results when
attempting to put the device to sleep again.

> >> +
> >> +                     break;
> >> +
> >> +             case VFIO_DEVICE_D3COLD_STATE_EXIT:
> >> +                     /*
> >> +                      * If the runtime resume has already been run, then
> >> +                      * the device will be already in D0 state.
> >> +                      */
> >> +                     if (vdev->runtime_suspend_pending) {
> >> +                             vdev->runtime_suspend_pending = false;
> >> +                             pm_runtime_get_noresume(&pdev->dev);
> >> +                             request_resume = true;
> >> +                     }
> >> +
> >> +                     break;
> >> +
> >> +             default:
> >> +                     ret = -EINVAL;
> >> +                     break;
> >> +             }
> >> +
> >> +             up_write(&vdev->memory_lock);
> >> +
> >> +             /*
> >> +              * Call the runtime PM API's without any lock. Inside vfio driver
> >> +              * runtime suspend/resume, the locks can be acquired again.
> >> +              */
> >> +             if (request_idle)
> >> +                     pm_request_idle(&pdev->dev);
> >> +
> >> +             if (request_resume)
> >> +                     pm_runtime_resume(&pdev->dev);
> >> +
> >> +             return ret;
> >> +#endif
> >>       }
> >>
> >>       return -ENOTTY;
> >>  }
> >> +
> >> +long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> >> +                      unsigned long arg)
> >> +{
> >> +#ifdef CONFIG_PM
> >> +     struct vfio_pci_core_device *vdev =
> >> +             container_of(core_vdev, struct vfio_pci_core_device, vdev);
> >> +     struct device *dev = &vdev->pdev->dev;
> >> +     bool skip_runtime_resume = false;
> >> +     long ret;
> >> +
> >> +     /*
> >> +      * The list of commands which are safe to execute when the PCI device
> >> +      * is in D3cold state. In D3cold state, the PCI config or any other IO
> >> +      * access won't work.
> >> +      */
> >> +     switch (cmd) {
> >> +     case VFIO_DEVICE_POWER_MANAGEMENT:
> >> +     case VFIO_DEVICE_GET_INFO:
> >> +     case VFIO_DEVICE_FEATURE:
> >> +             skip_runtime_resume = true;
> >> +             break;  
> > 
> > How can we know that there won't be DEVICE_FEATURE calls that touch the
> > device, the recently added migration via DEVICE_FEATURE does already.
> > DEVICE_GET_INFO seems equally as prone to breaking via capabilities
> > that could touch the device.  It seems easier to maintain and more
> > consistent to the user interface if we simply define that any device
> > access will resume the device.  
> 
>  In that case, we can resume the device for all ioctls without
>  maintaining the safe list.
> 
> > We need to do something about interrupts though.  Maybe we could error
> > the user ioctl to set d3cold
> > for devices running in INTx mode, but we also have numerous ways that
> > the device could be resumed under the user, which might start
> > triggering MSI/X interrupts?
> >   
> 
>  All this resuming is mainly to prevent any malicious sequence.
>  From the normal OS side, once the guest kernel has moved
>  the device into D3cold, it should not do any config space
>  access. Similarly, the hypervisor should not invoke any
>  ioctl other than moving the device back into D0 while the device
>  is in D3cold. But preventing the device from going into D3cold
>  while any other ioctl or config space access is in progress is not
>  easy, so incrementing the usage count before these accesses will
>  ensure that the device won't go into D3cold.
> 
>  For interrupts, can an interrupt (either INTx or MSI/X) happen
>  while the device is in D3cold?

The device itself shouldn't be generating interrupts and we don't share
MSI interrupts between devices (afaik), but we do share INTx interrupts.

>  In D3cold, PME events are possible,
>  and those events will resume the device first anyway. If
>  interrupts are not possible, then can we somehow disable all interrupts
>  before calling the runtime PM APIs to move the device into D3cold,
>  and re-enable them during runtime resume? We could wait for any
>  in-flight interrupts to finish first. I am not sure if this is possible.

In the case of shared INTx, it's not just inflight interrupts.
Personally I wouldn't have an issue if we increment the usage counter
when INTx is in use to simply avoid the issue, but does that invalidate
the use case you're trying to enable?  Otherwise I think we'd need to
remove and re-add the handler around d3cold.

>  Returning an error from the user ioctl that sets D3cold while
>  interrupts are occurring needs some synchronization between the
>  interrupt handler and the ioctl code, and calling runtime resume
>  inside the interrupt handler may not be safe.

It's not a race condition to synchronize, it's simply that a shared
INTx interrupt can occur at any time and we need to make sure we don't
touch the device when that occurs, either by preventing d3cold and INTx
in combination, removing the handler, or maybe adding a test in the
handler to not touch the device - for either of the latter we need to
be sure we're not risking introducing interrupt storms by being out of
sync with the device state.

> >> +
> >> +     default:
> >> +             break;
> >> +     }
> >> +
> >> +     if (!skip_runtime_resume) {
> >> +             ret = pm_runtime_resume_and_get(dev);
> >> +             if (ret < 0)
> >> +                     return ret;
> >> +     }
> >> +
> >> +     ret = vfio_pci_core_ioctl_internal(core_vdev, cmd, arg);
> >> +  
> > 
> > I'm not a fan of wrapping the main ioctl interface for power management
> > like this.
> >   
> 
>  We need to increment the usage count at entry and decrement it
>  again at exit. Currently, in a lot of places, we are returning
>  directly instead of falling through to the end of the function.
>  If we need to get rid of the wrapper function, then I need to
>  replace every 'return' with a 'goto' to the end of the function
>  and return after decrementing the usage count. Will this be fine?


Yes, I think that would be preferable.
 
 
> >> +     if (!skip_runtime_resume)
> >> +             pm_runtime_put(dev);
> >> +
> >> +     return ret;
> >> +#else
> >> +     return vfio_pci_core_ioctl_internal(core_vdev, cmd, arg);
> >> +#endif
> >> +}
> >>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
> >>
> >>  static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
> >> @@ -1897,6 +2017,7 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
> >>               return -EBUSY;
> >>       }
> >>
> >> +     dev_set_drvdata(&pdev->dev, vdev);
> >>       if (pci_is_root_bus(pdev->bus)) {
> >>               ret = vfio_assign_device_set(&vdev->vdev, vdev);
> >>       } else if (!pci_probe_reset_slot(pdev->slot)) {
> >> @@ -1966,6 +2087,7 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
> >>               pm_runtime_get_noresume(&pdev->dev);
> >>
> >>       pm_runtime_forbid(&pdev->dev);
> >> +     dev_set_drvdata(&pdev->dev, NULL);
> >>  }
> >>  EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
> >>
> >> @@ -2219,11 +2341,61 @@ static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
> >>  #ifdef CONFIG_PM
> >>  static int vfio_pci_core_runtime_suspend(struct device *dev)
> >>  {
> >> +     struct pci_dev *pdev = to_pci_dev(dev);
> >> +     struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> >> +
> >> +     down_read(&vdev->memory_lock);
> >> +
> >> +     /*
> >> +      * runtime_suspend_pending won't be set if there is no user of vfio pci
> >> +      * device. In that case, return early and PCI core will take care of
> >> +      * putting the device in the low power state.
> >> +      */
> >> +     if (!vdev->runtime_suspend_pending) {
> >> +             up_read(&vdev->memory_lock);
> >> +             return 0;
> >> +     }  
> > 
> > Doesn't this also mean that idle, unused devices can at best sit in
> > d3hot rather than d3cold?
> >   
> 
>  Sorry. I didn't get this point.
> 
>  For unused devices, the PCI core will move the device into D3cold directly.

Could you point out what path triggers that?  I inferred that this
function would be called any time the usage count allows transition to
d3cold and the above test would prevent the device entering d3cold
unless the user requested it.

>  For devices in use, the config space write happens first, before
>  this ioctl is called, and that write moves the device into D3hot,
>  so we need to do some manual handling here.

Why is it that a user owned device cannot re-enter d3cold without
driver support, but an idle device does?  Simply because we expect to
reset the device before returning it back to the host or exposing it to
a user?  I'd expect that after d3cold->d0 we're essentially at a
power-on state, which ideally would be similar to a post-reset state,
so I don't follow how driver support factors in to re-entering d3cold.

> >> +
> >> +     /*
> >> +      * The runtime suspend will be called only if device is already at
> >> +      * D3hot state. Now, change the device state from D3hot to D3cold by
> >> +      * using platform power management. If setting of D3cold is not
> >> +      * supported for the PCI device, then the device state will still be
> >> +      * in D3hot state. The PCI core expects to save the PCI state, if
> >> +      * driver runtime routine handles the power state management.
> >> +      */
> >> +     pci_save_state(pdev);
> >> +     pci_platform_power_transition(pdev, PCI_D3cold);
> >> +     up_read(&vdev->memory_lock);
> >> +
> >>       return 0;
> >>  }
> >>
> >>  static int vfio_pci_core_runtime_resume(struct device *dev)
> >>  {
> >> +     struct pci_dev *pdev = to_pci_dev(dev);
> >> +     struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> >> +
> >> +     down_write(&vdev->memory_lock);
> >> +
> >> +     /*
> >> +      * The PCI core will move the device to D0 state before calling the
> >> +      * driver runtime resume.
> >> +      */
> >> +     vfio_pci_set_power_state_locked(vdev, PCI_D0);  
> > 
> > Maybe this is where vdev->power_state is kept synchronized?
> >   
>  
>  Yes. vdev->power_state will be changed here.
> 
> >> +
> >> +     /*
> >> +      * Some PCI device needs the SW involvement before going to D3cold
> >> +      * state again. So if there is any wake-up which is not triggered
> >> +      * by the guest, then increase the usage count to prevent the
> >> +      * second runtime suspend.
> >> +      */  
> > 
> > Can you give examples of devices that need this and the reason they
> > need this?  The interface is not terribly deterministic if a random
> > unprivileged lspci on the host can move devices back to d3hot.   
> 
>  I am not sure about other devices, but this happens for
>  NVIDIA GPUs.
>  
>  For an NVIDIA GPU, during runtime suspend, we keep the GPU video
>  memory in self-refresh mode when video memory usage is high. Each
>  video memory self-refresh entry before D3cold requires NVIDIA SW
>  involvement. Without the SW self-refresh sequence, it won't work.


So we're exposing acpi power interfaces to turn a device off, which
don't really turn the device off, but leave it in some sort of
low-power memory refresh state, rather than a fully off state as I had
assumed above.  Does this suggest the host firmware ACPI has knowledge
of the device and does different things?

>  Details regarding runtime suspend with self-refresh can be found in
> 
>  https://download.nvidia.com/XFree86/Linux-x86_64/495.46/README/dynamicpowermanagement.html#VidMemThreshold
> 
>  But if GPU video memory usage is low, then we turn off the video
>  memory and save all the allocations in system memory. In this
>  case, SW involvement is not required.

Ok, so there's some heuristically determined vram usage where the
driver favors suspend latency versus power savings and somehow keeps
the device in this low-power, refresh state versus a fully off state.
How unique is this behavior to NVIDIA devices?  It seems like we're
trying to add d3cold, but special case it based on a device that might
have a rather quirky d3cold behavior.  Is there something we can test
about the state of the device to know which mode it's using?  Is there
something we can virtualize on the device to force the driver to use
the higher latency, lower power d3cold mode that results in fewer
restrictions?  Or maybe this is just common practice?

> > How useful is this implementation if a notice to the guest of a resumed
> > device is TBD?  Thanks,
> > 
> > Alex
> >   
> 
>  I prototyped this earlier using an eventfd_ctx for PME: whenever
>  a resume is triggered by the host, it is forwarded to the
>  hypervisor. The hypervisor can then write to the virtual root
>  port's PME-related registers and send a PME event, which will
>  wake up the PCI device on the guest side. This would also help in
>  handling PME-related wake-ups, which are currently disabled in
>  PATCH 2 of this patch series.

But then what does the guest do with the device?  For example, if we
have a VM with an assigned GPU running an idle desktop where the
monitor has gone into power save, does running lspci on the host
randomly wake the desktop and monitor?  I'd like to understand how
unique the return to d3cold behavior is to this device and whether we
can restrict that in some way.  An option that's now at our disposal
would be to create an NVIDIA GPU variant of vfio-pci that has
sufficient device knowledge to perhaps retrigger the vram refresh
d3cold state rather than lose vram data going into a standard d3cold
state.  Thanks,

Alex



* Re: [RFC PATCH v2 5/5] vfio/pci: add the support for PCI D3cold state
  2022-03-11 23:06       ` Alex Williamson
@ 2022-03-16  5:41         ` Abhishek Sahu
  2022-03-16 18:44           ` Alex Williamson
  0 siblings, 1 reply; 21+ messages in thread
From: Abhishek Sahu @ 2022-03-16  5:41 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Cornelia Huck, Max Gurtovoy, Yishai Hadas, Zhen Lei,
	Jason Gunthorpe, linux-kernel

On 3/12/2022 4:36 AM, Alex Williamson wrote:
> On Fri, 11 Mar 2022 21:15:38 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 3/9/2022 10:56 PM, Alex Williamson wrote:
>>> On Mon, 24 Jan 2022 23:47:26 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>
>>>> Currently, if the runtime power management is enabled for vfio-pci
>>>> device in the guest OS, then guest OS will do the register write for
>>>> PCI_PM_CTRL register. This write request will be handled in
>>>> vfio_pm_config_write() where it will do the actual register write
>>>> of PCI_PM_CTRL register. With this, the maximum D3hot state can be
>>>> achieved for low power. If we can use the runtime PM framework,
>>>> then we can achieve the D3cold state which will help in saving
>>>> maximum power.
>>>>
>>>> 1. Since D3cold state can't be achieved by writing PCI standard
>>>>    PM config registers, so this patch adds a new IOCTL which change the
>>>>    PCI device from D3hot to D3cold state and then D3cold to D0 state.
>>>>
>>>> 2. The hypervisors can implement virtual ACPI methods. For
>>>>    example, in guest linux OS if PCI device ACPI node has _PR3 and _PR0
>>>>    power resources with _ON/_OFF method, then guest linux OS makes the
>>>>    _OFF call during D3cold transition and then _ON during D0 transition.
>>>>    The hypervisor can tap these virtual ACPI calls and then do the D3cold
>>>>    related IOCTL in the vfio driver.
>>>>
>>>> 3. The vfio driver uses runtime PM framework to achieve the
>>>>    D3cold state. For the D3cold transition, decrement the usage count and
>>>>    during D0 transition increment the usage count.
>>>>
>>>> 4. For D3cold, the device current power state should be D3hot.
>>>>    Then during runtime suspend, the pci_platform_power_transition() is
>>>>    required for D3cold state. If the D3cold state is not supported, then
>>>>    the device will still be in D3hot state. But with the runtime PM, the
>>>>    root port can now also go into suspended state.
>>>>
>>>> 5. For most of the systems, the D3cold is supported at the root
>>>>    port level. So, when root port will transition to D3cold state, then
>>>>    the vfio PCI device will go from D3hot to D3cold state during its
>>>>    runtime suspend. If root port does not support D3cold, then the root
>>>>    will go into D3hot state.
>>>>
>>>> 6. The runtime suspend callback can now happen for 2 cases: there
>>>>    is no user of vfio device and the case where user has initiated
>>>>    D3cold. The 'runtime_suspend_pending' flag can help to distinguish
>>>>    this case.
>>>>
>>>> 7. There are cases where guest has put PCI device into D3cold
>>>>    state and then on the host side, user has run lspci or any other
>>>>    command which requires access of the PCI config register. In this case,
>>>>    the kernel runtime PM framework will resume the PCI device internally,
>>>>    read the config space and put the device into D3cold state again. Some
>>>>    PCI device needs the SW involvement before going into D3cold state.
>>>>    For the first D3cold state, the driver running in guest side does the SW
>>>>    side steps. But the second D3cold transition will be without guest
>>>>    driver involvement. So, prevent this second d3cold transition by
>>>>    incrementing the device usage count. This will make the device
>>>>    unnecessary in D0 but it's better than failure. In future, we can some
>>>>    mechanism by which we can forward these wake-up request to guest and
>>>>    then the mentioned case can be handled also.
>>>>
>>>> 8. In D3cold, all kind of BAR related access needs to be disabled
>>>>    like D3hot. Additionally, the config space will also be disabled in
>>>>    D3cold state. To prevent access of config space in the D3cold state,
>>>>    increment the runtime PM usage count before doing any config space
>>>>    access. Also, most of the IOCTLs do the config space access, so
>>>>    maintain one safe list and skip the resume only for these safe IOCTLs
>>>>    alone. For other IOCTLs, the runtime PM usage count will be
>>>>    incremented first.
>>>>
>>>> 9. Now, runtime suspend/resume callbacks need to get the vdev
>>>>    reference which can be obtained by dev_get_drvdata(). Currently, the
>>>>    dev_set_drvdata() is being set after returning from
>>>>    vfio_pci_core_register_device(). The runtime callbacks can come
>>>>    anytime after enabling runtime PM so dev_set_drvdata() must happen
>>>>    before that. We can move dev_set_drvdata() inside
>>>>    vfio_pci_core_register_device() itself.
>>>>
>>>> 10. The vfio device user can close the device after putting
>>>>     the device into runtime suspended state so inside
>>>>     vfio_pci_core_disable(), increment the runtime PM usage count.
>>>>
>>>> 11. Runtime PM will be possible only if CONFIG_PM is enabled on
>>>>     the host. So, the IOCTL related code can be put under CONFIG_PM
>>>>     Kconfig.
>>>>
>>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>>>> ---
>>>>  drivers/vfio/pci/vfio_pci.c        |   1 -
>>>>  drivers/vfio/pci/vfio_pci_config.c |  11 +-
>>>>  drivers/vfio/pci/vfio_pci_core.c   | 186 +++++++++++++++++++++++++++--
>>>>  include/linux/vfio_pci_core.h      |   1 +
>>>>  include/uapi/linux/vfio.h          |  21 ++++
>>>>  5 files changed, 211 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>>>> index c8695baf3b54..4ac3338c8fc7 100644
>>>> --- a/drivers/vfio/pci/vfio_pci.c
>>>> +++ b/drivers/vfio/pci/vfio_pci.c
>>>> @@ -153,7 +153,6 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>>>>       ret = vfio_pci_core_register_device(vdev);
>>>>       if (ret)
>>>>               goto out_free;
>>>> -     dev_set_drvdata(&pdev->dev, vdev);
>>>
>>> Relocating the setting of drvdata should be proposed separately rather
>>> than buried in this patch.  The driver owns drvdata, the driver is the
>>> only consumer of drvdata, so pushing this into the core to impose a
>>> standard for drvdata across all vfio-pci variants doesn't seem like a
>>> good idea to me.
>>>
>>
>>  I will check regarding this part.
>>  The drvdata is mainly needed for the runtime PM callbacks, which
>>  are added inside the core layer, where we need to get the vdev
>>  from the struct device.
>>
>>>>       return 0;
>>>>
>>>>  out_free:
>>>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>>>> index dd9ed211ba6f..d20420657959 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_config.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>>>> @@ -25,6 +25,7 @@
>>>>  #include <linux/uaccess.h>
>>>>  #include <linux/vfio.h>
>>>>  #include <linux/slab.h>
>>>> +#include <linux/pm_runtime.h>
>>>>
>>>>  #include <linux/vfio_pci_core.h>
>>>>
>>>> @@ -1919,16 +1920,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>>>>  ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>>>                          size_t count, loff_t *ppos, bool iswrite)
>>>>  {
>>>> +     struct device *dev = &vdev->pdev->dev;
>>>>       size_t done = 0;
>>>>       int ret = 0;
>>>>       loff_t pos = *ppos;
>>>>
>>>>       pos &= VFIO_PCI_OFFSET_MASK;
>>>>
>>>> +     ret = pm_runtime_resume_and_get(dev);
>>>> +     if (ret < 0)
>>>> +             return ret;
>>>> +
>>>>       while (count) {
>>>>               ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
>>>> -             if (ret < 0)
>>>> +             if (ret < 0) {
>>>> +                     pm_runtime_put(dev);
>>>>                       return ret;
>>>> +             }
>>>>
>>>>               count -= ret;
>>>>               done += ret;
>>>> @@ -1936,6 +1944,7 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>>>               pos += ret;
>>>>       }
>>>>
>>>> +     pm_runtime_put(dev);
>>>
>>> What about other config accesses, ex. shared INTx?  We need to
>>> interact with the device command and status register on an incoming
>>> interrupt to test if our device sent an interrupt and to mask it.  The
>>> unmask eventfd can also trigger config space accesses.  Seems
>>> incomplete relative to config space.
>>>
>>
>>  I will check this path thoroughly.
>>  But from initial analysis, it seems we have 2 paths here:
>>
>>  Most of the mentioned functions are called from
>>  vfio_pci_set_irqs_ioctl(), and pm_runtime_resume_and_get()
>>  should be called for this ioctl as well in this patch.
>>
>>  The second path is when we are inside the IRQ handler. For that,
>>  we need some other mechanism, which I explain below.
>>
>>>>       *ppos += done;
>>>>
>>>>       return done;
>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>>>> index 38440d48973f..b70bb4fd940d 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_core.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>>> @@ -371,12 +371,23 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>>>       lockdep_assert_held(&vdev->vdev.dev_set->lock);
>>>>
>>>>       /*
>>>> -      * If disable has been called while the power state is other than D0,
>>>> -      * then set the power state in vfio driver to D0. It will help
>>>> -      * in running the logic needed for D0 power state. The subsequent
>>>> -      * runtime PM API's will put the device into the low power state again.
>>>> +      * The vfio device user can close the device after putting the device
>>>> +      * into runtime suspended state so wake up the device first in
>>>> +      * this case.
>>>>        */
>>>> -     vfio_pci_set_power_state_locked(vdev, PCI_D0);
>>>> +     if (vdev->runtime_suspend_pending) {
>>>> +             vdev->runtime_suspend_pending = false;
>>>> +             pm_runtime_resume_and_get(&pdev->dev);
>>>
>>> Doesn't vdev->power_state become unsynchronized from the actual device
>>> state here and maybe elsewhere in this patch?  (I see below that maybe
>>> the resume handler accounts for this)
>>>
>>
>>  Yes. Inside the runtime resume handler, it is changed back to D0.
>>
>>>> +     } else {
>>>> +             /*
>>>> +              * If disable has been called while the power state is other
>>>> +              * than D0, then set the power state in vfio driver to D0. It
>>>> +              * will help in running the logic needed for D0 power state.
>>>> +              * The subsequent runtime PM API's will put the device into
>>>> +              * the low power state again.
>>>> +              */
>>>> +             vfio_pci_set_power_state_locked(vdev, PCI_D0);
>>>> +     }
>>>>
>>>>       /* Stop the device from further DMA */
>>>>       pci_clear_master(pdev);
>>>> @@ -693,8 +704,8 @@ int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev,
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(vfio_pci_register_dev_region);
>>>>
>>>> -long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>>> -             unsigned long arg)
>>>> +static long vfio_pci_core_ioctl_internal(struct vfio_device *core_vdev,
>>>> +                                      unsigned int cmd, unsigned long arg)
>>>>  {
>>>>       struct vfio_pci_core_device *vdev =
>>>>               container_of(core_vdev, struct vfio_pci_core_device, vdev);
>>>> @@ -1241,10 +1252,119 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>>>               default:
>>>>                       return -ENOTTY;
>>>>               }
>>>> +#ifdef CONFIG_PM
>>>> +     } else if (cmd == VFIO_DEVICE_POWER_MANAGEMENT) {
>>>
>>> I'd suggest using a DEVICE_FEATURE ioctl for this.  This ioctl doesn't
>>> follow the vfio standard of argsz/flags and doesn't seem to do anything
>>> special that we couldn't achieve with a DEVICE_FEATURE ioctl.
>>>
>>
>>  Sure. DEVICE_FEATURE can help for this.
>>
>>>> +             struct vfio_power_management vfio_pm;
>>>> +             struct pci_dev *pdev = vdev->pdev;
>>>> +             bool request_idle = false, request_resume = false;
>>>> +             int ret = 0;
>>>> +
>>>> +             if (copy_from_user(&vfio_pm, (void __user *)arg, sizeof(vfio_pm)))
>>>> +                     return -EFAULT;
>>>> +
>>>> +             /*
>>>> +              * The vdev power related fields are protected with memory_lock
>>>> +              * semaphore.
>>>> +              */
>>>> +             down_write(&vdev->memory_lock);
>>>> +             switch (vfio_pm.d3cold_state) {
>>>> +             case VFIO_DEVICE_D3COLD_STATE_ENTER:
>>>> +                     /*
>>>> +                      * For D3cold, the device should already in D3hot
>>>> +                      * state.
>>>> +                      */
>>>> +                     if (vdev->power_state < PCI_D3hot) {
>>>> +                             ret = EINVAL;
>>>> +                             break;
>>>> +                     }
>>>> +
>>>> +                     if (!vdev->runtime_suspend_pending) {
>>>> +                             vdev->runtime_suspend_pending = true;
>>>> +                             pm_runtime_put_noidle(&pdev->dev);
>>>> +                             request_idle = true;
>>>> +                     }
>>>
>>> If I call this multiple times, runtime_suspend_pending prevents it from
>>> doing anything, but what should the return value be in that case?  Same
>>> question for exit.
>>>
>>
>>  For entry, the user should not request moving the device to
>>  D3cold if it has already been requested, so we can return an
>>  error in this case. For exit, currently, in this patch, I am
>>  clearing runtime_suspend_pending if the wake-up is triggered
>>  from the host side (with lspci or some other command). In that
>>  case, the exit should not return an error. Should we add code to
>>  detect multiple calls and ensure that only one
>>  VFIO_DEVICE_D3COLD_STATE_ENTER/VFIO_DEVICE_D3COLD_STATE_EXIT can be called?
> 
> AIUI, the argument is that we can't re-enter d3cold w/o guest driver
> support, so if an lspci unknown to the device user were to wake the
> device, it seems the user would see arbitrarily different results
> when attempting to put the device to sleep again.
> 

 Sorry. I still didn't get this point.

 For the guest to put the device into D3cold, it follows 2 steps:

 1. Move the device from D0 to D3hot state by using the config register.
 2. Then use this IOCTL to move the device from D3hot to D3cold state.

 Now, on the guest side, if we run lspci, the behavior will be:

 1. If we call it before step 2, then the config space registers
    can still be read in D3hot.
 2. If we call it after step 2, then the guest OS should move the
    device into D0 first, read the config space, and then move the
    device back to D3cold with the above steps. In this process, the
    guest OS driver will be involved. This is the current behavior
    with a Linux guest OS.

 Now, on the host side, if we run lspci:

 1. If we call it before step 2, then the config space registers
    can still be read in D3hot.
 2. If we call it after step 2, then the D3cold to D0 transition
    will happen during runtime resume and the device will then be in
    D0 state. But if we add support for re-entering D3cold, as I
    mentioned below, then it will go back into D3cold state.

>>>> +
>>>> +                     break;
>>>> +
>>>> +             case VFIO_DEVICE_D3COLD_STATE_EXIT:
>>>> +                     /*
>>>> +                      * If the runtime resume has already been run, then
>>>> +                      * the device will be already in D0 state.
>>>> +                      */
>>>> +                     if (vdev->runtime_suspend_pending) {
>>>> +                             vdev->runtime_suspend_pending = false;
>>>> +                             pm_runtime_get_noresume(&pdev->dev);
>>>> +                             request_resume = true;
>>>> +                     }
>>>> +
>>>> +                     break;
>>>> +
>>>> +             default:
>>>> +                     ret = EINVAL;
>>>> +                     break;
>>>> +             }
>>>> +
>>>> +             up_write(&vdev->memory_lock);
>>>> +
>>>> +             /*
>>>> +              * Call the runtime PM API's without any lock. Inside vfio driver
>>>> +              * runtime suspend/resume, the locks can be acquired again.
>>>> +              */
>>>> +             if (request_idle)
>>>> +                     pm_request_idle(&pdev->dev);
>>>> +
>>>> +             if (request_resume)
>>>> +                     pm_runtime_resume(&pdev->dev);
>>>> +
>>>> +             return ret;
>>>> +#endif
>>>>       }
>>>>
>>>>       return -ENOTTY;
>>>>  }
>>>> +
>>>> +long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>>> +                      unsigned long arg)
>>>> +{
>>>> +#ifdef CONFIG_PM
>>>> +     struct vfio_pci_core_device *vdev =
>>>> +             container_of(core_vdev, struct vfio_pci_core_device, vdev);
>>>> +     struct device *dev = &vdev->pdev->dev;
>>>> +     bool skip_runtime_resume = false;
>>>> +     long ret;
>>>> +
>>>> +     /*
>>>> +      * The list of commands which are safe to execute when the PCI device
>>>> +      * is in D3cold state. In D3cold state, the PCI config or any other IO
>>>> +      * access won't work.
>>>> +      */
>>>> +     switch (cmd) {
>>>> +     case VFIO_DEVICE_POWER_MANAGEMENT:
>>>> +     case VFIO_DEVICE_GET_INFO:
>>>> +     case VFIO_DEVICE_FEATURE:
>>>> +             skip_runtime_resume = true;
>>>> +             break;
>>>
>>> How can we know that there won't be DEVICE_FEATURE calls that touch the
>>> device, the recently added migration via DEVICE_FEATURE does already.
>>> DEVICE_GET_INFO seems equally as prone to breaking via capabilities
>>> that could touch the device.  It seems easier to maintain and more
>>> consistent to the user interface if we simply define that any device
>>> access will resume the device.
>>
>>  In that case, we can resume the device for all cases without
>>  maintaining the safe list.
>>
>>> We need to do something about interrupts though.  Maybe we could error the user ioctl to set d3cold
>>> for devices running in INTx mode, but we also have numerous ways that
>>> the device could be resumed under the user, which might start
>>> triggering MSI/X interrupts?
>>>
>>
>>  All this resuming is mainly to prevent any malicious sequence.
>>  From the normal OS side, once the guest kernel has moved the
>>  device into D3cold, it should not do any config space access.
>>  Similarly, the hypervisor should not invoke any ioctl other than
>>  the one moving the device back into D0 while the device is in
>>  D3cold. But preventing the device from going into D3cold while
>>  any other ioctl or config space access is in progress is not easy,
>>  so incrementing the usage count before these accesses ensures
>>  that the device won't go into D3cold.
>>
>>  For interrupts, can an interrupt (both INTx and MSI/X) happen
>>  if the device is in D3cold?
> 
> The device itself shouldn't be generating interrupts and we don't share
> MSI interrupts between devices (afaik), but we do share INTx interrupts.
> 
>>  In D3cold, PME events are possible,
>>  and these events will resume the device first anyway. If
>>  interrupts are not possible, then can we somehow disable all
>>  interrupts before calling the runtime PM APIs to move the device
>>  into D3cold, and enable them again during runtime resume? We can
>>  wait for all in-flight interrupts to finish first. I am not sure
>>  if this is possible.
> 
> In the case of shared INTx, it's not just in-flight interrupts.
> Personally I wouldn't have an issue if we increment the usage counter
> when INTx is in use to simply avoid the issue, but does that invalidate
> the use case you're trying to enable?

 It should not invalidate the use case that I am trying to support.

 But incrementing the usage count for a device already in D3cold
 state will cause it to wake up. Wake-up from D3cold may take
 somewhere around 500 ms – 1500 ms (or sometimes more, since it
 depends upon the root port wake-up time). So, it will make the
 ISR time high. For the root port wake-up time, please refer to

 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ad9001f2f41198784b0423646450ba2cb24793a3

 where the root port alone can take 1100 ms on
 older platforms.
 
 
> Otherwise I think we'd need to
> remove and re-add the handler around d3cold.
> 
>>  Returning an error from the user ioctl that sets D3cold while
>>  interrupts are occurring needs some synchronization between the
>>  interrupt handler and the ioctl code, and calling runtime resume
>>  inside the interrupt handler may not be safe.
> 
> It's not a race condition to synchronize; it's simply that a shared
> INTx interrupt can occur at any time and we need to make sure we don't
> touch the device when that occurs, either by preventing d3cold and INTx
> in combination, removing the handler, or maybe adding a test in the
> handler to not touch the device - for either of the latter we need to
> be sure we're not risking introducing interrupt storms by being out of
> sync with the device state.
> 

 Adding a test to detect D3cold seems to be the better option in
 this case, but I am not sure about interrupt storms.

>>>> +
>>>> +     default:
>>>> +             break;
>>>> +     }
>>>> +
>>>> +     if (!skip_runtime_resume) {
>>>> +             ret = pm_runtime_resume_and_get(dev);
>>>> +             if (ret < 0)
>>>> +                     return ret;
>>>> +     }
>>>> +
>>>> +     ret = vfio_pci_core_ioctl_internal(core_vdev, cmd, arg);
>>>> +
>>>
>>> I'm not a fan of wrapping the main ioctl interface for power management
>>> like this.
>>>
>>
>>  We need to increment the usage count at entry and decrement it
>>  again at exit. Currently, in a lot of places, we are returning
>>  directly instead of falling through to the end of the function.
>>  If we need to get rid of the wrapper function, then I need to
>>  replace every 'return' with a 'goto' to the end of the function
>>  and return after decrementing the usage count. Will this be fine?
> 
> 
> Yes, I think that would be preferable.
> 
> 
>>>> +     if (!skip_runtime_resume)
>>>> +             pm_runtime_put(dev);
>>>> +
>>>> +     return ret;
>>>> +#else
>>>> +     return vfio_pci_core_ioctl_internal(core_vdev, cmd, arg);
>>>> +#endif
>>>> +}
>>>>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>>>>
>>>>  static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>>> @@ -1897,6 +2017,7 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>>>               return -EBUSY;
>>>>       }
>>>>
>>>> +     dev_set_drvdata(&pdev->dev, vdev);
>>>>       if (pci_is_root_bus(pdev->bus)) {
>>>>               ret = vfio_assign_device_set(&vdev->vdev, vdev);
>>>>       } else if (!pci_probe_reset_slot(pdev->slot)) {
>>>> @@ -1966,6 +2087,7 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>>>>               pm_runtime_get_noresume(&pdev->dev);
>>>>
>>>>       pm_runtime_forbid(&pdev->dev);
>>>> +     dev_set_drvdata(&pdev->dev, NULL);
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
>>>>
>>>> @@ -2219,11 +2341,61 @@ static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>>>>  #ifdef CONFIG_PM
>>>>  static int vfio_pci_core_runtime_suspend(struct device *dev)
>>>>  {
>>>> +     struct pci_dev *pdev = to_pci_dev(dev);
>>>> +     struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>>>> +
>>>> +     down_read(&vdev->memory_lock);
>>>> +
>>>> +     /*
>>>> +      * runtime_suspend_pending won't be set if there is no user of vfio pci
>>>> +      * device. In that case, return early and PCI core will take care of
>>>> +      * putting the device in the low power state.
>>>> +      */
>>>> +     if (!vdev->runtime_suspend_pending) {
>>>> +             up_read(&vdev->memory_lock);
>>>> +             return 0;
>>>> +     }
>>>
>>> Doesn't this also mean that idle, unused devices can at best sit in
>>> d3hot rather than d3cold?
>>>
>>
>>  Sorry. I didn't get this point.
>>
>>  For unused devices, the PCI core will move the device into D3cold directly.
> 
> Could you point out what path triggers that?  I inferred that this
> function would be called any time the usage count allows transition to
> d3cold and the above test would prevent the device entering d3cold
> unless the user requested it.
> 

 For PCI runtime suspend, there are 2 options:

 1. Don’t change the device power state from D0 in the driver
    runtime suspend callback. In this case, pci_pm_runtime_suspend()
    will handle everything.

    https://elixir.bootlin.com/linux/v5.17-rc8/source/drivers/pci/pci-driver.c#L1285

    For an unused device, runtime_suspend_pending will be false since
    it can only be set by the d3cold ioctl.

 2. For a used device, the device state will be changed to D3hot first
    with vfio_pm_config_write(). In this case, pci_pm_runtime_suspend()
    expects that all the handling has already been done by the driver;
    otherwise it will print a warning and return early:

    “PCI PM: State of device not saved by %pS”

    https://elixir.bootlin.com/linux/v5.17-rc8/source/drivers/pci/pci-driver.c#L1280

>>  For the used devices, the config space write happens first, before
>>  this ioctl is called, and the config space write moves the device
>>  into D3hot, so we need to do some manual handling here.
> 
> Why is it that a user owned device cannot re-enter d3cold without
> driver support, but an idle device does?  Simply because we expect to
> reset the device before returning it back to the host or exposing it to
> a user?  I'd expect that after d3cold->d0 we're essentially at a
> power-on state, which ideally would be similar to a post-reset state,
> so I don't follow how driver support factors in to re-entering d3cold.
> 

 In terms of the nvidia GPU, an idle unused device is equivalent to
 an uninitialized PCI device. In this case, no internal HW modules
 such as video memory will be initialized. So, it does not matter
 what we do with the device before that. It is fine to re-enter
 d3cold since the HW itself is not initialized. Once the device
 is owned by a user, the nvidia driver in the guest OS will run
 and initialize all the HW modules including video memory. Now,
 before removing the power, we need to make sure that the video
 memory comes back in the same state after resume as before
 suspending. 

 If we don’t keep the video memory in self-refresh state, then it is
 equivalent to the power-on state. But if we keep the video memory
 in self-refresh state, then it is different from the power-on state.

>>>> +
>>>> +     /*
>>>> +      * The runtime suspend will be called only if device is already at
>>>> +      * D3hot state. Now, change the device state from D3hot to D3cold by
>>>> +      * using platform power management. If setting of D3cold is not
>>>> +      * supported for the PCI device, then the device state will still be
>>>> +      * in D3hot state. The PCI core expects to save the PCI state, if
>>>> +      * driver runtime routine handles the power state management.
>>>> +      */
>>>> +     pci_save_state(pdev);
>>>> +     pci_platform_power_transition(pdev, PCI_D3cold);
>>>> +     up_read(&vdev->memory_lock);
>>>> +
>>>>       return 0;
>>>>  }
>>>>
>>>>  static int vfio_pci_core_runtime_resume(struct device *dev)
>>>>  {
>>>> +     struct pci_dev *pdev = to_pci_dev(dev);
>>>> +     struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>>>> +
>>>> +     down_write(&vdev->memory_lock);
>>>> +
>>>> +     /*
>>>> +      * The PCI core will move the device to D0 state before calling the
>>>> +      * driver runtime resume.
>>>> +      */
>>>> +     vfio_pci_set_power_state_locked(vdev, PCI_D0);
>>>
>>> Maybe this is where vdev->power_state is kept synchronized?
>>>
>>
>>  Yes. vdev->power_state will be changed here.
>>
>>>> +
>>>> +     /*
>>>> +      * Some PCI device needs the SW involvement before going to D3cold
>>>> +      * state again. So if there is any wake-up which is not triggered
>>>> +      * by the guest, then increase the usage count to prevent the
>>>> +      * second runtime suspend.
>>>> +      */
>>>
>>> Can you give examples of devices that need this and the reason they
>>> need this?  The interface is not terribly deterministic if a random
>>> unprivileged lspci on the host can move devices back to d3hot.
>>
>>  I am not sure about other device but this is happening for
>>  the nvidia GPU itself.
>>
>>  For the nvidia GPU, during runtime suspend, we keep the GPU video
>>  memory in self-refresh mode for high video memory usage. Each video
>>  memory self-refresh entry before D3cold requires nvidia SW
>>  involvement. Without the SW self-refresh sequence, it won't work.
> 
> 
> So we're exposing acpi power interfaces to turn a device off, which
> don't really turn the device off, but leaves it in some sort of
> low-power memory refresh state, rather than a fully off state as I had
> assumed above.  Does this suggest the host firmware ACPI has knowledge
> of the device and does different things?
> 

 I was trying to find a public document regarding this part, and
 it seems the following Windows document can help in providing some
 information related to this:

 https://docs.microsoft.com/en-us/windows-hardware/drivers/bringup/firmware-requirements-for-d3cold

 “Putting a device in D3cold does not necessarily mean that all
  sources of power to the device have been removed—it means only
  that the main power source, Vcc, is removed. The auxiliary power
  source, Vaux, might also be removed if it is not required for
  the wake logic”.
 
 So, generic self-refresh D3cold (on desktops) mainly relies on
 auxiliary power. For notebooks, we ask for some customization on
 the acpi power interfaces side to support video memory
 self-refresh. 

>>  Details regarding runtime suspend with self-refresh can be found in
>>
>>  https://download.nvidia.com/XFree86/Linux-x86_64/495.46/README/dynamicpowermanagement.html#VidMemThreshold
>>
>>  But if GPU video memory usage is low, then we turn off video memory
>>  and save all the allocations in system memory. In this case, SW
>>  involvement is not required.
> 
> Ok, so there's some heuristically determined vram usage where the
> driver favors suspend latency versus power savings and somehow keeps
> the device in this low-power, refresh state versus a fully off state.
> How unique is this behavior to NVIDIA devices?  It seems like we're
> trying to add d3cold, but special case it based on a device that might
> have a rather quirky d3cold behavior.  Is there something we can test
> about the state of the device to know which mode it's using? 

 Since vfio is a generic driver, testing the device mode here
 seems to be challenging.
 
> Is there something we can virtualize on the device to force the driver to use
> the higher latency, lower power d3cold mode that results in fewer
> restrictions?  Or maybe this is just common practice?
> 

 Yes. We can enforce this. But this option won’t be useful for modern
 use cases. Let’s assume we have 16GB of video memory usage; in that
 case, entry and exit will take a lot of time and make the feature
 unusable. Also, system memory will be limited on the guest
 side, so having enough system memory is again a challenge. 

>>> How useful is this implementation if a notice to the guest of a resumed
>>> device is TBD?  Thanks,
>>>
>>> Alex
>>>
>>
>>  I have prototyped this earlier by using an eventfd_ctx for PME: whenever we get
>>  a resume triggered by the host, it will be forwarded to the hypervisor.
>>  The hypervisor can then write into the virtual root port PME related registers
>>  and send a PME event which will wake up the PCI device on the guest side.
>>  It will also help in handling PME-related wake-ups, which are currently
>>  disabled in PATCH 2 of this patch series.
> 
> But then what does the guest do with the device?  For example, if we
> have a VM with an assigned GPU running an idle desktop where the
> monitor has gone into power save, does running lspci on the host
> randomly wake the desktop and monitor?

 For Linux OS + the NVIDIA driver, it seems it will just wake the
 GPU up and not the monitor. With a bare-metal setup, I waited
 for the monitor to go off with DPMS and then the GPU went into
 the suspended state. After that, if I ran the lspci command,
 the GPU moved to the active state but the monitor was
 still in the off state, and after lspci, the GPU went
 into the suspended state again. 

 The monitor wakes up only if I do keyboard or mouse
 movement.

> I'd like to understand how
> unique the return to d3cold behavior is to this device and whether we
> can restrict that in some way.  An option that's now at our disposal
> would be to create an NVIDIA GPU variant of vfio-pci that has
> sufficient device knowledge to perhaps retrigger the vram refresh
> d3cold state rather than lose vram data going into a standard d3cold
> state.  Thanks,
> 
> Alex
> 

 Adding a vram-refresh d3cold state with a vfio-pci variant is not
 straightforward without the involvement of the nvidia driver itself. 

 One option is to add a flag in the D3cold IOCTL itself to differentiate
 between 2 variants of D3cold entry (one which allows re-entering
 D3cold and another which won’t allow re-entering D3cold) and
 make re-entering D3cold the default. For nvidia or similar use
 cases, the hypervisor can set this flag to prevent re-entering D3cold.

 Otherwise, we can add an NVIDIA vendor ID check and restrict this
 to nvidia alone. 

 Thanks,
 Abhishek


* Re: [RFC PATCH v2 5/5] vfio/pci: add the support for PCI D3cold state
  2022-03-16  5:41         ` Abhishek Sahu
@ 2022-03-16 18:44           ` Alex Williamson
  2022-03-24 14:27             ` Abhishek Sahu
  0 siblings, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2022-03-16 18:44 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: kvm, Cornelia Huck, Max Gurtovoy, Yishai Hadas, Zhen Lei,
	Jason Gunthorpe, linux-kernel

On Wed, 16 Mar 2022 11:11:04 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 3/12/2022 4:36 AM, Alex Williamson wrote:
> > On Fri, 11 Mar 2022 21:15:38 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> On 3/9/2022 10:56 PM, Alex Williamson wrote:  
> >>> On Mon, 24 Jan 2022 23:47:26 +0530
> >>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
...
> >>>> +             struct vfio_power_management vfio_pm;
> >>>> +             struct pci_dev *pdev = vdev->pdev;
> >>>> +             bool request_idle = false, request_resume = false;
> >>>> +             int ret = 0;
> >>>> +
> >>>> +             if (copy_from_user(&vfio_pm, (void __user *)arg, sizeof(vfio_pm)))
> >>>> +                     return -EFAULT;
> >>>> +
> >>>> +             /*
> >>>> +              * The vdev power related fields are protected with memory_lock
> >>>> +              * semaphore.
> >>>> +              */
> >>>> +             down_write(&vdev->memory_lock);
> >>>> +             switch (vfio_pm.d3cold_state) {
> >>>> +             case VFIO_DEVICE_D3COLD_STATE_ENTER:
> >>>> +                     /*
> >>>> +                      * For D3cold, the device should already in D3hot
> >>>> +                      * state.
> >>>> +                      */
> >>>> +                     if (vdev->power_state < PCI_D3hot) {
> >>>> +                             ret = EINVAL;
> >>>> +                             break;
> >>>> +                     }
> >>>> +
> >>>> +                     if (!vdev->runtime_suspend_pending) {
> >>>> +                             vdev->runtime_suspend_pending = true;
> >>>> +                             pm_runtime_put_noidle(&pdev->dev);
> >>>> +                             request_idle = true;
> >>>> +                     }  
> >>>
> >>> If I call this multiple times, runtime_suspend_pending prevents it from
> >>> doing anything, but what should the return value be in that case?  Same
> >>> question for exit.
> >>>  
> >>
> >>  For entry, the user should not request moving the device to D3cold if it has
> >>  already been requested. So, we can return an error in this case. For exit,
> >>  currently, in this patch, I am clearing runtime_suspend_pending if the
> >>  wake-up is triggered from the host side (with lspci or some other command).
> >>  In that case, the exit should not return an error. Should we add code to
> >>  detect multiple calls of these and ensure only one
> >>  VFIO_DEVICE_D3COLD_STATE_ENTER/VFIO_DEVICE_D3COLD_STATE_EXIT can be called?  
> > 
> > AIUI, the argument is that we can't re-enter d3cold w/o guest driver
> > support, so if an lspci which was unknown to have occurred by the
> > device user were to wake the device, it seems the user would see
> > arbitrarily different results attempting to put the device to sleep
> > again.
> >   
> 
>  Sorry. I still didn't get this point.
> 
>  For the guest to go into D3cold, it will follow 2 steps:
> 
>  1. Move the device from D0 to the D3hot state by using the config register.
>  2. Then use this IOCTL to move from the D3hot state to the D3cold state.
> 
> Now, on the guest side, if we run lspci, the following will be the behavior:
> 
>  1. If we call it before step 2, then the config space registers
>     can still be read in D3hot.
>  2. If we call it after step 2, then the guest OS should move the
>     device into D0 first, read the config space, and then
>     move the device to D3cold again with the above steps.
>     In this process, the guest OS driver will be involved.
>     This is the current behavior with a Linux guest OS. 
> 
>  Now, on the host side, if we run lspci:
> 
>  1. If we call it before step 2, then the config space registers can
>     still be read in D3hot.
>  2. If we call it after step 2, then the D3cold to D0 transition will
>     happen in runtime resume, and the device will then be in the D0
>     state. But if we add support to allow re-entering D3cold, as I
>     mentioned below, then it will go into the D3cold state again. 

I was speculating about the latter scenario mechanics.  If the user has
already called STATE_ENTER for d3cold, should a subsequent STATE_ENTER
for d3cold generate an error?  Likewise should STATE_EXIT generate an
error if the device was not previously placed in d3cold?  But then any
host access that triggers vfio_pci_core_runtime_resume() is effectively
the same as a STATE_EXIT, which may be unknown to the user.  So if we
had decided to generate errors on duplicate STATE_ENTER/EXIT calls, the
user's state model is broken by the arbitrary host activity and either
way the device is no longer in the user requested state and the user
receives no notification of this.

...
> >>>> +long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> >>>> +                      unsigned long arg)
> >>>> +{
> >>>> +#ifdef CONFIG_PM
> >>>> +     struct vfio_pci_core_device *vdev =
> >>>> +             container_of(core_vdev, struct vfio_pci_core_device, vdev);
> >>>> +     struct device *dev = &vdev->pdev->dev;
> >>>> +     bool skip_runtime_resume = false;
> >>>> +     long ret;
> >>>> +
> >>>> +     /*
> >>>> +      * The list of commands which are safe to execute when the PCI device
> >>>> +      * is in D3cold state. In D3cold state, the PCI config or any other IO
> >>>> +      * access won't work.
> >>>> +      */
> >>>> +     switch (cmd) {
> >>>> +     case VFIO_DEVICE_POWER_MANAGEMENT:
> >>>> +     case VFIO_DEVICE_GET_INFO:
> >>>> +     case VFIO_DEVICE_FEATURE:
> >>>> +             skip_runtime_resume = true;
> >>>> +             break;  
> >>>
> >>> How can we know that there won't be DEVICE_FEATURE calls that touch the
> >>> device, the recently added migration via DEVICE_FEATURE does already.
> >>> DEVICE_GET_INFO seems equally as prone to breaking via capabilities
> >>> that could touch the device.  It seems easier to maintain and more
> >>> consistent to the user interface if we simply define that any device
> >>> access will resume the device.  
> >>
> >>  In that case, we can resume the device for all case without
> >>  maintaining the safe list.
> >>  
> >>> We need to do something about interrupts though.  Maybe we could
> >>> error the user ioctl to set d3cold
> >>> for devices running in INTx mode, but we also have numerous ways that
> >>> the device could be resumed under the user, which might start
> >>> triggering MSI/X interrupts?
> >>>  
> >>
> >>  All this resuming is mainly to prevent any malicious sequence.
> >>  Seen from the normal OS side, once the guest kernel has moved
> >>  the device into D3cold, it should not do any config space
> >>  access. Similarly, the hypervisor should not invoke any
> >>  ioctl other than moving the device into D0 again while the device
> >>  is in D3cold. But preventing the device from going into D3cold while
> >>  any other ioctl or config space access is happening is not easy,
> >>  so incrementing the usage count before these accesses will ensure
> >>  that the device won't go into D3cold.
> >>
> >>  For interrupts, can an interrupt happen (both INTx and MSI/X)
> >>  if the device is in D3cold?  
> > 
> > The device itself shouldn't be generating interrupts and we don't share
> > MSI interrupts between devices (afaik), but we do share INTx interrupts.
> >   
> >>  In D3cold, PME events are possible,
> >>  and these events will anyway resume the device first. If
> >>  interrupts are not possible, then can we somehow disable all
> >>  interrupts before calling the runtime PM APIs to move the device into
> >>  D3cold and enable them again during runtime resume? We can wait for all
> >>  existing interrupts to finish first. I am not sure if this is possible.  
> > 
> > In the case of shared INTx, it's not just inflight interrupts.
> > Personally I wouldn't have an issue if we increment the usage counter
> > when INTx is in use to simply avoid the issue, but does that invalidate
> > the use case you're trying to enable?  
> 
>  It should not invalidate the use case which I am trying to support.
> 
>  But incrementing the usage count for a device already in the D3cold
>  state will cause it to wake up. Wake-up from D3cold may take
>  somewhere around 500 ms – 1500 ms (or sometimes more than that, since
>  it depends upon the root port wake-up time). So, it will make the
>  ISR time high. For the root port wake-up time, please refer to
> 
>  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ad9001f2f41198784b0423646450ba2cb24793a3
> 
>  where it can take 1100 ms for the root port alone on
>  older platforms. 


Configuring interrupts on the device requires it to be in D0, there's
no case I can imagine where we're incrementing the usage counter for
the purpose of setting INTx where the device is not already in D0.  I'm
certainly not suggesting incrementing the usage counter from within the
interrupt handler.


> > Otherwise I think we'd need to
> > remove and re-add the handler around d3cold.
> >   
> >>  Returning error for user ioctl to set d3cold while interrupts are
> >>  happening needs some synchronization at both interrupt handler and
> >>  ioctl code and using runtime resume inside interrupt handler
> >>  may not be safe.  
> > 
> > It's not a race condition to synchronize, it's simply that a shared
> > INTX interrupt can occur any time and we need to make sure we don't
> > touch the device when that occurs, either by preventing d3cold and INTx
> > in combination, removing the handler, or maybe adding a test in the
> > handler to not touch the device - either of the latter we need to be
> > sure we're not risking introducing interrupt storms by being out of
> > sync with the device state.
> >   
> 
>  Adding a test to detect D3cold seems to be the better option in
>  this case, but I am not sure about interrupt storms.

This seems to be another case where the device power state being out of
sync from the user is troublesome.  For instance, if an arbitrary host
access to the device wakes it to D0, it could theoretically trigger
device interrupts.  Is a guest prepared to handle interrupts from a
device that it's put in D3cold and not known to have been woken?

...
> >>>> @@ -2219,11 +2341,61 @@ static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
> >>>>  #ifdef CONFIG_PM
> >>>>  static int vfio_pci_core_runtime_suspend(struct device *dev)
> >>>>  {
> >>>> +     struct pci_dev *pdev = to_pci_dev(dev);
> >>>> +     struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> >>>> +
> >>>> +     down_read(&vdev->memory_lock);
> >>>> +
> >>>> +     /*
> >>>> +      * runtime_suspend_pending won't be set if there is no user of vfio pci
> >>>> +      * device. In that case, return early and PCI core will take care of
> >>>> +      * putting the device in the low power state.
> >>>> +      */
> >>>> +     if (!vdev->runtime_suspend_pending) {
> >>>> +             up_read(&vdev->memory_lock);
> >>>> +             return 0;
> >>>> +     }  
> >>>
> >>> Doesn't this also mean that idle, unused devices can at best sit in
> >>> d3hot rather than d3cold?
> >>>  
> >>
> >>  Sorry. I didn't get this point.
> >>
> >>  For unused devices, the PCI core will move the device into D3cold directly.  
> > 
> > Could you point out what path triggers that?  I inferred that this
> > function would be called any time the usage count allows transition to
> > d3cold and the above test would prevent the device entering d3cold
> > unless the user requested it.
> >   
> 
>  For PCI runtime suspend, there are 2 options:
> 
>  1. Don’t change the device power state from D0 in the driver
>     runtime suspend callback. In this case, pci_pm_runtime_suspend()
>     will handle everything.
> 
>     https://elixir.bootlin.com/linux/v5.17-rc8/source/drivers/pci/pci-driver.c#L1285
> 
>     For an unused device, runtime_suspend_pending will be false since
>     it can only be set by the d3cold ioctl.

So our runtime_suspend callback is not gating putting the device into
d3cold; we effectively do the same thing either way. It's only
protected by the memory_lock in the case that the user has requested
it.  Using runtime_suspend_pending here seems a bit misleading since
theoretically we'd want to hold memory_lock in any case of getting to
the runtime_suspend callback while the device is opened.

>  2. For a used device, the device state will be changed to D3hot first
>     with vfio_pm_config_write(). In this case, pci_pm_runtime_suspend()
>     expects that all the handling has already been done by the driver;
>     otherwise it will print a warning and return early:
> 
>     “PCI PM: State of device not saved by %pS”
> 
>     https://elixir.bootlin.com/linux/v5.17-rc8/source/drivers/pci/pci-driver.c#L1280
> 
> >>  For the used devices, the config space write happens first, before
> >>  this ioctl is called, and the config space write moves the device
> >>  into D3hot, so we need to do some manual handling here.  
> > 
> > Why is it that a user owned device cannot re-enter d3cold without
> > driver support, but an idle device does?  Simply because we expect to
> > reset the device before returning it back to the host or exposing it to
> > a user?  I'd expect that after d3cold->d0 we're essentially at a
> > power-on state, which ideally would be similar to a post-reset state,
> > so I don't follow how driver support factors in to re-entering d3cold.
> >   
> 
>  In terms of the nvidia GPU, an idle unused device is equivalent to
>  an uninitialized PCI device. In this case, no internal HW modules
>  such as video memory will be initialized. So, it does not matter
>  what we do with the device before that. It is fine to re-enter
>  d3cold since the HW itself is not initialized. Once the device
>  is owned by a user, the nvidia driver in the guest OS will run
>  and initialize all the HW modules including video memory. Now,
>  before removing the power, we need to make sure that the video
>  memory comes back in the same state after resume as before
>  suspending. 
> 
>  If we don’t keep the video memory in self-refresh state, then it is
>  equivalent to the power-on state. But if we keep the video memory
>  in self-refresh state, then it is different from the power-on state.
> 
> >>>> +
> >>>> +     /*
> >>>> +      * The runtime suspend will be called only if device is already at
> >>>> +      * D3hot state. Now, change the device state from D3hot to D3cold by
> >>>> +      * using platform power management. If setting of D3cold is not
> >>>> +      * supported for the PCI device, then the device state will still be
> >>>> +      * in D3hot state. The PCI core expects to save the PCI state, if
> >>>> +      * driver runtime routine handles the power state management.
> >>>> +      */
> >>>> +     pci_save_state(pdev);
> >>>> +     pci_platform_power_transition(pdev, PCI_D3cold);
> >>>> +     up_read(&vdev->memory_lock);
> >>>> +
> >>>>       return 0;
> >>>>  }
> >>>>
> >>>>  static int vfio_pci_core_runtime_resume(struct device *dev)
> >>>>  {
> >>>> +     struct pci_dev *pdev = to_pci_dev(dev);
> >>>> +     struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> >>>> +
> >>>> +     down_write(&vdev->memory_lock);
> >>>> +
> >>>> +     /*
> >>>> +      * The PCI core will move the device to D0 state before calling the
> >>>> +      * driver runtime resume.
> >>>> +      */
> >>>> +     vfio_pci_set_power_state_locked(vdev, PCI_D0);  
> >>>
> >>> Maybe this is where vdev->power_state is kept synchronized?
> >>>  
> >>
> >>  Yes. vdev->power_state will be changed here.
> >>  
> >>>> +
> >>>> +     /*
> >>>> +      * Some PCI device needs the SW involvement before going to D3cold
> >>>> +      * state again. So if there is any wake-up which is not triggered
> >>>> +      * by the guest, then increase the usage count to prevent the
> >>>> +      * second runtime suspend.
> >>>> +      */  
> >>>
> >>> Can you give examples of devices that need this and the reason they
> >>> need this?  The interface is not terribly deterministic if a random
> >>> unprivileged lspci on the host can move devices back to d3hot.  
> >>
> >>  I am not sure about other device but this is happening for
> >>  the nvidia GPU itself.
> >>
> >>  For the nvidia GPU, during runtime suspend, we keep the GPU video
> >>  memory in self-refresh mode for high video memory usage. Each video
> >>  memory self-refresh entry before D3cold requires nvidia SW
> >>  involvement. Without the SW self-refresh sequence, it won't work.  
> > 
> > 
> > So we're exposing acpi power interfaces to turn a device off, which
> > don't really turn the device off, but leaves it in some sort of
> > low-power memory refresh state, rather than a fully off state as I had
> > assumed above.  Does this suggest the host firmware ACPI has knowledge
> > of the device and does different things?
> >   
> 
>  I was trying to find a public document regarding this part, and
>  it seems the following Windows document can help in providing some
>  information related to this:
> 
>  https://docs.microsoft.com/en-us/windows-hardware/drivers/bringup/firmware-requirements-for-d3cold
> 
>  “Putting a device in D3cold does not necessarily mean that all
>   sources of power to the device have been removed—it means only
>   that the main power source, Vcc, is removed. The auxiliary power
>   source, Vaux, might also be removed if it is not required for
>   the wake logic”.
>  
>  So, generic self-refresh D3cold (on desktops) mainly relies on
>  auxiliary power. For notebooks, we ask for some customization on
>  the acpi power interfaces side to support video memory
>  self-refresh. 

And that customization must rely on some aspect of the GPU state,
right?  We send the GPU to d3cold the first time and we get this memory
self-refresh behavior, but the claim here is that if we wake the device
to d0 and send it back to d3cold that video memory will be lost.  So
the variable here has something to do with the device state itself.
Therefore is there some device register that could be preserved and
restored around d3cold so that we could go back into the self-refresh
state?


> >>  Details regarding runtime suspend with self-refresh can be found in
> >>
> >>  https://download.nvidia.com/XFree86/Linux-x86_64/495.46/README/dynamicpowermanagement.html#VidMemThreshold
> >>
> >>  But if GPU video memory usage is low, then we turn off video memory
> >>  and save all the allocations in system memory. In this case, SW
> >>  involvement is not required.  
> > 
> > Ok, so there's some heuristically determined vram usage where the
> > driver favors suspend latency versus power savings and somehow keeps
> > the device in this low-power, refresh state versus a fully off state.
> > How unique is this behavior to NVIDIA devices?  It seems like we're
> > trying to add d3cold, but special case it based on a device that might
> > have a rather quirky d3cold behavior.  Is there something we can test
> > about the state of the device to know which mode it's using?   
> 
>  Since vfio is a generic driver, testing the device mode here
>  seems to be challenging.
>  
> > Is there something we can virtualize on the device to force the driver to use
> > the higher latency, lower power d3cold mode that results in fewer
> > restrictions?  Or maybe this is just common practice?
> >   
> 
>  Yes. We can enforce this. But this option won’t be useful for modern
>  use cases. Let’s assume we have 16GB of video memory usage; in that
>  case, entry and exit will take a lot of time and make the feature
>  unusable. Also, system memory will be limited on the guest
>  side, so having enough system memory is again a challenge. 

Good point, the potential extent of video memory is too excessive to
not support a self-refresh mode.

> >>> How useful is this implementation if a notice to the guest of a resumed
> >>> device is TBD?  Thanks,
> >>>
> >>> Alex
> >>>  
> >>
> >>  I have prototyped this earlier by using an eventfd_ctx for PME: whenever we get
> >>  a resume triggered by the host, it will be forwarded to the hypervisor.
> >>  The hypervisor can then write into the virtual root port PME related registers
> >>  and send a PME event which will wake up the PCI device on the guest side.
> >>  It will also help in handling PME-related wake-ups, which are currently
> >>  disabled in PATCH 2 of this patch series.  
> > 
> > But then what does the guest do with the device?  For example, if we
> > have a VM with an assigned GPU running an idle desktop where the
> > monitor has gone into power save, does running lspci on the host
> > randomly wake the desktop and monitor?  
> 
>  For Linux OS + the NVIDIA driver, it seems it will just wake the
>  GPU up and not the monitor. With a bare-metal setup, I waited
>  for the monitor to go off with DPMS and then the GPU went into
>  the suspended state. After that, if I ran the lspci command,
>  the GPU moved to the active state but the monitor was
>  still in the off state, and after lspci, the GPU went
>  into the suspended state again. 
> 
>  The monitor wakes up only if I do keyboard or mouse
>  movement.

The monitor waking would clearly be a user visible sign that this
doesn't work according to plan, but the fact that the GPU is awake
and consuming power, wasting battery on a mobile platform, still
seems like a symptom that this solution is incomplete.

> > I'd like to understand how
> > unique the return to d3cold behavior is to this device and whether we
> > can restrict that in some way.  An option that's now at our disposal
> > would be to create an NVIDIA GPU variant of vfio-pci that has
> > sufficient device knowledge to perhaps retrigger the vram refresh
> > d3cold state rather than lose vram data going into a standard d3cold
> > state.  Thanks,
> > 
> > Alex
> >   
> 
>  Adding a vram refresh d3cold state with a vfio-pci variant is not
>  straightforward without involvement of the nvidia driver itself.
> 
>  One option is to add a flag in the D3cold IOCTL itself to differentiate
>  between 2 variants of D3cold entry (one which allows re-entering
>  D3cold and another which won’t allow re-entering D3cold) and
>  default it to allowing re-entry. For the nvidia or similar use
>  cases, the hypervisor can set this flag to prevent re-entering D3cold.

QEMU doesn't know the hardware behavior either.
 
>  Otherwise, we can add NVIDIA vendor ID check and restrict this
>  to nvidia alone. 

Either of these solutions presumes there's a worthwhile use case
regardless of the fact that the GPU can be woken by arbitrary,
unprivileged actions on the host.  It seems that either we should be
able to put the device back into a low power state ourselves after such
an event or be able to trigger an eventfd to the user which is plumbed
through pme in the guest so that the guest can put the device back to
low power after such an event.  Getting the device into a transient low
power state that it can slip out of so easily doesn't seem like a
complete solution to me.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 5/5] vfio/pci: add the support for PCI D3cold state
  2022-03-16 18:44           ` Alex Williamson
@ 2022-03-24 14:27             ` Abhishek Sahu
  0 siblings, 0 replies; 21+ messages in thread
From: Abhishek Sahu @ 2022-03-24 14:27 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, Cornelia Huck, Max Gurtovoy, Yishai Hadas, Zhen Lei,
	Jason Gunthorpe, linux-kernel

On 3/17/2022 12:14 AM, Alex Williamson wrote:
> On Wed, 16 Mar 2022 11:11:04 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 3/12/2022 4:36 AM, Alex Williamson wrote:
>>> On Fri, 11 Mar 2022 21:15:38 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>
>>>> On 3/9/2022 10:56 PM, Alex Williamson wrote:
>>>>> On Mon, 24 Jan 2022 23:47:26 +0530
>>>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> ...
>>>>>> +             struct vfio_power_management vfio_pm;
>>>>>> +             struct pci_dev *pdev = vdev->pdev;
>>>>>> +             bool request_idle = false, request_resume = false;
>>>>>> +             int ret = 0;
>>>>>> +
>>>>>> +             if (copy_from_user(&vfio_pm, (void __user *)arg, sizeof(vfio_pm)))
>>>>>> +                     return -EFAULT;
>>>>>> +
>>>>>> +             /*
>>>>>> +              * The vdev power related fields are protected with memory_lock
>>>>>> +              * semaphore.
>>>>>> +              */
>>>>>> +             down_write(&vdev->memory_lock);
>>>>>> +             switch (vfio_pm.d3cold_state) {
>>>>>> +             case VFIO_DEVICE_D3COLD_STATE_ENTER:
>>>>>> +                     /*
>>>>>> +                      * For D3cold, the device should already in D3hot
>>>>>> +                      * state.
>>>>>> +                      */
>>>>>> +                     if (vdev->power_state < PCI_D3hot) {
>>>>>> +                             ret = -EINVAL;
>>>>>> +                             break;
>>>>>> +                     }
>>>>>> +
>>>>>> +                     if (!vdev->runtime_suspend_pending) {
>>>>>> +                             vdev->runtime_suspend_pending = true;
>>>>>> +                             pm_runtime_put_noidle(&pdev->dev);
>>>>>> +                             request_idle = true;
>>>>>> +                     }
>>>>>
>>>>> If I call this multiple times, runtime_suspend_pending prevents it from
>>>>> doing anything, but what should the return value be in that case?  Same
>>>>> question for exit.
>>>>>
>>>>
>>>>  For entry, the user should not request moving the device to D3cold if it
>>>>  has already requested that. So, we can return an error in this case. For
>>>>  exit, currently, in this patch, I am clearing runtime_suspend_pending if
>>>>  the wake-up is triggered from the host side (with lspci or some other
>>>>  command). In that case, the exit should not return an error. Should we
>>>>  add code to detect multiple invocations and ensure only one
>>>>  VFIO_DEVICE_D3COLD_STATE_ENTER/VFIO_DEVICE_D3COLD_STATE_EXIT can be called?
>>>
>>> AIUI, the argument is that we can't re-enter d3cold w/o guest driver
>>> support, so if an lspci which was unknown to have occurred by the
>>> device user were to wake the device, it seems the user would see
>>> arbitrarily different results attempting to put the device to sleep
>>> again.
>>>
>>
>>  Sorry. I still didn't get this point.
>>
>>  For guest to go into D3cold, it will follow 2 steps
>>
>>  1. Move the device from D0 to D3hot state by using config register.
>>  2. Then use this IOCTL to move D3hot state to D3cold state.
>>
>> Now, on the guest side, if we run lspci, the following will be the behavior:
>>
>>  1. If we call it before step 2, then the config space register
>>     can still be read in D3hot.
>>  2. If we call it after step 2, then the guest os should move the
>>     device into D0 first, read the config space and then again,
>>     the guest os should move the device to D3cold with the
>>     above steps. In this process, the guest OS driver will be involved.
>>     This is current behavior with Linux guest OS.
>>
>>  Now, on the host side, if we run lspci,
>>
>>  1. If we call it before step 2, then the config space register can
>>     still be read in D3hot.
>>  2. If we call it after step 2, then the D3cold to D0 transition will
>>     happen in the runtime resume and then it will be in the D0 state.
>>     But if we add support to allow re-entering D3cold again, as I
>>     mentioned below, then it will again go into the D3cold state.
> 
> I was speculating about the latter scenario mechanics.  If the user has
> already called STATE_ENTER for d3cold, should a subsequent STATE_ENTER
> for d3cold generate an error?  Likewise should STATE_EXIT generate an
> error if the device was not previously placed in d3cold?  But then any
> host access that triggers vfio_pci_core_runtime_resume() is effectively
> the same as a STATE_EXIT, which may be unknown to the user.  So if we
> had decided to generate errors on duplicate STATE_ENTER/EXIT calls, the
> user's state model is broken by the arbitrary host activity and either
> way the device is no longer in the user requested state and the user
> receives no notification of this.
> 

 I will check on this part and explore how we can do error
 handling in the presence of arbitrary host activity.
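 For illustration, the duplicate-call semantics being discussed could be
 modeled like this (plain user-space C, not the kernel implementation;
 the struct and the error values below only mirror the patch):

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative model of VFIO_DEVICE_D3COLD_STATE_ENTER/EXIT handling.
 * runtime_suspend_pending mirrors the field in the patch; everything
 * else is a stand-in for the real vfio-pci code. */
struct d3cold_model {
	bool runtime_suspend_pending;
};

static int d3cold_enter(struct d3cold_model *m)
{
	if (m->runtime_suspend_pending)
		return -EINVAL;	/* duplicate STATE_ENTER */
	m->runtime_suspend_pending = true;
	return 0;
}

static int d3cold_exit(struct d3cold_model *m)
{
	/* A host-triggered resume may have cleared the flag already,
	 * which is why erroring here can surprise the user. */
	if (!m->runtime_suspend_pending)
		return -EINVAL;
	m->runtime_suspend_pending = false;
	return 0;
}
```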

> ...
>>>>>> +long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>>>>> +                      unsigned long arg)
>>>>>> +{
>>>>>> +#ifdef CONFIG_PM
>>>>>> +     struct vfio_pci_core_device *vdev =
>>>>>> +             container_of(core_vdev, struct vfio_pci_core_device, vdev);
>>>>>> +     struct device *dev = &vdev->pdev->dev;
>>>>>> +     bool skip_runtime_resume = false;
>>>>>> +     long ret;
>>>>>> +
>>>>>> +     /*
>>>>>> +      * The list of commands which are safe to execute when the PCI device
>>>>>> +      * is in D3cold state. In D3cold state, the PCI config or any other IO
>>>>>> +      * access won't work.
>>>>>> +      */
>>>>>> +     switch (cmd) {
>>>>>> +     case VFIO_DEVICE_POWER_MANAGEMENT:
>>>>>> +     case VFIO_DEVICE_GET_INFO:
>>>>>> +     case VFIO_DEVICE_FEATURE:
>>>>>> +             skip_runtime_resume = true;
>>>>>> +             break;
>>>>>
>>>>> How can we know that there won't be DEVICE_FEATURE calls that touch the
>>>>> device, the recently added migration via DEVICE_FEATURE does already.
>>>>> DEVICE_GET_INFO seems equally as prone to breaking via capabilities
>>>>> that could touch the device.  It seems easier to maintain and more
>>>>> consistent to the user interface if we simply define that any device
>>>>> access will resume the device.
>>>>
>>>>  In that case, we can resume the device in all cases without
>>>>  maintaining the safe list.
>>>>
>>>>> We need to do something about interrupts though.  Maybe we could
>>>>> error the user ioctl to set d3cold
>>>>> for devices running in INTx mode, but we also have numerous ways that
>>>>> the device could be resumed under the user, which might start
>>>>> triggering MSI/X interrupts?
>>>>>
>>>>
>>>>  All the resuming is mainly to prevent any malicious sequence.
>>>>  From the normal OS side, once the guest kernel has moved the
>>>>  device into D3cold, it should not do any config space access.
>>>>  Similarly, the hypervisor should not invoke any ioctl other
>>>>  than moving the device into D0 again while the device is in
>>>>  D3cold. But preventing the device from going into D3cold while
>>>>  any other ioctl or config space access is happening is not easy,
>>>>  so incrementing the usage count before these accesses will
>>>>  ensure that the device won't go into D3cold.
>>>>
>>>>  For interrupts, can an interrupt happen (both INTx and MSI/X)
>>>>  if the device is in D3cold?
>>>
>>> The device itself shouldn't be generating interrupts and we don't share
>>> MSI interrupts between devices (afaik), but we do share INTx interrupts.
>>>
>>>>  In D3cold, PME events are possible
>>>>  and these events will anyway resume the device first. If
>>>>  interrupts are not possible, then can we somehow disable all
>>>>  interrupts before calling the runtime PM APIs to move the device
>>>>  into D3cold and enable them again during runtime resume? We can
>>>>  wait for all in-flight interrupts to finish first. I am not sure
>>>>  if this is possible.
>>>
>>> In the case of shared INTx, it's not just inflight interrupts.
>>> Personally I wouldn't have an issue if we increment the usage counter
>>> when INTx is in use to simply avoid the issue, but does that invalidate
>>> the use case you're trying to enable?
>>
>>  It should not invalidate the use case which I am trying to support.
>>
>>  But incrementing the usage count for a device already in the D3cold
>>  state will cause it to wake up. Wake-up from D3cold may take
>>  somewhere around 500 ms – 1500 ms (or sometimes more, since
>>  it depends upon the root port wake-up time). So, it will make the
>>  ISR time high. For the root port wake-up time, please refer to
>>
>>  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ad9001f2f41198784b0423646450ba2cb24793a3
>>
>>  where it can take 1100ms alone for the root port on
>>  older platforms.
> 
> 
> Configuring interrupts on the device requires it to be in D0, there's
> no case I can imagine where we're incrementing the usage counter for
> the purpose of setting INTx where the device is not already in D0.  I'm
> certainly not suggesting incrementing the usage counter from within the
> interrupt handler.
> 
> 
>>> Otherwise I think we'd need to
>>> remove and re-add the handler around d3cold.
>>>
>>>>  Returning an error for the user ioctl to set d3cold while interrupts
>>>>  are happening needs some synchronization in both the interrupt handler
>>>>  and the ioctl code, and using runtime resume inside the interrupt
>>>>  handler may not be safe.
>>>
>>> It's not a race condition to synchronize, it's simply that a shared
>>> INTX interrupt can occur any time and we need to make sure we don't
>>> touch the device when that occurs, either by preventing d3cold and INTx
>>> in combination, removing the handler, or maybe adding a test in the
>>> handler to not touch the device - either of the latter we need to be
>>> sure we're not risking introducing interrupts storms by being out of
>>> sync with the device state.
>>>
>>
>>  Adding a test to detect D3cold seems to be the better option in
>>  this case, but I am not sure about interrupt storms.
> 
> This seems to be another case where the device power state being out of
> sync from the user is troublesome.  For instance, if an arbitrary host
> access to the device wakes it to D0, it could theoretically trigger
> device interrupts.  Is a guest prepared to handle interrupts from a
> device that it's put in D3cold and not known to have been waked?
> 

 This behavior depends upon the driver and hypervisor implementation.
 From the host side, the vfio driver triggers the eventfd and then the
 guest driver's ISR routine will be invoked directly without
 involvement of any other code on the kernel side. Now, if the driver
 maintains internally that the device is in a low power state, then it
 can return early from the ISR. But if the driver tries to do any
 config space access, then the behavior will be undefined.

 To handle this case, would removing the handler be the better option,
 since it will make sure that no interrupt is forwarded to the guest?
 Or the vfio driver can handle this interrupt and discard it while the
 guest has the device in D3cold.
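 The discard option could look roughly like this (a plain user-space C
 model for illustration only; model_vdev, the power-state values, and
 the eventfd counter are stand-ins, not the actual vfio-pci INTx
 handler):

```c
#include <stdbool.h>

enum power { D0 = 0, D3HOT = 3, D3COLD = 4 };	/* illustrative values */

struct model_vdev {
	enum power power_state;
	int eventfd_signals;	/* stands in for eventfd_signal() calls */
};

/* Returns true if the (possibly shared) INTx event was forwarded to
 * the guest.  When the guest has put the device in a low power state,
 * the device cannot be the interrupt source, so the event is
 * discarded without touching the device. */
static bool model_intx_handler(struct model_vdev *vdev)
{
	if (vdev->power_state != D0)
		return false;

	vdev->eventfd_signals++;
	return true;
}
```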

> ...
>>>>>> @@ -2219,11 +2341,61 @@ static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>>>>>>  #ifdef CONFIG_PM
>>>>>>  static int vfio_pci_core_runtime_suspend(struct device *dev)
>>>>>>  {
>>>>>> +     struct pci_dev *pdev = to_pci_dev(dev);
>>>>>> +     struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>>>>>> +
>>>>>> +     down_read(&vdev->memory_lock);
>>>>>> +
>>>>>> +     /*
>>>>>> +      * runtime_suspend_pending won't be set if there is no user of vfio pci
>>>>>> +      * device. In that case, return early and PCI core will take care of
>>>>>> +      * putting the device in the low power state.
>>>>>> +      */
>>>>>> +     if (!vdev->runtime_suspend_pending) {
>>>>>> +             up_read(&vdev->memory_lock);
>>>>>> +             return 0;
>>>>>> +     }
>>>>>
>>>>> Doesn't this also mean that idle, unused devices can at best sit in
>>>>> d3hot rather than d3cold?
>>>>>
>>>>
>>>>  Sorry. I didn't get this point.
>>>>
>>>>  For unused devices, the PCI core will move the device into D3cold directly.
>>>
>>> Could you point out what path triggers that?  I inferred that this
>>> function would be called any time the usage count allows transition to
>>> d3cold and the above test would prevent the device entering d3cold
>>> unless the user requested it.
>>>
>>
>>  For PCI runtime suspend, there are 2 options:
>>
>>  1. Don’t change the device power state from D0 in the driver
>>     runtime suspend callback. In this case, pci_pm_runtime_suspend()
>>     will handle all the things.
>>
>>     https://elixir.bootlin.com/linux/v5.17-rc8/source/drivers/pci/pci-driver.c#L1285
>>
>>     For an unused device, runtime_suspend_pending will be false since
>>     it can only be set by the d3cold ioctl.
> 
> So our runtime_suspend callback is not gating putting the device into
> d3cold, we effectively do the same thing either way, it's only
> protected by the memory_lock in the case that the user has requested
> it.  Using runtime_suspend_pending here seems a bit misleading since
> theoretically we'd want to hold memory_lock in any case of getting to
> the runtime_suspend callback while the device is opened.
> 
>>  2. With the used device, the device state will be changed to D3hot first
>>     with vfio_pm_config_write(). In this case, the pci_pm_runtime_suspend()
>>     expects that all the handling has already been done by driver,
>>     otherwise it will print the warning and return early.
>>
>>     “PCI PM: State of device not saved by %pS”
>>
>>     https://elixir.bootlin.com/linux/v5.17-rc8/source/drivers/pci/pci-driver.c#L1280
>>
>>>>  For the used devices, the config space write happens first, before
>>>>  this ioctl is called, and that write moves the device into D3hot,
>>>>  so we need to do some manual handling here.
>>>
>>> Why is it that a user owned device cannot re-enter d3cold without
>>> driver support, but an idle device does?  Simply because we expect to
>>> reset the device before returning it back to the host or exposing it to
>>> a user?  I'd expect that after d3cold->d0 we're essentially at a
>>> power-on state, which ideally would be similar to a post-reset state,
>>> so I don't follow how driver support factors in to re-entering d3cold.
>>>
>>
>>  For the nvidia GPU, an idle unused device is equivalent to an
>>  uninitialized PCI device. In this case, no internal HW modules,
>>  such as the video memory, will be initialized. So, it does not
>>  matter what we do with the device before that. It is fine to
>>  re-enter d3cold since the HW itself is not initialized. Once the
>>  device is owned by a user, the nvidia driver will run on the guest
>>  OS side and initialize all the HW modules, including video memory.
>>  Now, before removing the power, we need to make sure that the
>>  video memory comes back in the same state after resume as before
>>  suspending.
>>
>>  If we don’t keep the video memory in self refresh state, then it is
>>  equivalent to power on state. But if we keep the video memory
>>  in self refresh state, then it is different from power-on state.
>>
>>>>>> +
>>>>>> +     /*
>>>>>> +      * The runtime suspend will be called only if device is already at
>>>>>> +      * D3hot state. Now, change the device state from D3hot to D3cold by
>>>>>> +      * using platform power management. If setting of D3cold is not
>>>>>> +      * supported for the PCI device, then the device state will still be
>>>>>> +      * in D3hot state. The PCI core expects to save the PCI state, if
>>>>>> +      * driver runtime routine handles the power state management.
>>>>>> +      */
>>>>>> +     pci_save_state(pdev);
>>>>>> +     pci_platform_power_transition(pdev, PCI_D3cold);
>>>>>> +     up_read(&vdev->memory_lock);
>>>>>> +
>>>>>>       return 0;
>>>>>>  }
>>>>>>
>>>>>>  static int vfio_pci_core_runtime_resume(struct device *dev)
>>>>>>  {
>>>>>> +     struct pci_dev *pdev = to_pci_dev(dev);
>>>>>> +     struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>>>>>> +
>>>>>> +     down_write(&vdev->memory_lock);
>>>>>> +
>>>>>> +     /*
>>>>>> +      * The PCI core will move the device to D0 state before calling the
>>>>>> +      * driver runtime resume.
>>>>>> +      */
>>>>>> +     vfio_pci_set_power_state_locked(vdev, PCI_D0);
>>>>>
>>>>> Maybe this is where vdev->power_state is kept synchronized?
>>>>>
>>>>
>>>>  Yes. vdev->power_state will be changed here.
>>>>
>>>>>> +
>>>>>> +     /*
>>>>>> +      * Some PCI device needs the SW involvement before going to D3cold
>>>>>> +      * state again. So if there is any wake-up which is not triggered
>>>>>> +      * by the guest, then increase the usage count to prevent the
>>>>>> +      * second runtime suspend.
>>>>>> +      */
>>>>>
>>>>> Can you give examples of devices that need this and the reason they
>>>>> need this?  The interface is not terribly deterministic if a random
>>>>> unprivileged lspci on the host can move devices back to d3hot.
>>>>
>>>>  I am not sure about other devices, but this is happening for
>>>>  the nvidia GPU itself.
>>>> 
>>>>  For the nvidia GPU, during runtime suspend, we keep the GPU video
>>>>  memory in self-refresh mode for high video memory usage. Each video
>>>>  memory self-refresh entry before D3cold requires nvidia SW
>>>>  involvement. Without the SW self-refresh sequence involvement, it
>>>>  won't work.
>>>
>>>
>>> So we're exposing acpi power interfaces to turn a device off, which
>>> don't really turn the device off, but leaves it in some sort of
>>> low-power memory refresh state, rather than a fully off state as I had
>>> assumed above.  Does this suggest the host firmware ACPI has knowledge
>>> of the device and does different things?
>>>
>>
>>  I was trying to find the public document regarding this part and
>>  it seems following Windows document can help in providing some
>>  information related with this
>>
>>  https://docs.microsoft.com/en-us/windows-hardware/drivers/bringup/firmware-requirements-for-d3cold
>>
>>  “Putting a device in D3cold does not necessarily mean that all
>>   sources of power to the device have been removed—it means only
>>   that the main power source, Vcc, is removed. The auxiliary power
>>   source, Vaux, might also be removed if it is not required for
>>   the wake logic”.
>>
>>  So, the generic self-refresh D3cold (meaning on desktops) mainly
>>  relies on auxiliary power. For notebooks, we ask for some
>>  customization on the acpi power interfaces side to support
>>  video memory self-refresh.
> 
> And that customization must rely on some aspect of the GPU state,
> right?

 The notebook customization is mainly related to the ACPI
 implementation and the HW design, so that the main power is
 turned off while the aux power is still supplied during that time.

> We send the GPU to d3cold the first time and we get this memory
> self-refresh behavior, but the claim here is that if we wake the device
> to d0 and send it back to d3cold that video memory will be lost.  So
> the variable here has something to do with the device state itself.
> Therefore is there some device register that could be preserved and
> restored around d3cold so that we could go back into the self-refresh
> state?
> 

 Following are more details around this.

 1. For the first D0->D3cold transition, the driver running on the
    guest side uses the firmware state machine. The firmware will
    still be running, since the GPU will be in the powered-on state
    unless the root port actually removes power.

 2. Before removing power, the root port sends the PME_Turn_Off message.
    The details around this area are documented in

    [PCIe spec v5, PME Synchronization 5.3.3.2.1]

 5.3.3.2.1 PME Synchronization

 PCI Express-PM introduces a fence mechanism that serves to
 initiate the power removal sequence while also
 coordinating the behavior of the platform's power management
 controller and PME handling by PCI Express agents.
 PME_Turn_Off Broadcast Message

 Before main component power and reference clocks are
 turned off, the Root Complex or Switch Downstream Port must
 issue a broadcast Message that instructs all agents
 Downstream of that point within the hierarchy to cease initiation of
 any subsequent PM_PME Messages, effective immediately upon
 receipt of the PME_Turn_Off Message. Each PCI Express agent is required
 to respond with a TLP “acknowledgement” Message, PME_TO_Ack that is always
 routed Upstream.


 3. The firmware running inside the nvidia GPU listens for this
    PME_Turn_Off and moves the video memory into the self-refresh state.
 
 4. If video memory self-refresh is not required, then no handling
    is needed for these PME messages, since the power to all components
    is going to be removed and the driver will initialize the
    video memory from scratch again.

 Now, in the second D3cold transition, the firmware which is
 responsible for putting the memory in the self-refresh state won’t be
 running, which is why we can’t go into the self-refresh state during
 the second D3cold transition. This firmware load and initialization
 can happen only from the driver.
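 The difference between the two transitions can be captured in a small
 model (plain user-space C, illustrative only; model_gpu and
 model_pme_turn_off are stand-ins for the GPU firmware behavior
 described above):

```c
#include <stdbool.h>

struct model_gpu {
	bool firmware_running;	/* loaded/initialized only by the driver */
	bool vram_self_refresh;
};

/* The root port broadcasts PME_Turn_Off before removing power; only
 * firmware that is still running can latch VRAM into self-refresh in
 * response.  Power removal then stops the firmware itself, so a
 * second D3cold entry without driver involvement loses VRAM. */
static void model_pme_turn_off(struct model_gpu *g)
{
	g->vram_self_refresh = g->firmware_running;
	g->firmware_running = false;
}
```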

> 
>>>>  Details regarding runtime suspend with self-refresh can be found in
>>>>
>>>>  https://download.nvidia.com/XFree86/Linux-x86_64/495.46/README/dynamicpowermanagement.html#VidMemThreshold
>>>>
>>>>  But, if GPU video memory usage is low, then we turn off the video
>>>>  memory and save all the allocations in system memory. In this case,
>>>>  SW involvement is not required.
>>>
>>> Ok, so there's some heuristically determined vram usage where the
>>> driver favors suspend latency versus power savings and somehow keeps
>>> the device in this low-power, refresh state versus a fully off state.
>>> How unique is this behavior to NVIDIA devices?  It seems like we're
>>> trying to add d3cold, but special case it based on a device that might
>>> have a rather quirky d3cold behavior.  Is there something we can test
>>> about the state of the device to know which mode it's using?
>>
>>  Since vfio is a generic driver, testing the device mode here
>>  seems challenging.
>>
>>> Is there something we can virtualize on the device to force the driver to use
>>> the higher latency, lower power d3cold mode that results in fewer
>>> restrictions?  Or maybe this is just common practice?
>>>
>>
>>  Yes. We can enforce this. But this option won’t be useful for modern
>>  use cases. Assume we have 16GB of video memory in use; in that
>>  case, entry and exit will take a lot of time and make the feature
>>  unusable. Also, system memory is limited on the guest side, so
>>  having enough system memory is again a challenge.
> 
> Good point, the potential extent of video memory is too excessive to
> not support a self-refresh mode.
> 
>>>>> How useful is this implementation if a notice to the guest of a resumed
>>>>> device is TBD?  Thanks,
>>>>>
>>>>> Alex
>>>>>
>>>>
>>>>  I have prototyped this earlier by using an eventfd_ctx for pme: whenever we
>>>>  get a resume triggered by the host, it forwards the same to the hypervisor.
>>>>  The hypervisor can then write into the virtual root port's PME related
>>>>  registers and send a PME event which will wake up the PCI device on the
>>>>  guest side. It will also help in handling PME event related wake-ups, which
>>>>  are currently disabled in PATCH 2 of this patch series.
>>>
>>> But then what does the guest do with the device?  For example, if we
>>> have a VM with an assigned GPU running an idle desktop where the
>>> monitor has gone into power save, does running lspci on the host
>>> randomly wake the desktop and monitor?
>>
>>  For Linux OS + NVIDIA driver, it seems it will just wake up the
>>  GPU and not the monitor. With a bare-metal setup, I waited
>>  for the monitor to go off with DPMS and then the GPU went into
>>  the suspended state. After that, if I ran the lspci command,
>>  the GPU moved to the active state but the monitor was
>>  still off, and after lspci, the GPU went into the
>>  suspended state again.
>>
>>  The monitor wakes up only if I do keyboard or mouse
>>  movement.
> 
> The monitor waking would clearly be a user visible sign that this
> doesn't work according to plan,

  I have also confirmed this internally with the user space graphics
  driver folks: the GPU wake-up should not cause any monitor wake-up.

> but the fact that the GPU is awake and consuming power, wasting
> battery on a mobile platform, still seems like a symptom that this
> solution is incomplete.
> >>> I'd like to understand how
>>> unique the return to d3cold behavior is to this device and whether we
>>> can restrict that in some way.  An option that's now at our disposal
>>> would be to create an NVIDIA GPU variant of vfio-pci that has
>>> sufficient device knowledge to perhaps retrigger the vram refresh
>>> d3cold state rather than lose vram data going into a standard d3cold
>>> state.  Thanks,
>>>
>>> Alex
>>>
>>
>>  Adding a vram refresh d3cold state with a vfio-pci variant is not
>>  straightforward without involvement of the nvidia driver itself.
>>
>>  One option is to add a flag in the D3cold IOCTL itself to differentiate
>>  between 2 variants of D3cold entry (one which allows re-entering
>>  D3cold and another which won’t allow re-entering D3cold) and
>>  default it to allowing re-entry. For the nvidia or similar use
>>  cases, the hypervisor can set this flag to prevent re-entering D3cold.
> 
> QEMU doesn't know the hardware behavior either.
> 

 We can pass this information through a command line parameter which
 the user can set. This parameter can be specified per device,
 but this requires the user to be aware of the PCI device's behavior.

>>  Otherwise, we can add NVIDIA vendor ID check and restrict this
>>  to nvidia alone.
> 
> Either of these solutions presumes there's a worthwhile use case
> regardless of the fact that the GPU can be woken by arbitrary,
> unprivileged actions on the host.

 Yes. In that case, another option would be to prevent GPU wake-up
 completely on the host side. We could add some flag in the core PM
 code and return early without waking up the device if this flag is
 set. So, if the user runs lspci or a similar command, then an error
 will be returned from the PM runtime resume API and all the values
 will be 0xffff. In pass-through mode, the PCI device is owned by the
 guest, so if the guest has put the device into D3cold, then the host
 can honor that instead of waking the device.

 But we need to see how this model works with a multi-function device
 where not all functions are bound to the vfio driver.
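 A rough sketch of this wake-blocking idea (plain user-space C; the
 guest_owns_power flag and the helper names are hypothetical, and the
 real change would have to live in the PM core):

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct model_dev {
	bool guest_owns_power;	/* hypothetical "guest requested D3cold" flag */
	bool powered;
};

static int model_runtime_resume(struct model_dev *d)
{
	/* Honor the guest's D3cold request: refuse host-side wake-ups. */
	if (d->guest_owns_power)
		return -EACCES;
	d->powered = true;
	return 0;
}

/* A host config read (e.g. lspci) resumes the device first; while the
 * wake is blocked it sees all-ones, as with an absent device. */
static uint32_t model_config_read(struct model_dev *d)
{
	if (model_runtime_resume(d) < 0)
		return 0xffffffff;
	return 0x10de;	/* e.g. the vendor ID once the device is awake */
}
```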

> It seems that either we should be
> able to put the device back into a low power state ourselves after such
> an event

 For the NVIDIA GPU, we can’t put the device back into a low power
 state ourselves, since we need firmware to be present on the GPU
 and this can be loaded and initialized only by the driver.

> or be able to trigger an eventfd to the user which is plumbed
> through pme in the guest so that the guest can put the device back to
> low power after such an event.  Getting the device into a transient low
> power state that it can slip out of so easily doesn't seem like a
> complete solution to me.  Thanks,
> 
> Alex
> 

 Yes. But this also won’t work in all cases since it depends upon
 QEMU or the hypervisor. The PME interrupt will go to the root port
 instead of the endpoint, so we need to create a virtual root port
 as well in that case. If nothing else works out, then we can go with
 this approach.
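 For reference, that eventfd plumbing would conceptually look like the
 sketch below (a user-space model; pme_eventfd_count stands in for an
 eventfd_ctx that the kernel side would signal so the hypervisor can
 inject a PME through the virtual root port):

```c
#include <stdbool.h>
#include <stdint.h>

struct model_pm {
	bool guest_requested_d3cold;
	uint64_t pme_eventfd_count;	/* stand-in for eventfd_signal() */
};

/* Host-triggered runtime resume: if the guest had put the device into
 * D3cold, notify user space so the hypervisor can emulate a PME and
 * let the guest driver put the device back to low power. */
static void model_runtime_resume_notify(struct model_pm *p)
{
	if (p->guest_requested_d3cold)
		p->pme_eventfd_count++;
}
```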

 Thanks,
 Abhishek

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-03-24 14:27 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-24 18:17 [RFC PATCH v2 0/5] vfio/pci: Enable runtime power management support Abhishek Sahu
2022-01-24 18:17 ` [RFC PATCH v2 1/5] vfio/pci: register vfio-pci driver with runtime PM framework Abhishek Sahu
2022-02-16 23:48   ` Alex Williamson
2022-02-21  6:35     ` Abhishek Sahu
2022-01-24 18:17 ` [RFC PATCH v2 2/5] vfio/pci: virtualize PME related registers bits and initialize to zero Abhishek Sahu
2022-01-24 18:17 ` [RFC PATCH v2 3/5] vfio/pci: fix memory leak during D3hot to D0 tranistion Abhishek Sahu
2022-01-28  0:05   ` Alex Williamson
2022-01-31 11:34     ` Abhishek Sahu
2022-01-31 15:33       ` Alex Williamson
2022-01-24 18:17 ` [RFC PATCH v2 4/5] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
2022-01-25  2:35   ` kernel test robot
2022-02-17 23:14   ` Alex Williamson
2022-02-21  8:12     ` Abhishek Sahu
2022-01-24 18:17 ` [RFC PATCH v2 5/5] vfio/pci: add the support for PCI D3cold state Abhishek Sahu
2022-03-09 17:26   ` Alex Williamson
2022-03-11 15:45     ` Abhishek Sahu
2022-03-11 23:06       ` Alex Williamson
2022-03-16  5:41         ` Abhishek Sahu
2022-03-16 18:44           ` Alex Williamson
2022-03-24 14:27             ` Abhishek Sahu
2022-03-11 16:17     ` Jason Gunthorpe
