* [PATCH v4 0/6] vfio/pci: power management changes
@ 2022-07-01 11:08 Abhishek Sahu
  2022-07-01 11:08 ` [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-01 11:08 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

This is part 2 of the vfio-pci driver power management support.
Part 1 of this patch series added D3cold support for the case where
there is no user of the VFIO device, and has already been merged into
the mainline kernel. If runtime power management is enabled for a
vfio-pci device in the guest OS, then the device is runtime suspended
(for a Linux guest OS) and the PCI device is put into the D3hot state
(in vfio_pm_config_write()). If the D3cold state can be used instead
of D3hot, it will save the maximum amount of power. The D3cold state
cannot be reached with native PCI PM alone; it requires interaction
with platform firmware, which is system-specific. To go into low power
states (including D3cold), the runtime PM framework can be used, which
internally interacts with the PCI core and platform firmware and puts
the device into the lowest possible D-state.

This patch series adds support for runtime power management initiated
by the user. Since the D3cold state can't be reached by writing the
standard PCI PM config registers, a feature has been added to the
DEVICE_FEATURE IOCTL for power management handling. It includes flags
that can be used to move the device into and out of a low power state.
For a PCI device, this low power state will be D3cold (if the platform
supports it). Hypervisors can implement virtual ACPI methods for
integration with the guest OS. For example, in a guest Linux OS, if
the PCI device's ACPI node has _PR3 and _PR0 power resources with
_ON/_OFF methods, then the guest makes the _OFF call during the D3cold
transition and the _ON call during the D0 transition. The hypervisor
can trap these virtual ACPI calls and then issue the low power related
IOCTL.

Some devices (like NVIDIA VGA or 3D controllers) require driver
involvement each time before going into D3cold. Once the guest has put
the device into D3cold, the user can run commands on the host side
(like lspci). The runtime PM framework will resume the device before
it is accessed and suspend it again afterwards. This second suspend
happens without guest driver involvement, so the vfio-pci driver won't
suspend the device if re-entry to low power is not allowed. This patch
series also adds virtual PME (Power Management Event) support, which
can be used to notify the guest OS of such host accesses. The guest
can then put the device into the suspended state again.

* Changes in v4

- Rebased patches on v5.19-rc4.
- Added virtual PME support.
- Used flags for low power entry and exit instead of an explicit variable.
- Added support to keep NVIDIA display related controllers in the active
  state if there is any activity on the host side.
- Added a flag that can be set by the user to keep the device in the
  active state if there is any activity on the host side.
- Split the D3cold patch into smaller patches.
- Kept the runtime PM usage count incremented for all the IOCTLs
  (except the power management IOCTL) and all PCI region accesses.
- Masked the runtime errors behind -EIO.
- Refactored the logic in the runtime suspend/resume routines and the
  power management device feature IOCTL.
- Added a helper function for pm_runtime_put() in drivers/vfio/vfio.c
  as well, using 'struct vfio_device' as the function parameter.
- Removed the requirement to move the device into D3hot before calling
  low power entry.
- Renamed power management related new members in the structure.
- Used 'pm_runtime_engaged' check in __vfio_pci_memory_enabled().

* Changes in v3
  (https://lore.kernel.org/lkml/20220425092615.10133-1-abhsahu@nvidia.com)

- Rebased patches on v5.18-rc3.
- Marked this series as PATCH instead of RFC.
- Addressed the review comments given in v2.
- Removed the limitation that kept the device in the D0 state if there
  was any access from the host side. This is specific to the NVIDIA use
  case and will be handled separately.
- Used the existing DEVICE_FEATURE IOCTL itself instead of adding a new
  IOCTL for power management.
- Removed all custom power management code from the runtime
  suspend/resume callbacks and IOCTL handling. Now, the callbacks
  contain only the INTx handling and a few other things, and all the
  PCI state and platform PM handling is done by the PCI core
  functions themselves.
- Added wake-up support in the main vfio layer itself since we now have
  more vfio/pci based drivers.
- Instead of assigning the 'struct dev_pm_ops' in each individual parent
  driver, the vfio_pci_core itself now assigns the 'struct dev_pm_ops'.
- Added handling of power management around SR-IOV handling.
- Moved the setting of drvdata in a separate patch.
- Masked INTx before going into the runtime suspended state.
- Changed the order of patches so that fixes are at the beginning
  of this patch series.
- Removed storing the power state locally and used one new boolean to
  track the d3 (D3cold and D3hot) power state.
- Removed check for IO access in D3 power state.
- Used another helper function vfio_lock_and_set_power_state() instead
  of touching vfio_pci_set_power_state().
- Considered the fixes made in
  https://lore.kernel.org/lkml/20220217122107.22434-1-abhsahu@nvidia.com
  and updated the patches accordingly.

* Changes in v2
  (https://lore.kernel.org/lkml/20220124181726.19174-1-abhsahu@nvidia.com)

- Rebased patches on v5.17-rc1.
- Included the patch to handle BAR access in D3cold.
- Included the patch to fix memory leak.
- Made a separate IOCTL that can be used to change the power state from
  D3hot to D3cold and D3cold to D0.
- Addressed the review comments given in v1.

Abhishek Sahu (6):
  vfio/pci: Mask INTx during runtime suspend
  vfio: Add a new device feature for the power management
  vfio: Increment the runtime PM usage count during IOCTL call
  vfio/pci: Add the support for PCI D3cold state
  vfio/pci: Prevent low power re-entry without guest driver
  vfio/pci: Add support for virtual PME

 drivers/vfio/pci/vfio_pci_config.c |  41 +++-
 drivers/vfio/pci/vfio_pci_core.c   | 312 +++++++++++++++++++++++++++--
 drivers/vfio/pci/vfio_pci_intrs.c  |  24 ++-
 drivers/vfio/vfio.c                |  82 +++++++-
 include/linux/vfio_pci_core.h      |   8 +-
 include/uapi/linux/vfio.h          |  56 ++++++
 6 files changed, 492 insertions(+), 31 deletions(-)

-- 
2.17.1



* [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend
  2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
@ 2022-07-01 11:08 ` Abhishek Sahu
  2022-07-06 15:39   ` Alex Williamson
  2022-07-01 11:08 ` [PATCH v4 2/6] vfio: Add a new device feature for the power management Abhishek Sahu
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-01 11:08 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

This patch adds INTx handling during runtime suspend/resume.
All the suspend/resume related code for the user to put the device
into the low power state will be added in subsequent patches.

INTx can be shared among devices. Whenever an INTx interrupt arrives
for a VFIO device, vfio_intx_handler() is called for each device
sharing the line. Inside vfio_intx_handler(), pci_check_and_mask_intx()
is called to check whether the interrupt was generated by the current
device. Now, if the device is already in the D3cold state, the config
space cannot be read; attempting to read it in D3cold can make some
systems unresponsive. To prevent this, mask INTx in the runtime suspend
callback and unmask it in the runtime resume callback. If INTx has
already been masked, then no handling is needed in the runtime
suspend/resume callbacks. 'pm_intx_masked' tracks this, and
vfio_pci_intx_mask() has been updated to return true if INTx has been
masked inside this function.

For a runtime suspend triggered when there is no user of the VFIO
device, is_intx() will return false and these callbacks won't do
anything.

MSI/MSI-X interrupts are not shared, so similar handling is not needed
for them. vfio_msihandler() triggers eventfd_signal() without doing any
device-specific config access, and when the user performs any config
access or IOCTL after receiving the eventfd notification, the device
will be moved to the D0 state first before servicing any request.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c  | 37 +++++++++++++++++++++++++++----
 drivers/vfio/pci/vfio_pci_intrs.c |  6 ++++-
 include/linux/vfio_pci_core.h     |  3 ++-
 3 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a0d69ddaf90d..5948d930449b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	return ret;
 }
 
+#ifdef CONFIG_PM
+static int vfio_pci_core_runtime_suspend(struct device *dev)
+{
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
+
+	/*
+	 * If INTx is enabled, then mask INTx before going into the runtime
+	 * suspended state and unmask the same in the runtime resume.
+	 * If INTx has already been masked by the user, then
+	 * vfio_pci_intx_mask() will return false and in that case, INTx
+	 * should not be unmasked in the runtime resume.
+	 */
+	vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev));
+
+	return 0;
+}
+
+static int vfio_pci_core_runtime_resume(struct device *dev)
+{
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
+
+	if (vdev->pm_intx_masked)
+		vfio_pci_intx_unmask(vdev);
+
+	return 0;
+}
+#endif /* CONFIG_PM */
+
 /*
- * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
- * so use structure without any callbacks.
- *
  * The pci-driver core runtime PM routines always save the device state
  * before going into suspended state. If the device is going into low power
  * state with only with runtime PM ops, then no explicit handling is needed
  * for the devices which have NoSoftRst-.
  */
-static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
+static const struct dev_pm_ops vfio_pci_core_pm_ops = {
+	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
+			   vfio_pci_core_runtime_resume,
+			   NULL)
+};
 
 int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 {
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 6069a11fb51a..1a37db99df48 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
 		eventfd_signal(vdev->ctx[0].trigger, 1);
 }
 
-void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
+/* Returns true if INTx has been masked by this function. */
+bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	unsigned long flags;
+	bool intx_masked = false;
 
 	spin_lock_irqsave(&vdev->irqlock, flags);
 
@@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 			disable_irq_nosync(pdev->irq);
 
 		vdev->ctx[0].masked = true;
+		intx_masked = true;
 	}
 
 	spin_unlock_irqrestore(&vdev->irqlock, flags);
+	return intx_masked;
 }
 
 /*
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 23c176d4b073..cdfd328ba6b1 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -124,6 +124,7 @@ struct vfio_pci_core_device {
 	bool			needs_reset;
 	bool			nointx;
 	bool			needs_pm_restore;
+	bool			pm_intx_masked;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
@@ -147,7 +148,7 @@ struct vfio_pci_core_device {
 #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
 #define irq_is(vdev, type) (vdev->irq_type == type)
 
-extern void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
+extern bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
 extern void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
 
 extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev,
-- 
2.17.1



* [PATCH v4 2/6] vfio: Add a new device feature for the power management
  2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
  2022-07-01 11:08 ` [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
@ 2022-07-01 11:08 ` Abhishek Sahu
  2022-07-06 15:39   ` Alex Williamson
  2022-07-01 11:08 ` [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-01 11:08 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
for power management in the header file. The implementation for the
same will be added in subsequent patches.

Not all power states can be reached with the standard PCI PM registers.
Platform-based power management needs to be involved to go into the
lowest power state, and this device feature can be used for all
platform-based power management.

This device feature uses flags to specify the different operations. In
the future, if more power management functionality is needed, then a
new flag can be added to it. It supports both GET and SET operations.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 733a1cddde30..7e00de5c21ea 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -986,6 +986,61 @@ enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
+/*
+ * Perform power management-related operations for the VFIO device.
+ *
+ * The low power feature uses platform-based power management to move the
+ * device into the low power state.  This low power state is device-specific.
+ *
+ * This device feature uses flags to specify the different operations.
+ * It supports both the GET and SET operations.
+ *
+ * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
+ *   state with platform-based power management.  This low power state will be
+ *   internal to the VFIO driver and the user will not come to know which power
+ *   state is chosen.  Once the user has moved the VFIO device into the low
+ *   power state, then the user should not do any device access without moving
+ *   the device out of the low power state.
+ *
+ * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
+ *    state.  This flag should only be set if the user has previously put the
+ *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.
+ *
+ * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
+ *
+ * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
+ *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
+ *   the host side, then the device will be moved out of the low power state
+ *   without the user's guest driver involvement.  Some devices require the
+ *   user's guest driver involvement for each low-power entry.  If this flag is
+ *   set, then the re-entry to the low power state will be disabled, and the
+ *   host kernel will not move the device again into the low power state.
+ *   The VFIO driver internally maintains a list of devices for which low
+ *   power re-entry is disabled by default and for those devices, the
+ *   re-entry will be disabled even if the user has not set this flag
+ *   explicitly.
+ *
+ * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
+ *
+ * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
+ *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
+ *
+ * - If the device is in a normal power state currently, then
+ *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
+ *   power re-entry is disabled by default.  If the device is in the low power
+ *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
+ *   according to the current transition.
+ */
+struct vfio_device_feature_power_management {
+	__u32	flags;
+#define VFIO_PM_LOW_POWER_ENTER			(1 << 0)
+#define VFIO_PM_LOW_POWER_EXIT			(1 << 1)
+#define VFIO_PM_LOW_POWER_REENTERY_DISABLE	(1 << 2)
+	__u32	reserved;
+};
+
+#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.17.1



* [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call
  2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
  2022-07-01 11:08 ` [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
  2022-07-01 11:08 ` [PATCH v4 2/6] vfio: Add a new device feature for the power management Abhishek Sahu
@ 2022-07-01 11:08 ` Abhishek Sahu
  2022-07-06 15:40   ` Alex Williamson
  2022-07-01 11:08 ` [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-01 11:08 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

The vfio-pci based drivers will have runtime power management support,
where the user can put the device into the low power state and PCI
devices can then go into the D3cold state. If the device is in the low
power state and the user issues any IOCTL, then the device should be
moved out of the low power state first. Once the IOCTL has been
serviced, it can go into the low power state again. The runtime PM
framework manages this with the help of a usage count.

One option was to add the runtime PM related API calls inside the
vfio-pci driver, but some IOCTLs (like VFIO_DEVICE_FEATURE) can follow
a different path and more IOCTLs may be added in the future. Also,
runtime PM is currently being added for the vfio-pci based driver
variants, but other VFIO based drivers can use the same support in the
future. So, this patch adds the runtime PM related API calls in the
top-level IOCTL function itself.

For the VFIO drivers which do not currently have runtime power
management support, the runtime PM APIs won't be invoked. Only for
vfio-pci based drivers will the runtime PM APIs be invoked to increment
and decrement the usage count.

Keeping this usage count incremented while servicing an IOCTL makes
sure that the user can't put the device into the low power state while
any other IOCTL is being serviced in parallel. Let's consider the
following scenario:

 1. Some other IOCTL is called.
 2. The user has opened another device instance and called the power
    management IOCTL for the low power entry.
 3. The power management IOCTL moves the device into the low power state.
 4. The other IOCTL finishes.

If we don't keep the usage count incremented, then device access will
happen between steps 3 and 4 while the device has already gone into
the low power state.

The runtime PM APIs should not be invoked for
VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
the runtime power management entry and exit for the VFIO device.

The pm_runtime_resume_and_get() will be the first call, so its errors
should not be propagated to user space directly. For example,
pm_runtime_resume_and_get() can return -EINVAL even for cases where the
user has passed correct arguments. So the pm_runtime_resume_and_get()
errors have been masked behind -EIO.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/vfio.c | 82 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 74 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 61e71c1154be..61a8d9f7629a 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -32,6 +32,7 @@
 #include <linux/vfio.h>
 #include <linux/wait.h>
 #include <linux/sched/signal.h>
+#include <linux/pm_runtime.h>
 #include "vfio.h"
 
 #define DRIVER_VERSION	"0.3"
@@ -1333,6 +1334,39 @@ static const struct file_operations vfio_group_fops = {
 	.release	= vfio_group_fops_release,
 };
 
+/*
+ * Wrapper around pm_runtime_resume_and_get().
+ * Return error code on failure or 0 on success.
+ */
+static inline int vfio_device_pm_runtime_get(struct vfio_device *device)
+{
+	struct device *dev = device->dev;
+
+	if (dev->driver && dev->driver->pm) {
+		int ret;
+
+		ret = pm_runtime_resume_and_get(dev);
+		if (ret < 0) {
+			dev_info_ratelimited(dev,
+				"vfio: runtime resume failed %d\n", ret);
+			return -EIO;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Wrapper around pm_runtime_put().
+ */
+static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
+{
+	struct device *dev = device->dev;
+
+	if (dev->driver && dev->driver->pm)
+		pm_runtime_put(dev);
+}
+
 /*
  * VFIO Device fd
  */
@@ -1607,6 +1641,8 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
 {
 	size_t minsz = offsetofend(struct vfio_device_feature, flags);
 	struct vfio_device_feature feature;
+	int ret = 0;
+	u16 feature_cmd;
 
 	if (copy_from_user(&feature, arg, minsz))
 		return -EFAULT;
@@ -1626,28 +1662,51 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
 	    (feature.flags & VFIO_DEVICE_FEATURE_GET))
 		return -EINVAL;
 
-	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
+	feature_cmd = feature.flags & VFIO_DEVICE_FEATURE_MASK;
+
+	/*
+	 * The VFIO_DEVICE_FEATURE_POWER_MANAGEMENT itself performs the runtime
+	 * power management entry and exit for the VFIO device, so the runtime
+	 * PM API's should not be called for this feature.
+	 */
+	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT) {
+		ret = vfio_device_pm_runtime_get(device);
+		if (ret)
+			return ret;
+	}
+
+	switch (feature_cmd) {
 	case VFIO_DEVICE_FEATURE_MIGRATION:
-		return vfio_ioctl_device_feature_migration(
+		ret = vfio_ioctl_device_feature_migration(
 			device, feature.flags, arg->data,
 			feature.argsz - minsz);
+		break;
 	case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE:
-		return vfio_ioctl_device_feature_mig_device_state(
+		ret = vfio_ioctl_device_feature_mig_device_state(
 			device, feature.flags, arg->data,
 			feature.argsz - minsz);
+		break;
 	default:
 		if (unlikely(!device->ops->device_feature))
-			return -EINVAL;
-		return device->ops->device_feature(device, feature.flags,
-						   arg->data,
-						   feature.argsz - minsz);
+			ret = -EINVAL;
+		else
+			ret = device->ops->device_feature(
+				device, feature.flags, arg->data,
+				feature.argsz - minsz);
+		break;
 	}
+
+	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT)
+		vfio_device_pm_runtime_put(device);
+
+	return ret;
 }
 
 static long vfio_device_fops_unl_ioctl(struct file *filep,
 				       unsigned int cmd, unsigned long arg)
 {
 	struct vfio_device *device = filep->private_data;
+	int ret;
 
 	switch (cmd) {
 	case VFIO_DEVICE_FEATURE:
@@ -1655,7 +1714,14 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
 	default:
 		if (unlikely(!device->ops->ioctl))
 			return -EINVAL;
-		return device->ops->ioctl(device, cmd, arg);
+
+		ret = vfio_device_pm_runtime_get(device);
+		if (ret)
+			return ret;
+
+		ret = device->ops->ioctl(device, cmd, arg);
+		vfio_device_pm_runtime_put(device);
+		return ret;
 	}
 }
 
-- 
2.17.1



* [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state
  2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
                   ` (2 preceding siblings ...)
  2022-07-01 11:08 ` [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
@ 2022-07-01 11:08 ` Abhishek Sahu
  2022-07-06 15:40   ` Alex Williamson
  2022-07-01 11:08 ` [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver Abhishek Sahu
  2022-07-01 11:08 ` [PATCH v4 6/6] vfio/pci: Add support for virtual PME Abhishek Sahu
  5 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-01 11:08 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

Currently, if runtime power management is enabled for a vfio-pci based
device in the guest OS, then the guest OS will write the PCI_PM_CTRL
register. This write request is handled in vfio_pm_config_write(),
where the actual register write of PCI_PM_CTRL is done. With this, at
most the D3hot state can be reached for low power. If we can use the
runtime PM framework, then we can reach the D3cold state (on supported
systems), which will help in saving the maximum amount of power.

1. The D3cold state can't be reached by writing the PCI standard
   PM config registers. This patch implements the newly added
   'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' device feature, which
   can be used for putting the device into the D3cold state.

2. The hypervisors can implement virtual ACPI methods. For example,
   in a guest Linux OS, if the PCI device's ACPI node has _PR3 and _PR0
   power resources with _ON/_OFF methods, then the guest invokes the
   _OFF method during the D3cold transition and then _ON during the D0
   transition. The hypervisor can trap these virtual ACPI calls and
   then call 'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' with the
   respective flags.

3. The vfio-pci driver uses the runtime PM framework to reach the
   D3cold state. For the D3cold transition, decrement the usage count,
   and for the D0 transition, increment the usage count.

4. If the D3cold state is not supported, then the device will still go
   into the D3hot state. But with runtime PM, the root port can now
   also go into the suspended state.

5. The 'pm_runtime_engaged' flag tracks the entry to and exit from
   runtime PM. This flag is protected by the 'memory_lock' semaphore.

6. At exit time, the flag clearing and usage count increment are
   protected by 'memory_lock'. The actual wake-up happens outside
   'memory_lock' since 'memory_lock' will also be needed inside the
   runtime_resume callback in subsequent patches.

7. In D3cold, all kinds of device-related access (BAR read/write,
   config read/write, etc.) need to be disabled. For BAR-related
   access, the existing D3hot memory disable support can be used.
   During the low power entry, invalidate the mmap mappings and add a
   check for the 'pm_runtime_engaged' flag.

8. For config space, ideally we need to return an error whenever there
   is any config access from the user side once the user has moved the
   device into the low power state. But a check for the
   'pm_runtime_engaged' flag alone won't be sufficient due to the
   following possible scenario on the user side, where a config space
   access happens in parallel with the low power entry IOCTL:

   a. A config space access happens and vfio_pci_config_rw() is
      called.
   b. The IOCTL to move into the low power state is called.
   c. The IOCTL moves the device into D3cold.
   d. vfio_pci_config_rw() returns.

   Now, if we just check 'pm_runtime_engaged', then in the above
   sequence the config space access will happen while the device is
   already in the low power state. To prevent this situation, we
   increment the usage count before any config space access and
   decrement it after completing the access. Also, to prevent any
   similar cases for other types of access, the usage count will be
   incremented for all kinds of access.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c |   2 +-
 drivers/vfio/pci/vfio_pci_core.c   | 169 +++++++++++++++++++++++++++--
 include/linux/vfio_pci_core.h      |   1 +
 3 files changed, 164 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 9343f597182d..21a4743d011f 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -408,7 +408,7 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
 	 * PF SR-IOV capability, there's therefore no need to trigger
 	 * faults based on the virtual value.
 	 */
-	return pdev->current_state < PCI_D3hot &&
+	return !vdev->pm_runtime_engaged && pdev->current_state < PCI_D3hot &&
 	       (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY));
 }
 
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 5948d930449b..8c17ca41d156 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -264,6 +264,18 @@ static int vfio_pci_core_runtime_suspend(struct device *dev)
 {
 	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
 
+	down_write(&vdev->memory_lock);
+	/*
+	 * The user can move the device into D3hot state before invoking
+	 * power management IOCTL. Move the device into D0 state here and then
+	 * the pci-driver core runtime PM suspend function will move the device
+	 * into the low power state. Also, for the devices which have
+	 * NoSoftRst-, it will help in restoring the original state
+	 * (saved locally in 'vdev->pm_save').
+	 */
+	vfio_pci_set_power_state(vdev, PCI_D0);
+	up_write(&vdev->memory_lock);
+
 	/*
 	 * If INTx is enabled, then mask INTx before going into the runtime
 	 * suspended state and unmask the same in the runtime resume.
@@ -386,6 +398,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	struct pci_dev *pdev = vdev->pdev;
 	struct vfio_pci_dummy_resource *dummy_res, *tmp;
 	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
+	bool do_resume = false;
 	int i, bar;
 
 	/* For needs_reset */
@@ -393,6 +406,25 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 
 	/*
 	 * This function can be invoked while the power state is non-D0.
+	 * This non-D0 power state can be with or without runtime PM.
+	 * Increment the usage count corresponding to pm_runtime_put()
+	 * called during setting of 'pm_runtime_engaged'. The device will
+	 * wake up if it has already gone into the suspended state.
+	 * Otherwise, the next vfio_pci_set_power_state() will change the
+	 * device power state to D0.
+	 */
+	down_write(&vdev->memory_lock);
+	if (vdev->pm_runtime_engaged) {
+		vdev->pm_runtime_engaged = false;
+		pm_runtime_get_noresume(&pdev->dev);
+		do_resume = true;
+	}
+	up_write(&vdev->memory_lock);
+
+	if (do_resume)
+		pm_runtime_resume(&pdev->dev);
+
+	/*
 	 * This function calls __pci_reset_function_locked() which internally
 	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
 	 * fail if the power state is non-D0. Also, for the devices which
@@ -1190,6 +1222,99 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
 
+static int vfio_pci_pm_validate_flags(u32 flags)
+{
+	if (!flags)
+		return -EINVAL;
+	/* Only valid flags should be set */
+	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
+		return -EINVAL;
+	/* Both enter and exit should not be set */
+	if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
+	    (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
+				    void __user *arg, size_t argsz)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_device_feature_power_management vfio_pm = { 0 };
+	int ret = 0;
+
+	ret = vfio_check_feature(flags, argsz,
+				 VFIO_DEVICE_FEATURE_SET |
+				 VFIO_DEVICE_FEATURE_GET,
+				 sizeof(vfio_pm));
+	if (ret != 1)
+		return ret;
+
+	if (flags & VFIO_DEVICE_FEATURE_GET) {
+		down_read(&vdev->memory_lock);
+		if (vdev->pm_runtime_engaged)
+			vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
+		else
+			vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
+		up_read(&vdev->memory_lock);
+
+		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
+			return -EFAULT;
+
+		return 0;
+	}
+
+	if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm)))
+		return -EFAULT;
+
+	ret = vfio_pci_pm_validate_flags(vfio_pm.flags);
+	if (ret)
+		return ret;
+
+	/*
+	 * The vdev power related flags are protected with 'memory_lock'
+	 * semaphore.
+	 */
+	if (vfio_pm.flags & VFIO_PM_LOW_POWER_ENTER) {
+		vfio_pci_zap_and_down_write_memory_lock(vdev);
+		if (vdev->pm_runtime_engaged) {
+			up_write(&vdev->memory_lock);
+			return -EINVAL;
+		}
+
+		vdev->pm_runtime_engaged = true;
+		up_write(&vdev->memory_lock);
+		pm_runtime_put(&pdev->dev);
+	} else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
+		down_write(&vdev->memory_lock);
+		if (!vdev->pm_runtime_engaged) {
+			up_write(&vdev->memory_lock);
+			return -EINVAL;
+		}
+
+		vdev->pm_runtime_engaged = false;
+		pm_runtime_get_noresume(&pdev->dev);
+		up_write(&vdev->memory_lock);
+		ret = pm_runtime_resume(&pdev->dev);
+		if (ret < 0) {
+			down_write(&vdev->memory_lock);
+			if (!vdev->pm_runtime_engaged) {
+				vdev->pm_runtime_engaged = true;
+				pm_runtime_put_noidle(&pdev->dev);
+			}
+			up_write(&vdev->memory_lock);
+			return ret;
+		}
+	} else {
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
 				       void __user *arg, size_t argsz)
 {
@@ -1224,6 +1349,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
 		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT:
+		return vfio_pci_core_feature_pm(device, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
@@ -1234,31 +1361,47 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite)
 {
 	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	int ret;
 
 	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
 		return -EINVAL;
 
+	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
+	if (ret < 0) {
+		pci_info_ratelimited(vdev->pdev, "runtime resume failed %d\n",
+				     ret);
+		return -EIO;
+	}
+
 	switch (index) {
 	case VFIO_PCI_CONFIG_REGION_INDEX:
-		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
+		ret = vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
+		break;
 
 	case VFIO_PCI_ROM_REGION_INDEX:
 		if (iswrite)
-			return -EINVAL;
-		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
+			ret = -EINVAL;
+		else
+			ret = vfio_pci_bar_rw(vdev, buf, count, ppos, false);
+		break;
 
 	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
-		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
+		ret = vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
+		break;
 
 	case VFIO_PCI_VGA_REGION_INDEX:
-		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
+		ret = vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
+		break;
+
 	default:
 		index -= VFIO_PCI_NUM_REGIONS;
-		return vdev->region[index].ops->rw(vdev, buf,
+		ret = vdev->region[index].ops->rw(vdev, buf,
 						   count, ppos, iswrite);
+		break;
 	}
 
-	return -EINVAL;
+	pm_runtime_put(&vdev->pdev->dev);
+	return ret;
 }
 
 ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
@@ -2157,6 +2300,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		goto err_unlock;
 	}
 
+	/*
+	 * Some of the devices in the dev_set can be in the runtime suspended
+	 * state. Increment the usage count for all the devices in the dev_set
+	 * before reset and decrement the same after reset.
+	 */
+	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
+	if (ret)
+		goto err_unlock;
+
 	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
 		/*
 		 * Test whether all the affected devices are contained by the
@@ -2212,6 +2364,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		else
 			mutex_unlock(&cur->vma_lock);
 	}
+
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
+		pm_runtime_put(&cur->pdev->dev);
 err_unlock:
 	mutex_unlock(&dev_set->lock);
 	return ret;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index cdfd328ba6b1..bf4823b008f9 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -125,6 +125,7 @@ struct vfio_pci_core_device {
 	bool			nointx;
 	bool			needs_pm_restore;
 	bool			pm_intx_masked;
+	bool			pm_runtime_engaged;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver
  2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
                   ` (3 preceding siblings ...)
  2022-07-01 11:08 ` [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
@ 2022-07-01 11:08 ` Abhishek Sahu
  2022-07-06 15:40   ` Alex Williamson
  2022-07-01 11:08 ` [PATCH v4 6/6] vfio/pci: Add support for virtual PME Abhishek Sahu
  5 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-01 11:08 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

Some devices (like NVIDIA VGA or 3D controllers) require driver
involvement each time before going into D3cold. In the regular flow,
the guest driver does all the required steps inside the guest OS and
then the hypervisor invokes the IOCTL for D3cold entry. Now, if there
is any activity on the host side (for example, the user runs lspci or
dumps the config space through sysfs), then the runtime PM framework
will resume the device first, perform the operation, and then suspend
the device again. This second suspend happens without guest driver
involvement. This patch adds support for preventing this second
runtime suspend after such a wake-up. The prevention is based either
on a predefined vendor/class ID list or on a flag
(VFIO_PM_LOW_POWER_REENTERY_DISABLE) that the user can specify at
entry time.

The 'pm_runtime_reentry_allowed' flag tracks whether this re-entry is
allowed. It will be set at entry time.

The 'pm_runtime_resumed' flag tracks whether a wake-up happened before
the guest itself performed one. If re-entry is not allowed, then during
runtime resume, the runtime PM usage count will be incremented and this
flag will be set. The flag is checked during guest D3cold exit, and the
runtime PM-related handling is skipped if it is set.

During guest low power exit, all vdev power-related flags are accessed
under 'memory_lock' and the usage count will be incremented. The
resume will be triggered after releasing the lock, since the runtime
resume callback itself requires the lock. pm_runtime_get_noresume()/
pm_runtime_resume() have been used instead of
pm_runtime_resume_and_get() to handle the following race condition:

 a. The guest triggers the low power exit.
 b. The guest thread takes the lock, clears the vdev-related flags,
    and releases the lock.
 c. Before pm_runtime_resume_and_get() runs, the host lspci thread
    gets scheduled and triggers the runtime resume.
 d. Since all the vdev-related flags are now cleared, there is no
    extra handling inside the runtime resume.
 e. The runtime PM framework puts the device back into the suspended
    state.
 f. The guest's pm_runtime_resume_and_get() finally gets called.

So, at step (e), the suspend happens without guest driver involvement.
By calling pm_runtime_get_noresume() before releasing 'memory_lock',
the runtime PM framework can't suspend the device, due to the
incremented usage count.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 87 ++++++++++++++++++++++++++++++--
 include/linux/vfio_pci_core.h    |  2 +
 2 files changed, 84 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 8c17ca41d156..1ddaaa6ccef5 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -191,6 +191,20 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
 	return false;
 }
 
+static bool vfio_pci_low_power_reentry_allowed(struct pci_dev *pdev)
+{
+	/*
+	 * The NVIDIA display class requires driver involvement for every
+	 * D3cold entry. The audio and other classes can go into D3cold
+	 * without driver involvement.
+	 */
+	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
+	    ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY))
+		return false;
+
+	return true;
+}
+
 static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -295,6 +309,27 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
 	if (vdev->pm_intx_masked)
 		vfio_pci_intx_unmask(vdev);
 
+	down_write(&vdev->memory_lock);
+
+	/*
+	 * The runtime resume callback will be called for one of the following
+	 * two cases:
+	 *
+	 * - If the user has called IOCTL explicitly to move the device out of
+	 *   the low power state or closed the device.
+	 * - If there is device access on the host side.
+	 *
+	 * For the second case, check if re-entry to the low power state is
+	 * allowed. If not, then increment the usage count so that runtime PM
+	 * framework won't suspend the device and set the 'pm_runtime_resumed'
+	 * flag.
+	 */
+	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
+		pm_runtime_get_noresume(dev);
+		vdev->pm_runtime_resumed = true;
+	}
+	up_write(&vdev->memory_lock);
+
 	return 0;
 }
 #endif /* CONFIG_PM */
@@ -415,9 +450,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	 */
 	down_write(&vdev->memory_lock);
 	if (vdev->pm_runtime_engaged) {
+		if (!vdev->pm_runtime_resumed) {
+			pm_runtime_get_noresume(&pdev->dev);
+			do_resume = true;
+		}
+		vdev->pm_runtime_resumed = false;
 		vdev->pm_runtime_engaged = false;
-		pm_runtime_get_noresume(&pdev->dev);
-		do_resume = true;
 	}
 	up_write(&vdev->memory_lock);
 
@@ -1227,12 +1265,17 @@ static int vfio_pci_pm_validate_flags(u32 flags)
 	if (!flags)
 		return -EINVAL;
 	/* Only valid flags should be set */
-	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
+	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT |
+		      VFIO_PM_LOW_POWER_REENTERY_DISABLE))
 		return -EINVAL;
 	/* Both enter and exit should not be set */
 	if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
 	    (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
 		return -EINVAL;
+	/* re-entry disable can only be set with enter */
+	if ((flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE) &&
+	    !(flags & VFIO_PM_LOW_POWER_ENTER))
+		return -EINVAL;
 
 	return 0;
 }
@@ -1255,10 +1298,17 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
 
 	if (flags & VFIO_DEVICE_FEATURE_GET) {
 		down_read(&vdev->memory_lock);
-		if (vdev->pm_runtime_engaged)
+		if (vdev->pm_runtime_engaged) {
 			vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
-		else
+			if (!vdev->pm_runtime_reentry_allowed)
+				vfio_pm.flags |=
+					VFIO_PM_LOW_POWER_REENTERY_DISABLE;
+		} else {
 			vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
+			if (!vfio_pci_low_power_reentry_allowed(pdev))
+				vfio_pm.flags |=
+					VFIO_PM_LOW_POWER_REENTERY_DISABLE;
+		}
 		up_read(&vdev->memory_lock);
 
 		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
@@ -1286,6 +1336,19 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
 		}
 
 		vdev->pm_runtime_engaged = true;
+		vdev->pm_runtime_resumed = false;
+
+		/*
+		 * If there is any access when the device is in the runtime
+		 * suspended state, then the device will be resumed first
+		 * before access and then the device will be suspended again.
+		 * Check if this second time suspend is allowed and track the
+		 * same in 'pm_runtime_reentry_allowed' flag.
+		 */
+		vdev->pm_runtime_reentry_allowed =
+			vfio_pci_low_power_reentry_allowed(pdev) &&
+			!(vfio_pm.flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE);
+
 		up_write(&vdev->memory_lock);
 		pm_runtime_put(&pdev->dev);
 	} else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
@@ -1296,6 +1359,20 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
 		}
 
 		vdev->pm_runtime_engaged = false;
+		if (vdev->pm_runtime_resumed) {
+			vdev->pm_runtime_resumed = false;
+			up_write(&vdev->memory_lock);
+			return 0;
+		}
+
+		/*
+		 * The 'memory_lock' will be acquired again inside the runtime
+		 * resume callback. So, increment the usage count inside the
+		 * lock and call pm_runtime_resume() after releasing the lock.
+		 * If there is any race between a wake-up generated at the host
+		 * and the current path, then the incremented usage count will
+		 * prevent the device from going into the suspended state.
+		 */
 		pm_runtime_get_noresume(&pdev->dev);
 		up_write(&vdev->memory_lock);
 		ret = pm_runtime_resume(&pdev->dev);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index bf4823b008f9..18cc83b767b8 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -126,6 +126,8 @@ struct vfio_pci_core_device {
 	bool			needs_pm_restore;
 	bool			pm_intx_masked;
 	bool			pm_runtime_engaged;
+	bool			pm_runtime_resumed;
+	bool			pm_runtime_reentry_allowed;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v4 6/6] vfio/pci: Add support for virtual PME
  2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
                   ` (4 preceding siblings ...)
  2022-07-01 11:08 ` [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver Abhishek Sahu
@ 2022-07-01 11:08 ` Abhishek Sahu
  2022-07-06 15:40   ` Alex Williamson
  5 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-01 11:08 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

If a PCI device is in a low power state and requires wake-up, it can
generate a PME (Power Management Event). Typically these PME events
are propagated to the root port, which then generates a system
interrupt. The OS should then identify the device that generated the
PME and resume it.

We can implement a similar virtual PME framework: if the device has
already gone into the runtime suspended state and there is then any
wake-up on the host side, a virtual PME notification will be sent to
the guest. This virtual PME is helpful for the cases where the device
will not be suspended again after a host-triggered wake-up. Following
is the overall approach regarding the virtual PME.

1. Add one more event like VFIO_PCI_ERR_IRQ_INDEX named
   VFIO_PCI_PME_IRQ_INDEX and do the required code changes to get/set
   this new IRQ.

2. From the guest side, the guest needs to enable eventfd for the
   virtual PME notification.

3. In the vfio-pci driver, the PME support bits are currently
   virtualized and set to 0. We can set PME capability support for all
   the power states. This PME capability support is independent of the
   physical PME support.

4. The PME enable (PME_En bit in Power Management Control/Status
   Register) and PME status (PME_Status bit in Power Management
   Control/Status Register) are currently virtualized as well.
   Write support for the PME_En bit can be enabled.

5. The PME_Status bit is a write-1-clear bit where the write with
   zero value will have no effect and write with 1 value will clear the
   bit. The write for this bit will be trapped inside
   vfio_pm_config_write() similar to PCI_PM_CTRL write for PM_STATES.

6. When the host gets a request to resume the device from anything
   other than the low power exit feature IOCTL, then the PME_Status
   bit will be set.
   According to [PCIe v5 7.5.2.2],
     "PME_Status - This bit is Set when the Function would normally
      generate a PME signal. The value of this bit is not affected by
      the value of the PME_En bit."

   So even if the PME_En bit is not set, we can set the PME_Status bit.

7. If the guest has enabled PME_En and registered for PME events
   through eventfd, then the usage count will be incremented to prevent
   the device from going into the suspended state, and the guest will
   be notified through the eventfd trigger.

The virtual PME can also help in handling physical PME. When a
physical PME arrives, the runtime resume will be called as well, and
if the guest has registered for virtual PME, it will be sent in this
case too.

* Implementation for handling the virtual PME on the hypervisor:

Taking the Linux implementation as an example: during runtime suspend,
the kernel calls __pci_enable_wake(), which internally enables PME
through pci_pme_active() and also enables the ACPI-side wake-up
through platform_pci_set_wakeup(). To handle the PME, the hypervisor
has the following two options:

1. Create a virtual root port for the VFIO device and trigger an
   interrupt when the PME arrives. This will call pcie_pme_irq(),
   which will resume the device.

2. Create a virtual ACPI _PRW resource and associate it with the device
   itself. In _PRW, any GPE (General Purpose Event) can be assigned for
   the wake-up. When the PME arrives, the GPE can be triggered by the
   hypervisor. The GPE interrupt will internally call
   pci_acpi_wake_dev(), which will resume the device.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 39 +++++++++++++++++++++------
 drivers/vfio/pci/vfio_pci_core.c   | 43 ++++++++++++++++++++++++------
 drivers/vfio/pci/vfio_pci_intrs.c  | 18 +++++++++++++
 include/linux/vfio_pci_core.h      |  2 ++
 include/uapi/linux/vfio.h          |  1 +
 5 files changed, 87 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 21a4743d011f..a06375a03758 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -719,6 +719,20 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
 	if (count < 0)
 		return count;
 
+	/*
+	 * PME_STATUS is a write-1-clear bit. If the written value has the
+	 * bit set, clear it in vconfig. PME_STATUS is in the upper byte of
+	 * the control register, and the user may do a single-byte write.
+	 */
+	if (offset <= PCI_PM_CTRL + 1 && offset + count > PCI_PM_CTRL + 1) {
+		if (le32_to_cpu(val) &
+		    (PCI_PM_CTRL_PME_STATUS >> (offset - PCI_PM_CTRL) * 8)) {
+			__le16 *ctrl = (__le16 *)&vdev->vconfig
+					[vdev->pm_cap_offset + PCI_PM_CTRL];
+			*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
+		}
+	}
+
 	if (offset == PCI_PM_CTRL) {
 		pci_power_t state;
 
@@ -771,14 +785,16 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
 	 * the user change power state, but we trap and initiate the
 	 * change ourselves, so the state bits are read-only.
 	 *
-	 * The guest can't process PME from D3cold so virtualize PME_Status
-	 * and PME_En bits. The vconfig bits will be cleared during device
-	 * capability initialization.
+	 * The guest can't process physical PME from D3cold so virtualize
+	 * PME_Status and PME_En bits. These bits will be used for the
+	 * virtual PME between host and guest. The vconfig bits will be
+	 * updated during device capability initialization. PME_Status is
+	 * write-1-clear bit, so it is read-only. We trap and update the
+	 * vconfig bit manually during write.
 	 */
 	p_setd(perm, PCI_PM_CTRL,
 	       PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
-	       ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
-		 PCI_PM_CTRL_STATE_MASK));
+	       ~(PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_PME_STATUS));
 
 	return 0;
 }
@@ -1454,8 +1470,13 @@ static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
 	__le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
 	__le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];
 
-	/* Clear vconfig PME_Support, PME_Status, and PME_En bits */
-	*pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
+	/*
+	 * Set the vconfig PME_Support bits. The PME_Status is being used for
+	 * virtual PME support and is not dependent upon the physical
+	 * PME support.
+	 */
+	*pmc |= cpu_to_le16(PCI_PM_CAP_PME_MASK);
+	/* Clear vconfig PME_Status and PME_En bits */
 	*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
 }
 
@@ -1582,8 +1603,10 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
 		if (ret)
 			return ret;
 
-		if (cap == PCI_CAP_ID_PM)
+		if (cap == PCI_CAP_ID_PM) {
+			vdev->pm_cap_offset = pos;
 			vfio_update_pm_vconfig_bytes(vdev, pos);
+		}
 
 		prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
 		pos = next;
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 1ddaaa6ccef5..6c1225bc2aeb 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -319,14 +319,35 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
 	 *   the low power state or closed the device.
 	 * - If there is device access on the host side.
 	 *
-	 * For the second case, check if re-entry to the low power state is
-	 * allowed. If not, then increment the usage count so that runtime PM
-	 * framework won't suspend the device and set the 'pm_runtime_resumed'
-	 * flag.
+	 * For the second case:
+	 * - The virtual PME_STATUS bit will be set. If PME_ENABLE bit is set
+	 *   and user has registered for virtual PME events, then send the PME
+	 *   virtual PME event.
+	 * - Check if re-entry to the low power state is not allowed.
+	 *
+	 * For the above conditions, increment the usage count so that
+	 * runtime PM framework won't suspend the device and set the
+	 * 'pm_runtime_resumed' flag.
 	 */
-	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
-		pm_runtime_get_noresume(dev);
-		vdev->pm_runtime_resumed = true;
+	if (vdev->pm_runtime_engaged) {
+		bool pme_triggered = false;
+		__le16 *ctrl = (__le16 *)&vdev->vconfig
+				[vdev->pm_cap_offset + PCI_PM_CTRL];
+
+		*ctrl |= cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
+		if (le16_to_cpu(*ctrl) & PCI_PM_CTRL_PME_ENABLE) {
+			mutex_lock(&vdev->igate);
+			if (vdev->pme_trigger) {
+				pme_triggered = true;
+				eventfd_signal(vdev->pme_trigger, 1);
+			}
+			mutex_unlock(&vdev->igate);
+		}
+
+		if (!vdev->pm_runtime_reentry_allowed || pme_triggered) {
+			pm_runtime_get_noresume(dev);
+			vdev->pm_runtime_resumed = true;
+		}
 	}
 	up_write(&vdev->memory_lock);
 
@@ -586,6 +607,10 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
 		eventfd_ctx_put(vdev->req_trigger);
 		vdev->req_trigger = NULL;
 	}
+	if (vdev->pme_trigger) {
+		eventfd_ctx_put(vdev->pme_trigger);
+		vdev->pme_trigger = NULL;
+	}
 	mutex_unlock(&vdev->igate);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
@@ -639,7 +664,8 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
 	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
 		if (pci_is_pcie(vdev->pdev))
 			return 1;
-	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
+	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX ||
+		   irq_type == VFIO_PCI_PME_IRQ_INDEX) {
 		return 1;
 	}
 
@@ -985,6 +1011,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		switch (info.index) {
 		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
 		case VFIO_PCI_REQ_IRQ_INDEX:
+		case VFIO_PCI_PME_IRQ_INDEX:
 			break;
 		case VFIO_PCI_ERR_IRQ_INDEX:
 			if (pci_is_pcie(vdev->pdev))
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 1a37db99df48..db4180687a74 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -639,6 +639,17 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_core_device *vdev,
 					       count, flags, data);
 }
 
+static int vfio_pci_set_pme_trigger(struct vfio_pci_core_device *vdev,
+				    unsigned index, unsigned start,
+				    unsigned count, uint32_t flags, void *data)
+{
+	if (index != VFIO_PCI_PME_IRQ_INDEX || start != 0 || count > 1)
+		return -EINVAL;
+
+	return vfio_pci_set_ctx_trigger_single(&vdev->pme_trigger,
+					       count, flags, data);
+}
+
 int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
 			    unsigned index, unsigned start, unsigned count,
 			    void *data)
@@ -688,6 +699,13 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
 			break;
 		}
 		break;
+	case VFIO_PCI_PME_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			func = vfio_pci_set_pme_trigger;
+			break;
+		}
+		break;
 	}
 
 	if (!func)
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 18cc83b767b8..ee2646d820c2 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -102,6 +102,7 @@ struct vfio_pci_core_device {
 	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
 	u8			*pci_config_map;
 	u8			*vconfig;
+	u8			pm_cap_offset;
 	struct perm_bits	*msi_perm;
 	spinlock_t		irqlock;
 	struct mutex		igate;
@@ -133,6 +134,7 @@ struct vfio_pci_core_device {
 	int			ioeventfds_nr;
 	struct eventfd_ctx	*err_trigger;
 	struct eventfd_ctx	*req_trigger;
+	struct eventfd_ctx	*pme_trigger;
 	struct list_head	dummy_resources_list;
 	struct mutex		ioeventfds_lock;
 	struct list_head	ioeventfds_list;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 7e00de5c21ea..08170950d655 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -621,6 +621,7 @@ enum {
 	VFIO_PCI_MSIX_IRQ_INDEX,
 	VFIO_PCI_ERR_IRQ_INDEX,
 	VFIO_PCI_REQ_IRQ_INDEX,
+	VFIO_PCI_PME_IRQ_INDEX,
 	VFIO_PCI_NUM_IRQS
 };
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend
  2022-07-01 11:08 ` [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
@ 2022-07-06 15:39   ` Alex Williamson
  2022-07-08  9:21     ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-06 15:39 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 1 Jul 2022 16:38:09 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> This patch adds INTx handling during runtime suspend/resume.
> All the suspend/resume related code for the user to put the device
> into the low power state will be added in subsequent patches.
> 
> The INTx are shared among devices. Whenever any INTx interrupt comes

"The INTx lines may be shared..."

> for the VFIO devices, then vfio_intx_handler() will be called for each
> device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx()

"...device sharing the interrupt."

> and checks if the interrupt has been generated for the current device.
> Now, if the device is already in the D3cold state, then the config space
> can not be read. Attempt to read config space in D3cold state can
> cause system unresponsiveness in a few systems. To prevent this, mask
> INTx in runtime suspend callback and unmask the same in runtime resume
> callback. If INTx has been already masked, then no handling is needed
> in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and
> vfio_pci_intx_mask() has been updated to return true if INTx has been
> masked inside this function.
> 
> For the runtime suspend which is triggered for the no user of VFIO
> device, the is_intx() will return false and these callbacks won't do
> anything.
> 
> The MSI/MSI-X are not shared so similar handling should not be
> needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal()
> without doing any device-specific config access. When the user performs
> any config access or IOCTL after receiving the eventfd notification,
> then the device will be moved to the D0 state first before
> servicing any request.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c  | 37 +++++++++++++++++++++++++++----
>  drivers/vfio/pci/vfio_pci_intrs.c |  6 ++++-
>  include/linux/vfio_pci_core.h     |  3 ++-
>  3 files changed, 40 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a0d69ddaf90d..5948d930449b 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	return ret;
>  }
>  
> +#ifdef CONFIG_PM
> +static int vfio_pci_core_runtime_suspend(struct device *dev)
> +{
> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> +
> +	/*
> +	 * If INTx is enabled, then mask INTx before going into the runtime
> +	 * suspended state and unmask the same in the runtime resume.
> +	 * If INTx has already been masked by the user, then
> +	 * vfio_pci_intx_mask() will return false and in that case, INTx
> +	 * should not be unmasked in the runtime resume.
> +	 */
> +	vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev));
> +
> +	return 0;
> +}
> +
> +static int vfio_pci_core_runtime_resume(struct device *dev)
> +{
> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> +
> +	if (vdev->pm_intx_masked)
> +		vfio_pci_intx_unmask(vdev);
> +
> +	return 0;
> +}
> +#endif /* CONFIG_PM */
> +
>  /*
> - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
> - * so use structure without any callbacks.
> - *
>   * The pci-driver core runtime PM routines always save the device state
>   * before going into suspended state. If the device is going into low power
>   * state with only with runtime PM ops, then no explicit handling is needed
>   * for the devices which have NoSoftRst-.
>   */
> -static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
> +static const struct dev_pm_ops vfio_pci_core_pm_ops = {
> +	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
> +			   vfio_pci_core_runtime_resume,
> +			   NULL)
> +};
>  
>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  {
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 6069a11fb51a..1a37db99df48 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
>  		eventfd_signal(vdev->ctx[0].trigger, 1);
>  }
>  
> -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
> +/* Returns true if INTx has been masked by this function. */
> +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
>  	unsigned long flags;
> +	bool intx_masked = false;
>  
>  	spin_lock_irqsave(&vdev->irqlock, flags);
>  
> @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>  			disable_irq_nosync(pdev->irq);
>  
>  		vdev->ctx[0].masked = true;
> +		intx_masked = true;
>  	}
>  
>  	spin_unlock_irqrestore(&vdev->irqlock, flags);
> +	return intx_masked;
>  }


There's certainly another path through this function that masks the
interrupt, which makes the definition of this return value a bit
confusing.  Wouldn't it be simpler not to overload the masked flag on
the interrupt context like this, and instead set a new flag on the vdev
under irqlock to indicate the device is unable to generate interrupts?
The irq handler would add a test of this flag before any tests that
would access the device.  Thanks,

Alex
 
>  /*
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 23c176d4b073..cdfd328ba6b1 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -124,6 +124,7 @@ struct vfio_pci_core_device {
>  	bool			needs_reset;
>  	bool			nointx;
>  	bool			needs_pm_restore;
> +	bool			pm_intx_masked;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;
> @@ -147,7 +148,7 @@ struct vfio_pci_core_device {
>  #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
>  #define irq_is(vdev, type) (vdev->irq_type == type)
>  
> -extern void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
> +extern bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
>  extern void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
>  
>  extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev,



* Re: [PATCH v4 2/6] vfio: Add a new device feature for the power management
  2022-07-01 11:08 ` [PATCH v4 2/6] vfio: Add a new device feature for the power management Abhishek Sahu
@ 2022-07-06 15:39   ` Alex Williamson
  2022-07-08  9:39     ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-06 15:39 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 1 Jul 2022 16:38:10 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
> for power management to the header file. The implementation will be
> added in subsequent patches.
> 
> Not all power states can be achieved with the standard registers
> alone; platform-based power management needs to be involved to reach
> the lowest power state. This device feature can be used for all such
> platform-based power management.
> 
> This device feature uses flags to specify the different operations;
> if more power management functionality is needed in the future, a new
> flag can be added. It supports both GET and SET operations.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 733a1cddde30..7e00de5c21ea 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>  };
>  
> +/*
> + * Perform power management-related operations for the VFIO device.
> + *
> + * The low power feature uses platform-based power management to move the
> + * device into the low power state.  This low power state is device-specific.
> + *
> + * This device feature uses flags to specify the different operations.
> + * It supports both the GET and SET operations.
> + *
> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
> + *   state with platform-based power management.  This low power state will be
> + *   internal to the VFIO driver and the user will not come to know which power
> + *   state is chosen.  Once the user has moved the VFIO device into the low
> + *   power state, then the user should not do any device access without moving
> + *   the device out of the low power state.

Except we're wrapping device accesses to make this possible.  This
should probably describe how any discrete access will wake the device
but ongoing access through mmaps will generate user faults.

> + *
> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
> + *    state.  This flag should only be set if the user has previously put the
> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.

Indenting.

> + *
> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
> + *
> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
> + *   the host side, then the device will be moved out of the low power state
> + *   without the user's guest driver involvement.  Some devices require the
> + *   user's guest driver involvement for each low-power entry.  If this flag is
> + *   set, then the re-entry to the low power state will be disabled, and the
> + *   host kernel will not move the device again into the low power state.
> + *   The VFIO driver internally maintains a list of devices for which low
> + *   power re-entry is disabled by default and for those devices, the
> + *   re-entry will be disabled even if the user has not set this flag
> + *   explicitly.

Wrong polarity.  The kernel should not maintain the policy.  By default
every wakeup, whether from host kernel accesses or via user accesses
that do a pm-get should signal a wakeup to userspace.  Userspace needs
to opt-out of that wakeup to let the kernel automatically re-enter low
power and userspace needs to maintain the policy for which devices it
wants that to occur.

> + *
> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
> + *
> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
> + *
> + * - If the device is in a normal power state currently, then
> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
> + *   power re-entry is disabled by default.  If the device is in the low power
> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
> + *   according to the current transition.

Very confusing semantics.

What if the feature SET ioctl took an eventfd and that eventfd was one
time use.  Calling the ioctl would setup the eventfd to notify the user
on wakeup and call pm-put.  Any access to the device via host, ioctl,
or region would be wrapped in pm-get/put and the pm-resume handler
would perform the matching pm-get to balance the feature SET and signal
the eventfd.  If the user opts-out by not providing a wakeup eventfd,
then the pm-resume handler does not perform a pm-get.  Possibly we
could even allow mmap access if a wake-up eventfd is provided.  The
feature GET ioctl would be used to exit low power behavior and would be
a no-op if the wakeup eventfd had already been signaled.  Thanks,

Alex

> + */
> +struct vfio_device_feature_power_management {
> +	__u32	flags;
> +#define VFIO_PM_LOW_POWER_ENTER			(1 << 0)
> +#define VFIO_PM_LOW_POWER_EXIT			(1 << 1)
> +#define VFIO_PM_LOW_POWER_REENTERY_DISABLE	(1 << 2)
> +	__u32	reserved;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**



* Re: [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call
  2022-07-01 11:08 ` [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
@ 2022-07-06 15:40   ` Alex Williamson
  2022-07-08  9:43     ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-06 15:40 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 1 Jul 2022 16:38:11 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> The vfio-pci based driver will have runtime power management
> support where the user can put the device into the low power state
> and then the PCI device can go into the D3cold state. If the device is
> in the low power state and the user issues any IOCTL, then the
> device should be moved out of the low power state first. Once
> the IOCTL is serviced, it can go into the low power state again.
> The runtime PM framework manages this with the help of a usage count.
> 
> One option was to add the runtime PM related APIs inside the vfio-pci
> driver, but some IOCTLs (like VFIO_DEVICE_FEATURE) can follow a
> different path and more IOCTLs can be added in the future. Also,
> runtime PM will be added for the vfio-pci based driver variants
> currently, but other VFIO based drivers can use the same support in
> the future. So, this patch adds the runtime PM API calls in the
> top-level IOCTL function itself.
>
> For the VFIO drivers which do not currently have runtime power
> management support, the runtime PM APIs won't be invoked. The usage
> count will be incremented and decremented through these APIs only
> for vfio-pci based drivers currently.

Variant drivers can easily opt-out of runtime pm support by performing
a gratuitous pm-get in their device-open function.
 
> Keeping this usage count incremented while servicing an IOCTL makes
> sure that the user can't put the device into the low power state while
> any other IOCTL is being serviced in parallel. Let's consider the
> following scenario:
> 
>  1. Some other IOCTL is called.
>  2. The user has opened another device instance and called the power
>     management IOCTL for the low power entry.
>  3. The power management IOCTL moves the device into the low power state.
>  4. The other IOCTL finishes.
> 
> If we don't keep the usage count incremented then the device
> access will happen between step 3 and 4 while the device has already
> gone into the low power state.
> 
> The runtime PM API's should not be invoked for
> VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
> the runtime power management entry and exit for the VFIO device.

I think the one-shot interface I proposed in the previous patch avoids
the need for special handling for these feature ioctls.  Thanks,

Alex
 
> The pm_runtime_resume_and_get() will be the first call, so its error
> should not be propagated to user space directly. For example,
> pm_runtime_resume_and_get() can return -EINVAL even in cases where the
> user has passed the correct arguments. So the
> pm_runtime_resume_and_get() errors have been masked behind -EIO.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/vfio.c | 82 ++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 74 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 61e71c1154be..61a8d9f7629a 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
>  #include <linux/vfio.h>
>  #include <linux/wait.h>
>  #include <linux/sched/signal.h>
> +#include <linux/pm_runtime.h>
>  #include "vfio.h"
>  
>  #define DRIVER_VERSION	"0.3"
> @@ -1333,6 +1334,39 @@ static const struct file_operations vfio_group_fops = {
>  	.release	= vfio_group_fops_release,
>  };
>  
> +/*
> + * Wrapper around pm_runtime_resume_and_get().
> + * Return error code on failure or 0 on success.
> + */
> +static inline int vfio_device_pm_runtime_get(struct vfio_device *device)
> +{
> +	struct device *dev = device->dev;
> +
> +	if (dev->driver && dev->driver->pm) {
> +		int ret;
> +
> +		ret = pm_runtime_resume_and_get(dev);
> +		if (ret < 0) {
> +			dev_info_ratelimited(dev,
> +				"vfio: runtime resume failed %d\n", ret);
> +			return -EIO;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Wrapper around pm_runtime_put().
> + */
> +static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
> +{
> +	struct device *dev = device->dev;
> +
> +	if (dev->driver && dev->driver->pm)
> +		pm_runtime_put(dev);
> +}
> +
>  /*
>   * VFIO Device fd
>   */
> @@ -1607,6 +1641,8 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>  {
>  	size_t minsz = offsetofend(struct vfio_device_feature, flags);
>  	struct vfio_device_feature feature;
> +	int ret = 0;
> +	u16 feature_cmd;
>  
>  	if (copy_from_user(&feature, arg, minsz))
>  		return -EFAULT;
> @@ -1626,28 +1662,51 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>  	    (feature.flags & VFIO_DEVICE_FEATURE_GET))
>  		return -EINVAL;
>  
> -	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
> +	feature_cmd = feature.flags & VFIO_DEVICE_FEATURE_MASK;
> +
> +	/*
> +	 * The VFIO_DEVICE_FEATURE_POWER_MANAGEMENT itself performs the runtime
> +	 * power management entry and exit for the VFIO device, so the runtime
> +	 * PM API's should not be called for this feature.
> +	 */
> +	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT) {
> +		ret = vfio_device_pm_runtime_get(device);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	switch (feature_cmd) {
>  	case VFIO_DEVICE_FEATURE_MIGRATION:
> -		return vfio_ioctl_device_feature_migration(
> +		ret = vfio_ioctl_device_feature_migration(
>  			device, feature.flags, arg->data,
>  			feature.argsz - minsz);
> +		break;
>  	case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE:
> -		return vfio_ioctl_device_feature_mig_device_state(
> +		ret = vfio_ioctl_device_feature_mig_device_state(
>  			device, feature.flags, arg->data,
>  			feature.argsz - minsz);
> +		break;
>  	default:
>  		if (unlikely(!device->ops->device_feature))
> -			return -EINVAL;
> -		return device->ops->device_feature(device, feature.flags,
> -						   arg->data,
> -						   feature.argsz - minsz);
> +			ret = -EINVAL;
> +		else
> +			ret = device->ops->device_feature(
> +				device, feature.flags, arg->data,
> +				feature.argsz - minsz);
> +		break;
>  	}
> +
> +	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT)
> +		vfio_device_pm_runtime_put(device);
> +
> +	return ret;
>  }
>  
>  static long vfio_device_fops_unl_ioctl(struct file *filep,
>  				       unsigned int cmd, unsigned long arg)
>  {
>  	struct vfio_device *device = filep->private_data;
> +	int ret;
>  
>  	switch (cmd) {
>  	case VFIO_DEVICE_FEATURE:
> @@ -1655,7 +1714,14 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
>  	default:
>  		if (unlikely(!device->ops->ioctl))
>  			return -EINVAL;
> -		return device->ops->ioctl(device, cmd, arg);
> +
> +		ret = vfio_device_pm_runtime_get(device);
> +		if (ret)
> +			return ret;
> +
> +		ret = device->ops->ioctl(device, cmd, arg);
> +		vfio_device_pm_runtime_put(device);
> +		return ret;
>  	}
>  }
>  



* Re: [PATCH v4 6/6] vfio/pci: Add support for virtual PME
  2022-07-01 11:08 ` [PATCH v4 6/6] vfio/pci: Add support for virtual PME Abhishek Sahu
@ 2022-07-06 15:40   ` Alex Williamson
  2022-07-08  9:45     ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-06 15:40 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 1 Jul 2022 16:38:14 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> If the PCI device is in a low power state and requires wake-up, then
> it can generate a PME (Power Management Event). Typically, these PME
> events are propagated to the root port, and then the root port
> generates a system interrupt. The OS should then identify the device
> which generated the PME and resume that device.
> 
> We can implement a similar virtual PME framework: if the device has
> already gone into the runtime suspended state and there is any wake-up
> on the host side, then a virtual PME notification will be sent to the
> guest. This virtual PME will be helpful for the cases where the device
> will not be suspended again after a host-triggered wake-up. Following
> is the overall approach regarding the virtual PME.
> 
> 1. Add one more event like VFIO_PCI_ERR_IRQ_INDEX named
>    VFIO_PCI_PME_IRQ_INDEX and do the required code changes to get/set
>    this new IRQ.
> 
> 2. From the guest side, the guest needs to enable eventfd for the
>    virtual PME notification.
> 
> 3. In the vfio-pci driver, the PME support bits are currently
>    virtualized and set to 0. We can set PME capability support for all
>    the power states. This PME capability support is independent of the
>    physical PME support.
> 
> 4. The PME enable (PME_En bit in Power Management Control/Status
>    Register) and PME status (PME_Status bit in Power Management
>    Control/Status Register) are also virtualized currently.
>    The write support for PME_En bit can be enabled.
> 
> 5. The PME_Status bit is a write-1-clear bit where the write with
>    zero value will have no effect and write with 1 value will clear the
>    bit. The write for this bit will be trapped inside
>    vfio_pm_config_write() similar to PCI_PM_CTRL write for PM_STATES.
> 
> 6. When the host gets a request for resuming the device other than from
>    low power exit feature IOCTL, then PME_Status bit will be set.
>    According to [PCIe v5 7.5.2.2],
>      "PME_Status - This bit is Set when the Function would normally
>       generate a PME signal. The value of this bit is not affected by
>       the value of the PME_En bit."
> 
>    So even if PME_En bit is not set, we can set PME_Status bit.
> 
> 7. If the guest has enabled PME_En and registered for PME events
>    through eventfd, then the usage count will be incremented to prevent
>    the device from going into the suspended state, and the guest will
>    be notified through the eventfd trigger.
> 
> The virtual PME can also help in handling physical PME. When a
> physical PME comes, the runtime resume will be called as well. If the
> guest has registered for virtual PME, then it will be sent in this
> case too.
> 
> * Implementation for handling the virtual PME on the hypervisor:
> 
> If we take the implementation in the Linux OS, then at runtime
> suspend time it calls __pci_enable_wake(), which internally enables
> PME through pci_pme_active() and also enables the ACPI side wake-up
> through platform_pci_set_wakeup(). To handle the PME, the hypervisor
> has the following two options:
> 
> 1. Create a virtual root port for the VFIO device and trigger
>    interrupt when the PME comes. It will call pcie_pme_irq() which will
>    resume the device.
> 
> 2. Create a virtual ACPI _PRW resource and associate it with the device
>    itself. In _PRW, any GPE (General Purpose Event) can be assigned for
>    the wake-up. When PME comes, then GPE can be triggered by the
>    hypervisor. GPE interrupt will call pci_acpi_wake_dev() function
>    internally and it will resume the device.

Do we really need to implement PME emulation in the kernel or is it
sufficient for userspace to simply register a one-shot eventfd when
SET'ing the low power feature and QEMU can provide the PME emulation
based on that signaling?  Thanks,

Alex

> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 39 +++++++++++++++++++++------
>  drivers/vfio/pci/vfio_pci_core.c   | 43 ++++++++++++++++++++++++------
>  drivers/vfio/pci/vfio_pci_intrs.c  | 18 +++++++++++++
>  include/linux/vfio_pci_core.h      |  2 ++
>  include/uapi/linux/vfio.h          |  1 +
>  5 files changed, 87 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 21a4743d011f..a06375a03758 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -719,6 +719,20 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
>  	if (count < 0)
>  		return count;
>  
> +	/*
> +	 * PME_STATUS is write-1-clear bit. If PME_STATUS is 1, then clear the
> +	 * bit in vconfig. The PME_STATUS is in the upper byte of the control
> +	 * register and user can do single byte write also.
> +	 */
> +	if (offset <= PCI_PM_CTRL + 1 && offset + count > PCI_PM_CTRL + 1) {
> +		if (le32_to_cpu(val) &
> +		    (PCI_PM_CTRL_PME_STATUS >> (offset - PCI_PM_CTRL) * 8)) {
> +			__le16 *ctrl = (__le16 *)&vdev->vconfig
> +					[vdev->pm_cap_offset + PCI_PM_CTRL];
> +			*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
> +		}
> +	}
> +
>  	if (offset == PCI_PM_CTRL) {
>  		pci_power_t state;
>  
> @@ -771,14 +785,16 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
>  	 * the user change power state, but we trap and initiate the
>  	 * change ourselves, so the state bits are read-only.
>  	 *
> -	 * The guest can't process PME from D3cold so virtualize PME_Status
> -	 * and PME_En bits. The vconfig bits will be cleared during device
> -	 * capability initialization.
> +	 * The guest can't process physical PME from D3cold so virtualize
> +	 * PME_Status and PME_En bits. These bits will be used for the
> +	 * virtual PME between host and guest. The vconfig bits will be
> +	 * updated during device capability initialization. PME_Status is
> +	 * write-1-clear bit, so it is read-only. We trap and update the
> +	 * vconfig bit manually during write.
>  	 */
>  	p_setd(perm, PCI_PM_CTRL,
>  	       PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
> -	       ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
> -		 PCI_PM_CTRL_STATE_MASK));
> +	       ~(PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_PME_STATUS));
>  
>  	return 0;
>  }
> @@ -1454,8 +1470,13 @@ static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
>  	__le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
>  	__le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];
>  
> -	/* Clear vconfig PME_Support, PME_Status, and PME_En bits */
> -	*pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
> +	/*
> +	 * Set the vconfig PME_Support bits. The PME_Status is being used for
> +	 * virtual PME support and is not dependent upon the physical
> +	 * PME support.
> +	 */
> +	*pmc |= cpu_to_le16(PCI_PM_CAP_PME_MASK);
> +	/* Clear vconfig PME_Status and PME_En bits */
>  	*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
>  }
>  
> @@ -1582,8 +1603,10 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
>  		if (ret)
>  			return ret;
>  
> -		if (cap == PCI_CAP_ID_PM)
> +		if (cap == PCI_CAP_ID_PM) {
> +			vdev->pm_cap_offset = pos;
>  			vfio_update_pm_vconfig_bytes(vdev, pos);
> +		}
>  
>  		prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
>  		pos = next;
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 1ddaaa6ccef5..6c1225bc2aeb 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -319,14 +319,35 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
>  	 *   the low power state or closed the device.
>  	 * - If there is device access on the host side.
>  	 *
> -	 * For the second case, check if re-entry to the low power state is
> -	 * allowed. If not, then increment the usage count so that runtime PM
> -	 * framework won't suspend the device and set the 'pm_runtime_resumed'
> -	 * flag.
> +	 * For the second case:
> +	 * - The virtual PME_STATUS bit will be set. If PME_ENABLE bit is set
> +	 *   and user has registered for virtual PME events, then send the PME
> +	 *   virtual PME event.
> +	 * - Check if re-entry to the low power state is not allowed.
> +	 *
> +	 * For the above conditions, increment the usage count so that
> +	 * runtime PM framework won't suspend the device and set the
> +	 * 'pm_runtime_resumed' flag.
>  	 */
> -	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
> -		pm_runtime_get_noresume(dev);
> -		vdev->pm_runtime_resumed = true;
> +	if (vdev->pm_runtime_engaged) {
> +		bool pme_triggered = false;
> +		__le16 *ctrl = (__le16 *)&vdev->vconfig
> +				[vdev->pm_cap_offset + PCI_PM_CTRL];
> +
> +		*ctrl |= cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
> +		if (le16_to_cpu(*ctrl) & PCI_PM_CTRL_PME_ENABLE) {
> +			mutex_lock(&vdev->igate);
> +			if (vdev->pme_trigger) {
> +				pme_triggered = true;
> +				eventfd_signal(vdev->pme_trigger, 1);
> +			}
> +			mutex_unlock(&vdev->igate);
> +		}
> +
> +		if (!vdev->pm_runtime_reentry_allowed || pme_triggered) {
> +			pm_runtime_get_noresume(dev);
> +			vdev->pm_runtime_resumed = true;
> +		}
>  	}
>  	up_write(&vdev->memory_lock);
>  
> @@ -586,6 +607,10 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>  		eventfd_ctx_put(vdev->req_trigger);
>  		vdev->req_trigger = NULL;
>  	}
> +	if (vdev->pme_trigger) {
> +		eventfd_ctx_put(vdev->pme_trigger);
> +		vdev->pme_trigger = NULL;
> +	}
>  	mutex_unlock(&vdev->igate);
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
> @@ -639,7 +664,8 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>  	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
>  		if (pci_is_pcie(vdev->pdev))
>  			return 1;
> -	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
> +	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX ||
> +		   irq_type == VFIO_PCI_PME_IRQ_INDEX) {
>  		return 1;
>  	}
>  
> @@ -985,6 +1011,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  		switch (info.index) {
>  		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
>  		case VFIO_PCI_REQ_IRQ_INDEX:
> +		case VFIO_PCI_PME_IRQ_INDEX:
>  			break;
>  		case VFIO_PCI_ERR_IRQ_INDEX:
>  			if (pci_is_pcie(vdev->pdev))
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 1a37db99df48..db4180687a74 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -639,6 +639,17 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_core_device *vdev,
>  					       count, flags, data);
>  }
>  
> +static int vfio_pci_set_pme_trigger(struct vfio_pci_core_device *vdev,
> +				    unsigned index, unsigned start,
> +				    unsigned count, uint32_t flags, void *data)
> +{
> +	if (index != VFIO_PCI_PME_IRQ_INDEX || start != 0 || count > 1)
> +		return -EINVAL;
> +
> +	return vfio_pci_set_ctx_trigger_single(&vdev->pme_trigger,
> +					       count, flags, data);
> +}
> +
>  int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
>  			    unsigned index, unsigned start, unsigned count,
>  			    void *data)
> @@ -688,6 +699,13 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
>  			break;
>  		}
>  		break;
> +	case VFIO_PCI_PME_IRQ_INDEX:
> +		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
> +		case VFIO_IRQ_SET_ACTION_TRIGGER:
> +			func = vfio_pci_set_pme_trigger;
> +			break;
> +		}
> +		break;
>  	}
>  
>  	if (!func)
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 18cc83b767b8..ee2646d820c2 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -102,6 +102,7 @@ struct vfio_pci_core_device {
>  	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
>  	u8			*pci_config_map;
>  	u8			*vconfig;
> +	u8			pm_cap_offset;
>  	struct perm_bits	*msi_perm;
>  	spinlock_t		irqlock;
>  	struct mutex		igate;
> @@ -133,6 +134,7 @@ struct vfio_pci_core_device {
>  	int			ioeventfds_nr;
>  	struct eventfd_ctx	*err_trigger;
>  	struct eventfd_ctx	*req_trigger;
> +	struct eventfd_ctx	*pme_trigger;
>  	struct list_head	dummy_resources_list;
>  	struct mutex		ioeventfds_lock;
>  	struct list_head	ioeventfds_list;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 7e00de5c21ea..08170950d655 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -621,6 +621,7 @@ enum {
>  	VFIO_PCI_MSIX_IRQ_INDEX,
>  	VFIO_PCI_ERR_IRQ_INDEX,
>  	VFIO_PCI_REQ_IRQ_INDEX,
> +	VFIO_PCI_PME_IRQ_INDEX,
>  	VFIO_PCI_NUM_IRQS
>  };
>  



* Re: [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver
  2022-07-01 11:08 ` [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver Abhishek Sahu
@ 2022-07-06 15:40   ` Alex Williamson
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Williamson @ 2022-07-06 15:40 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 1 Jul 2022 16:38:13 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> Some devices (like NVIDIA VGA or 3D controllers) require driver
> involvement each time before going into D3cold. In the regular flow,
> the guest driver does all the required steps inside the guest OS and
> then the hypervisor calls the IOCTL for D3cold entry. Now, if there is
> any activity on the host side (for example, the user has run lspci,
> dumped the config space through sysfs, etc.), then the runtime PM
> framework will resume the device first, perform the operation, and
> then suspend the device again. This second suspend happens without
> guest driver involvement. This patch adds support for preventing the
> second runtime suspend if there is any wake-up. This prevention is
> either based on a predefined vendor/class ID list, or the user can
> specify the flag (VFIO_PM_LOW_POWER_REENTERY_DISABLE) during entry for
> the same.
> 
> 'pm_runtime_reentry_allowed' flag tracks if this re-entry is allowed.
> It will be set during the entry time.
> 
> 'pm_runtime_resumed' flag tracks whether there was any wake-up before
> the guest itself performs the wake-up. If re-entry is not allowed, then
> during runtime resume, the runtime PM count will be incremented and
> this flag will be set. This flag will be checked during guest D3cold
> exit, and the runtime PM-related handling will be skipped if it is set.
> 
> During guest low power exit time, all vdev power-related flags are
> accessed under 'memory_lock' and the usage count will be incremented.
> The resume will be triggered after releasing the lock, since the
> runtime resume callback again requires the lock. pm_runtime_get_noresume()/
> pm_runtime_resume() have been used instead of
> pm_runtime_resume_and_get() to handle the following race scenario.
> 
>  a. The guest triggered the low power exit.
>  b. The guest thread took the lock, cleared the vdev-related
>     flags, and released the lock.
>  c. Before pm_runtime_resume_and_get(), the host lspci thread got
>     scheduled and triggered the runtime resume.
>  d. Now, all the vdev-related flags are cleared, so there won't be
>     any extra handling inside the runtime resume.
>  e. The runtime PM framework put the device back into the suspended
>     state.
>  f. The guest-triggered pm_runtime_resume_and_get() finally got called.
> 
> So, at step (e), the suspend happens without guest driver
> involvement. By calling pm_runtime_get_noresume() before releasing
> 'memory_lock', the runtime PM framework cannot suspend the device
> because of the incremented usage count.

Nak, this policy should be implemented in userspace.  Thanks,

Alex
 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 87 ++++++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h    |  2 +
>  2 files changed, 84 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 8c17ca41d156..1ddaaa6ccef5 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -191,6 +191,20 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
>  	return false;
>  }
>  
> +static bool vfio_pci_low_power_reentry_allowed(struct pci_dev *pdev)
> +{
> +	/*
> +	 * The NVIDIA display class requires driver involvement for every
> +	 * D3cold entry. The audio and other classes can go into D3cold
> +	 * without driver involvement.
> +	 */
> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> +	    ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> +		return false;
> +
> +	return true;
> +}
> +
>  static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
> @@ -295,6 +309,27 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
>  	if (vdev->pm_intx_masked)
>  		vfio_pci_intx_unmask(vdev);
>  
> +	down_write(&vdev->memory_lock);
> +
> +	/*
> +	 * The runtime resume callback will be called for one of the following
> +	 * two cases:
> +	 *
> +	 * - If the user has called IOCTL explicitly to move the device out of
> +	 *   the low power state or closed the device.
> +	 * - If there is device access on the host side.
> +	 *
> +	 * For the second case, check if re-entry to the low power state is
> +	 * allowed. If not, then increment the usage count so that the
> +	 * runtime PM framework won't suspend the device, and set the
> +	 * 'pm_runtime_resumed' flag.
> +	 */
> +	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
> +		pm_runtime_get_noresume(dev);
> +		vdev->pm_runtime_resumed = true;
> +	}
> +	up_write(&vdev->memory_lock);
> +
>  	return 0;
>  }
>  #endif /* CONFIG_PM */
> @@ -415,9 +450,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	 */
>  	down_write(&vdev->memory_lock);
>  	if (vdev->pm_runtime_engaged) {
> +		if (!vdev->pm_runtime_resumed) {
> +			pm_runtime_get_noresume(&pdev->dev);
> +			do_resume = true;
> +		}
> +		vdev->pm_runtime_resumed = false;
>  		vdev->pm_runtime_engaged = false;
> -		pm_runtime_get_noresume(&pdev->dev);
> -		do_resume = true;
>  	}
>  	up_write(&vdev->memory_lock);
>  
> @@ -1227,12 +1265,17 @@ static int vfio_pci_pm_validate_flags(u32 flags)
>  	if (!flags)
>  		return -EINVAL;
>  	/* Only valid flags should be set */
> -	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> +	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT |
> +		      VFIO_PM_LOW_POWER_REENTERY_DISABLE))
>  		return -EINVAL;
>  	/* Both enter and exit should not be set */
>  	if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
>  	    (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
>  		return -EINVAL;
> +	/* re-entry disable can only be set with enter */
> +	if ((flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE) &&
> +	    !(flags & VFIO_PM_LOW_POWER_ENTER))
> +		return -EINVAL;
>  
>  	return 0;
>  }
> @@ -1255,10 +1298,17 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
>  
>  	if (flags & VFIO_DEVICE_FEATURE_GET) {
>  		down_read(&vdev->memory_lock);
> -		if (vdev->pm_runtime_engaged)
> +		if (vdev->pm_runtime_engaged) {
>  			vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
> -		else
> +			if (!vdev->pm_runtime_reentry_allowed)
> +				vfio_pm.flags |=
> +					VFIO_PM_LOW_POWER_REENTERY_DISABLE;
> +		} else {
>  			vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
> +			if (!vfio_pci_low_power_reentry_allowed(pdev))
> +				vfio_pm.flags |=
> +					VFIO_PM_LOW_POWER_REENTERY_DISABLE;
> +		}
>  		up_read(&vdev->memory_lock);
>  
>  		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
> @@ -1286,6 +1336,19 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
>  		}
>  
>  		vdev->pm_runtime_engaged = true;
> +		vdev->pm_runtime_resumed = false;
> +
> +		/*
> +		 * If there is any access when the device is in the runtime
> +		 * suspended state, then the device will be resumed first
> +		 * before access and then the device will be suspended again.
> +		 * Check if this second suspend is allowed and track that in
> +		 * the 'pm_runtime_reentry_allowed' flag.
> +		 */
> +		vdev->pm_runtime_reentry_allowed =
> +			vfio_pci_low_power_reentry_allowed(pdev) &&
> +			!(vfio_pm.flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE);
> +
>  		up_write(&vdev->memory_lock);
>  		pm_runtime_put(&pdev->dev);
>  	} else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
> @@ -1296,6 +1359,20 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
>  		}
>  
>  		vdev->pm_runtime_engaged = false;
> +		if (vdev->pm_runtime_resumed) {
> +			vdev->pm_runtime_resumed = false;
> +			up_write(&vdev->memory_lock);
> +			return 0;
> +		}
> +
> +		/*
> +		 * The 'memory_lock' will be acquired again inside the runtime
> +		 * resume callback. So, increment the usage count inside the
> +		 * lock and call pm_runtime_resume() after releasing the lock.
> +		 * If a wake-up generated at the host races with the current
> +		 * path, the incremented usage count will prevent the device
> +		 * from going into the suspended state.
> +		 */
>  		pm_runtime_get_noresume(&pdev->dev);
>  		up_write(&vdev->memory_lock);
>  		ret = pm_runtime_resume(&pdev->dev);
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index bf4823b008f9..18cc83b767b8 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -126,6 +126,8 @@ struct vfio_pci_core_device {
>  	bool			needs_pm_restore;
>  	bool			pm_intx_masked;
>  	bool			pm_runtime_engaged;
> +	bool			pm_runtime_resumed;
> +	bool			pm_runtime_reentry_allowed;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state
  2022-07-01 11:08 ` [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
@ 2022-07-06 15:40   ` Alex Williamson
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Williamson @ 2022-07-06 15:40 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 1 Jul 2022 16:38:12 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> Currently, if runtime power management is enabled for vfio-pci
> based devices in the guest OS, then the guest OS writes the
> PCI_PM_CTRL register. This write request is handled in
> vfio_pm_config_write(), which performs the actual register write of
> PCI_PM_CTRL. With this, D3hot is the deepest low power state that
> can be achieved. If we use the runtime PM framework instead, then
> we can achieve the D3cold state (on supported systems), which helps
> in saving maximum power.
> 
> 1. D3cold state can't be achieved by writing PCI standard
>    PM config registers. This patch implements the newly added
>    'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' device feature which
>    can be used for putting the device into the D3cold state.
> 
> 2. The hypervisors can implement virtual ACPI methods. For example,
>    in a guest Linux OS, if the PCI device's ACPI node has _PR3 and
>    _PR0 power resources with _ON/_OFF methods, then the guest Linux
>    OS invokes the _OFF method during the D3cold transition and _ON
>    during the D0 transition. The hypervisor can intercept these
>    virtual ACPI calls and then call
>    'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' with the respective flags.
> 
> 3. The vfio-pci driver uses runtime PM framework to achieve the
>    D3cold state. For the D3cold transition, decrement the usage count and
>    for the D0 transition, increment the usage count.
> 
> 4. If the D3cold state is not supported, then the device will
>    still be in the D3hot state. But with the runtime PM, the root port
>    can now also go into the suspended state.
> 
> 5. The 'pm_runtime_engaged' flag tracks the entry and exit to
>    runtime PM. This flag is protected with 'memory_lock' semaphore.
> 
> 6. At exit time, the flag clearing and usage count increment
>    are protected with 'memory_lock'. The actual wake-up happens
>    outside 'memory_lock', since 'memory_lock' will also be needed
>    inside the runtime_resume callback in subsequent patches.
> 
> 7. In D3cold, all kinds of device-related access (BAR read/write,
>    config read/write, etc.) need to be disabled. For BAR-related access,
>    we can use existing D3hot memory disable support. During the low power
>    entry, invalidate the mmap mappings and add the check for
>    'pm_runtime_engaged' flag.

Not disabled, just wrapped in pm-get/put.  If the device is indefinitely
in low-power without a wake-up eventfd, mmap faults are fatal to the
user.
 
> 8. For config space, ideally, we need to return an error whenever
>    there is any config access from the user side once the user moved the
>    device into low power state. But adding a check for
>    'pm_runtime_engaged' flag alone won't be sufficient, due to the
>    following possible user-side scenario where a config space
>    access happens in parallel with the low power entry IOCTL.
> 
>    a. Config space access happens and vfio_pci_config_rw() will be
>       called.
>    b. The IOCTL to move into low power state is called.
>    c. The IOCTL will move the device into d3cold.
>    d. Exit from vfio_pci_config_rw() happened.
> 
>    Now, if we just check 'pm_runtime_engaged', then in the above
>    sequence the config space access will happen when the device is
>    already in the low power state. To prevent this situation, we
>    increment the usage count before any config space access and
>    decrement it after completing the access. Also, to prevent any
>    similar cases for other types of access, the usage count will be
>    incremented for all kinds of access.

Unnecessary, just wrap in pm-get/put.  Thanks,

Alex

> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_config.c |   2 +-
>  drivers/vfio/pci/vfio_pci_core.c   | 169 +++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h      |   1 +
>  3 files changed, 164 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 9343f597182d..21a4743d011f 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -408,7 +408,7 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
>  	 * PF SR-IOV capability, there's therefore no need to trigger
>  	 * faults based on the virtual value.
>  	 */
> -	return pdev->current_state < PCI_D3hot &&
> +	return !vdev->pm_runtime_engaged && pdev->current_state < PCI_D3hot &&
>  	       (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY));
>  }
>  
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 5948d930449b..8c17ca41d156 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -264,6 +264,18 @@ static int vfio_pci_core_runtime_suspend(struct device *dev)
>  {
>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>  
> +	down_write(&vdev->memory_lock);
> +	/*
> +	 * The user can move the device into D3hot state before invoking
> +	 * power management IOCTL. Move the device into D0 state here and then
> +	 * the pci-driver core runtime PM suspend function will move the device
> +	 * into the low power state. Also, for the devices which have
> +	 * NoSoftRst-, it will help in restoring the original state
> +	 * (saved locally in 'vdev->pm_save').
> +	 */
> +	vfio_pci_set_power_state(vdev, PCI_D0);
> +	up_write(&vdev->memory_lock);
> +
>  	/*
>  	 * If INTx is enabled, then mask INTx before going into the runtime
>  	 * suspended state and unmask the same in the runtime resume.
> @@ -386,6 +398,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	struct pci_dev *pdev = vdev->pdev;
>  	struct vfio_pci_dummy_resource *dummy_res, *tmp;
>  	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
> +	bool do_resume = false;
>  	int i, bar;
>  
>  	/* For needs_reset */
> @@ -393,6 +406,25 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  
>  	/*
>  	 * This function can be invoked while the power state is non-D0.
> +	 * This non-D0 power state can be with or without runtime PM.
> +	 * Increment the usage count corresponding to pm_runtime_put()
> +	 * called during setting of 'pm_runtime_engaged'. The device will
> +	 * wake up if it has already gone into the suspended state.
> +	 * Otherwise, the next vfio_pci_set_power_state() will change the
> +	 * device power state to D0.
> +	 */
> +	down_write(&vdev->memory_lock);
> +	if (vdev->pm_runtime_engaged) {
> +		vdev->pm_runtime_engaged = false;
> +		pm_runtime_get_noresume(&pdev->dev);
> +		do_resume = true;
> +	}
> +	up_write(&vdev->memory_lock);
> +
> +	if (do_resume)
> +		pm_runtime_resume(&pdev->dev);
> +
> +	/*
>  	 * This function calls __pci_reset_function_locked() which internally
>  	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
>  	 * fail if the power state is non-D0. Also, for the devices which
> @@ -1190,6 +1222,99 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>  
> +static int vfio_pci_pm_validate_flags(u32 flags)
> +{
> +	if (!flags)
> +		return -EINVAL;
> +	/* Only valid flags should be set */
> +	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> +		return -EINVAL;
> +	/* Both enter and exit should not be set */
> +	if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
> +	    (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
> +				    void __user *arg, size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct vfio_device_feature_power_management vfio_pm = { 0 };
> +	int ret = 0;
> +
> +	ret = vfio_check_feature(flags, argsz,
> +				 VFIO_DEVICE_FEATURE_SET |
> +				 VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(vfio_pm));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (flags & VFIO_DEVICE_FEATURE_GET) {
> +		down_read(&vdev->memory_lock);
> +		if (vdev->pm_runtime_engaged)
> +			vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
> +		else
> +			vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
> +		up_read(&vdev->memory_lock);
> +
> +		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
> +			return -EFAULT;
> +
> +		return 0;
> +	}
> +
> +	if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm)))
> +		return -EFAULT;
> +
> +	ret = vfio_pci_pm_validate_flags(vfio_pm.flags);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * The vdev power related flags are protected with 'memory_lock'
> +	 * semaphore.
> +	 */
> +	if (vfio_pm.flags & VFIO_PM_LOW_POWER_ENTER) {
> +		vfio_pci_zap_and_down_write_memory_lock(vdev);
> +		if (vdev->pm_runtime_engaged) {
> +			up_write(&vdev->memory_lock);
> +			return -EINVAL;
> +		}
> +
> +		vdev->pm_runtime_engaged = true;
> +		up_write(&vdev->memory_lock);
> +		pm_runtime_put(&pdev->dev);
> +	} else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
> +		down_write(&vdev->memory_lock);
> +		if (!vdev->pm_runtime_engaged) {
> +			up_write(&vdev->memory_lock);
> +			return -EINVAL;
> +		}
> +
> +		vdev->pm_runtime_engaged = false;
> +		pm_runtime_get_noresume(&pdev->dev);
> +		up_write(&vdev->memory_lock);
> +		ret = pm_runtime_resume(&pdev->dev);
> +		if (ret < 0) {
> +			down_write(&vdev->memory_lock);
> +			if (!vdev->pm_runtime_engaged) {
> +				vdev->pm_runtime_engaged = true;
> +				pm_runtime_put_noidle(&pdev->dev);
> +			}
> +			up_write(&vdev->memory_lock);
> +			return ret;
> +		}
> +	} else {
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
>  static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>  				       void __user *arg, size_t argsz)
>  {
> @@ -1224,6 +1349,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT:
> +		return vfio_pci_core_feature_pm(device, flags, arg, argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> @@ -1234,31 +1361,47 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>  			   size_t count, loff_t *ppos, bool iswrite)
>  {
>  	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	int ret;
>  
>  	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
>  		return -EINVAL;
>  
> +	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
> +	if (ret < 0) {
> +		pci_info_ratelimited(vdev->pdev, "runtime resume failed %d\n",
> +				     ret);
> +		return -EIO;
> +	}
> +
>  	switch (index) {
>  	case VFIO_PCI_CONFIG_REGION_INDEX:
> -		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> +		break;
>  
>  	case VFIO_PCI_ROM_REGION_INDEX:
>  		if (iswrite)
> -			return -EINVAL;
> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
> +			ret = -EINVAL;
> +		else
> +			ret = vfio_pci_bar_rw(vdev, buf, count, ppos, false);
> +		break;
>  
>  	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
> +		break;
>  
>  	case VFIO_PCI_VGA_REGION_INDEX:
> -		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
> +		break;
> +
>  	default:
>  		index -= VFIO_PCI_NUM_REGIONS;
> -		return vdev->region[index].ops->rw(vdev, buf,
> +		ret = vdev->region[index].ops->rw(vdev, buf,
>  						   count, ppos, iswrite);
> +		break;
>  	}
>  
> -	return -EINVAL;
> +	pm_runtime_put(&vdev->pdev->dev);
> +	return ret;
>  }
>  
>  ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> @@ -2157,6 +2300,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		goto err_unlock;
>  	}
>  
> +	/*
> +	 * Some of the devices in the dev_set can be in the runtime suspended
> +	 * state. Increment the usage count for all the devices in the dev_set
> +	 * before reset and decrement the same after reset.
> +	 */
> +	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
> +	if (ret)
> +		goto err_unlock;
> +
>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>  		/*
>  		 * Test whether all the affected devices are contained by the
> @@ -2212,6 +2364,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		else
>  			mutex_unlock(&cur->vma_lock);
>  	}
> +
> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> +		pm_runtime_put(&cur->pdev->dev);
>  err_unlock:
>  	mutex_unlock(&dev_set->lock);
>  	return ret;
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index cdfd328ba6b1..bf4823b008f9 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -125,6 +125,7 @@ struct vfio_pci_core_device {
>  	bool			nointx;
>  	bool			needs_pm_restore;
>  	bool			pm_intx_masked;
> +	bool			pm_runtime_engaged;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;



* Re: [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend
  2022-07-06 15:39   ` Alex Williamson
@ 2022-07-08  9:21     ` Abhishek Sahu
  2022-07-08 15:45       ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-08  9:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/6/2022 9:09 PM, Alex Williamson wrote:
> On Fri, 1 Jul 2022 16:38:09 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> This patch adds INTx handling during runtime suspend/resume.
>> All the suspend/resume related code for the user to put the device
>> into the low power state will be added in subsequent patches.
>>
>> The INTx are shared among devices. Whenever any INTx interrupt comes
> 
> "The INTx lines may be shared..."
> 
>> for the VFIO devices, then vfio_intx_handler() will be called for each
>> device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx()
> 
> "...device sharing the interrupt."
> 
>> and checks if the interrupt has been generated for the current device.
>> Now, if the device is already in the D3cold state, then the config space
>> can not be read. Attempt to read config space in D3cold state can
>> cause system unresponsiveness in a few systems. To prevent this, mask
>> INTx in runtime suspend callback and unmask the same in runtime resume
>> callback. If INTx has been already masked, then no handling is needed
>> in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and
>> vfio_pci_intx_mask() has been updated to return true if INTx has been
>> masked inside this function.
>>
>> For the runtime suspend that is triggered when there is no user of
>> the VFIO device, is_intx() will return false and these callbacks
>> won't do anything.
>>
>> The MSI/MSI-X are not shared so similar handling should not be
>> needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal()
>> without doing any device-specific config access. When the user performs
>> any config access or IOCTL after receiving the eventfd notification,
>> then the device will be moved to the D0 state first before
>> servicing any request.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_core.c  | 37 +++++++++++++++++++++++++++----
>>  drivers/vfio/pci/vfio_pci_intrs.c |  6 ++++-
>>  include/linux/vfio_pci_core.h     |  3 ++-
>>  3 files changed, 40 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index a0d69ddaf90d..5948d930449b 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>  	return ret;
>>  }
>>  
>> +#ifdef CONFIG_PM
>> +static int vfio_pci_core_runtime_suspend(struct device *dev)
>> +{
>> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>> +
>> +	/*
>> +	 * If INTx is enabled, then mask INTx before going into the runtime
>> +	 * suspended state and unmask the same in the runtime resume.
>> +	 * If INTx has already been masked by the user, then
>> +	 * vfio_pci_intx_mask() will return false and in that case, INTx
>> +	 * should not be unmasked in the runtime resume.
>> +	 */
>> +	vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev));
>> +
>> +	return 0;
>> +}
>> +
>> +static int vfio_pci_core_runtime_resume(struct device *dev)
>> +{
>> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>> +
>> +	if (vdev->pm_intx_masked)
>> +		vfio_pci_intx_unmask(vdev);
>> +
>> +	return 0;
>> +}
>> +#endif /* CONFIG_PM */
>> +
>>  /*
>> - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
>> - * so use structure without any callbacks.
>> - *
>>   * The pci-driver core runtime PM routines always save the device state
>>   * before going into suspended state. If the device is going into low power
>>   * state with only with runtime PM ops, then no explicit handling is needed
>>   * for the devices which have NoSoftRst-.
>>   */
>> -static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
>> +static const struct dev_pm_ops vfio_pci_core_pm_ops = {
>> +	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
>> +			   vfio_pci_core_runtime_resume,
>> +			   NULL)
>> +};
>>  
>>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>>  {
>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
>> index 6069a11fb51a..1a37db99df48 100644
>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
>> @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
>>  		eventfd_signal(vdev->ctx[0].trigger, 1);
>>  }
>>  
>> -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>> +/* Returns true if INTx has been masked by this function. */
>> +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>>  {
>>  	struct pci_dev *pdev = vdev->pdev;
>>  	unsigned long flags;
>> +	bool intx_masked = false;
>>  
>>  	spin_lock_irqsave(&vdev->irqlock, flags);
>>  
>> @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>>  			disable_irq_nosync(pdev->irq);
>>  
>>  		vdev->ctx[0].masked = true;
>> +		intx_masked = true;
>>  	}
>>  
>>  	spin_unlock_irqrestore(&vdev->irqlock, flags);
>> +	return intx_masked;
>>  }
> 
> 
> There's certainly another path through this function that masks the
> interrupt, which makes the definition of this return value a bit
> confusing.

 For our case we should not hit that path, but we can return
 intx_masked as true from that path as well to keep the return value
 consistent.

> Wouldn't it be simpler not to overload the masked flag on
> the interrupt context like this and instead set a new flag on the vdev
> under irqlock to indicate the device is unable to generate interrupts.
> The irq handler would add a test of this flag before any tests that
> would access the device.  Thanks,
> 
> Alex
>  

 We will set this flag inside the runtime_suspend callback, but the
 device can be in a non-D3cold state (for example, if the user has
 disabled D3cold explicitly through sysfs, if D3cold is not supported
 on the platform, etc.). Also, in the D3cold-supported case, the
 device will be in D0 until the PCI core moves it into D3cold. In
 these cases, there is a possibility that the device can generate an
 interrupt. If we add a check in the IRQ handler, then we won't check
 and clear the IRQ status, but the interrupt line will still be
 asserted, which can cause interrupt flooding.

 This was the reason for disabling the interrupt itself instead of
 checking a flag in the IRQ handler.

 Thanks,
 Abhishek

>>  /*
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index 23c176d4b073..cdfd328ba6b1 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -124,6 +124,7 @@ struct vfio_pci_core_device {
>>  	bool			needs_reset;
>>  	bool			nointx;
>>  	bool			needs_pm_restore;
>> +	bool			pm_intx_masked;
>>  	struct pci_saved_state	*pci_saved_state;
>>  	struct pci_saved_state	*pm_save;
>>  	int			ioeventfds_nr;
>> @@ -147,7 +148,7 @@ struct vfio_pci_core_device {
>>  #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
>>  #define irq_is(vdev, type) (vdev->irq_type == type)
>>  
>> -extern void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
>> +extern bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
>>  extern void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
>>  
>>  extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev,
> 



* Re: [PATCH v4 2/6] vfio: Add a new device feature for the power management
  2022-07-06 15:39   ` Alex Williamson
@ 2022-07-08  9:39     ` Abhishek Sahu
  2022-07-08 16:36       ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-08  9:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/6/2022 9:09 PM, Alex Williamson wrote:
> On Fri, 1 Jul 2022 16:38:10 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
>> for the power management in the header file. The implementation for the
>> same will be added in the subsequent patches.
>>
>> With the standard registers, all power states cannot be achieved. The
>> platform-based power management needs to be involved to go into the
>> lowest power state. For all the platform-based power management, this
>> device feature can be used.
>>
>> This device feature uses flags to specify the different operations. In
>> the future, if any more power management functionality is needed then
>> a new flag can be added to it. It supports both GET and SET operations.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 55 insertions(+)
>>
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 733a1cddde30..7e00de5c21ea 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>  };
>>  
>> +/*
>> + * Perform power management-related operations for the VFIO device.
>> + *
>> + * The low power feature uses platform-based power management to move the
>> + * device into the low power state.  This low power state is device-specific.
>> + *
>> + * This device feature uses flags to specify the different operations.
>> + * It supports both the GET and SET operations.
>> + *
>> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
>> + *   state with platform-based power management.  This low power state will be
>> + *   internal to the VFIO driver and the user will not come to know which power
>> + *   state is chosen.  Once the user has moved the VFIO device into the low
>> + *   power state, then the user should not do any device access without moving
>> + *   the device out of the low power state.
> 
> Except we're wrapping device accesses to make this possible.  This
> should probably describe how any discrete access will wake the device
> but ongoing access through mmaps will generate user faults.
> 

 Sure. I will add that details also.

>> + *
>> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
>> + *    state.  This flag should only be set if the user has previously put the
>> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.
> 
> Indenting.
> 
 
 I will fix this.

>> + *
>> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
>> + *
>> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
>> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
>> + *   the host side, then the device will be moved out of the low power state
>> + *   without the user's guest driver involvement.  Some devices require the
>> + *   user's guest driver involvement for each low-power entry.  If this flag is
>> + *   set, then the re-entry to the low power state will be disabled, and the
>> + *   host kernel will not move the device again into the low power state.
>> + *   The VFIO driver internally maintains a list of devices for which low
>> + *   power re-entry is disabled by default and for those devices, the
>> + *   re-entry will be disabled even if the user has not set this flag
>> + *   explicitly.
> 
> Wrong polarity.  The kernel should not maintain the policy.  By default
> every wakeup, whether from host kernel accesses or via user accesses
> that do a pm-get should signal a wakeup to userspace.  Userspace needs
> to opt-out of that wakeup to let the kernel automatically re-enter low
> power and userspace needs to maintain the policy for which devices it
> wants that to occur.
> 
 
 Okay. So that means, on the kernel side, we don’t have to maintain
 the list which currently contains the NVIDIA device IDs. Also, in our
 updated approach, opting out of that wake-up means that the user has
 not provided an eventfd in the feature SET ioctl. Correct?
 
>> + *
>> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
>> + *
>> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
>> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
>> + *
>> + * - If the device is in a normal power state currently, then
>> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
>> + *   power re-entry is disabled by default.  If the device is in the low power
>> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
>> + *   according to the current transition.
> 
> Very confusing semantics.
> 
> What if the feature SET ioctl took an eventfd and that eventfd was one
> time use.  Calling the ioctl would setup the eventfd to notify the user
> on wakeup and call pm-put.  Any access to the device via host, ioctl,
> or region would be wrapped in pm-get/put and the pm-resume handler
> would perform the matching pm-get to balance the feature SET and signal
> the eventfd. 

 This seems a better option. It will help in making the ioctl simpler,
 and we won’t have to add a separate index for PME, which I added in
 patch 6.

> If the user opts-out by not providing a wakeup eventfd,
> then the pm-resume handler does not perform a pm-get. Possibly we
> could even allow mmap access if a wake-up eventfd is provided.

 Sorry, I am not clear on this mmap part. We currently invalidate
 mappings before going into runtime suspend. Now, if the user tries
 to do mmap, then do we need some extra handling in the fault
 handler? I need your help in understanding this part.

> The
> feature GET ioctl would be used to exit low power behavior and would be
> a no-op if the wakeup eventfd had already been signaled.  Thanks,
>
 
 I will use the GET ioctl for low power exit instead of returning the
 current status.
 
 Regards,
 Abhishek

> Alex
> 
>> + */
>> +struct vfio_device_feature_power_management {
>> +	__u32	flags;
>> +#define VFIO_PM_LOW_POWER_ENTER			(1 << 0)
>> +#define VFIO_PM_LOW_POWER_EXIT			(1 << 1)
>> +#define VFIO_PM_LOW_POWER_REENTERY_DISABLE	(1 << 2)
>> +	__u32	reserved;
>> +};
>> +
>> +#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3
>> +
>>  /* -------- API for Type1 VFIO IOMMU -------- */
>>  
>>  /**
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call
  2022-07-06 15:40   ` Alex Williamson
@ 2022-07-08  9:43     ` Abhishek Sahu
  2022-07-08 16:49       ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-08  9:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/6/2022 9:10 PM, Alex Williamson wrote:
> On Fri, 1 Jul 2022 16:38:11 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> The vfio-pci based driver will have runtime power management
>> support where the user can put the device into the low power state
>> and then PCI devices can go into the D3cold state. If the device is
>> in the low power state and the user issues any IOCTL, then the
>> device should be moved out of the low power state first. Once
>> the IOCTL is serviced, then it can go into the low power state again.
>> The runtime PM framework manages this with help of usage count.
>>
>> One option was to add the runtime PM related API's inside vfio-pci
>> driver but some IOCTL (like VFIO_DEVICE_FEATURE) can follow a
>> different path and more IOCTL can be added in the future. Also, the
>> runtime PM will be added for vfio-pci based drivers variant currently,
>> but the other VFIO based drivers can use the same in the
>> future. So, this patch adds the runtime calls runtime-related API in
>> the top-level IOCTL function itself.
>>
>> For the VFIO drivers which do not have runtime power management
>> support currently, the runtime PM API's won't be invoked. Only for
>> vfio-pci based drivers currently, the runtime PM API's will be invoked
>> to increment and decrement the usage count.
> 
> Variant drivers can easily opt-out of runtime pm support by performing
> a gratuitous pm-get in their device-open function.
>  

 Do I need to add this line to the commit message?
 
>> Taking this usage count incremented while servicing IOCTL will make
>> sure that the user won't put the device into low power state when any
>> other IOCTL is being serviced in parallel. Let's consider the
>> following scenario:
>>
>>  1. Some other IOCTL is called.
>>  2. The user has opened another device instance and called the power
>>     management IOCTL for the low power entry.
>>  3. The power management IOCTL moves the device into the low power state.
>>  4. The other IOCTL finishes.
>>
>> If we don't keep the usage count incremented then the device
>> access will happen between step 3 and 4 while the device has already
>> gone into the low power state.
>>
>> The runtime PM API's should not be invoked for
>> VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
>> the runtime power management entry and exit for the VFIO device.
> 
> I think the one-shot interface I proposed in the previous patch avoids
> the need for special handling for these feature ioctls.  Thanks,
> 

 Okay. So, for the low power exit case (meaning the feature GET ioctl
 in the updated case) also, we will trigger the eventfd. Correct?

 Thanks,
 Abhishek
 
> Alex
>  
>> The pm_runtime_resume_and_get() will be the first call so its error
>> should not be propagated to user space directly. For example, if
>> pm_runtime_resume_and_get() can return -EINVAL for the cases where the
>> user has passed the correct argument. So the
>> pm_runtime_resume_and_get() errors have been masked behind -EIO.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/vfio.c | 82 ++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 74 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index 61e71c1154be..61a8d9f7629a 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -32,6 +32,7 @@
>>  #include <linux/vfio.h>
>>  #include <linux/wait.h>
>>  #include <linux/sched/signal.h>
>> +#include <linux/pm_runtime.h>
>>  #include "vfio.h"
>>  
>>  #define DRIVER_VERSION	"0.3"
>> @@ -1333,6 +1334,39 @@ static const struct file_operations vfio_group_fops = {
>>  	.release	= vfio_group_fops_release,
>>  };
>>  
>> +/*
>> + * Wrapper around pm_runtime_resume_and_get().
>> + * Return error code on failure or 0 on success.
>> + */
>> +static inline int vfio_device_pm_runtime_get(struct vfio_device *device)
>> +{
>> +	struct device *dev = device->dev;
>> +
>> +	if (dev->driver && dev->driver->pm) {
>> +		int ret;
>> +
>> +		ret = pm_runtime_resume_and_get(dev);
>> +		if (ret < 0) {
>> +			dev_info_ratelimited(dev,
>> +				"vfio: runtime resume failed %d\n", ret);
>> +			return -EIO;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Wrapper around pm_runtime_put().
>> + */
>> +static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
>> +{
>> +	struct device *dev = device->dev;
>> +
>> +	if (dev->driver && dev->driver->pm)
>> +		pm_runtime_put(dev);
>> +}
>> +
>>  /*
>>   * VFIO Device fd
>>   */
>> @@ -1607,6 +1641,8 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>>  {
>>  	size_t minsz = offsetofend(struct vfio_device_feature, flags);
>>  	struct vfio_device_feature feature;
>> +	int ret = 0;
>> +	u16 feature_cmd;
>>  
>>  	if (copy_from_user(&feature, arg, minsz))
>>  		return -EFAULT;
>> @@ -1626,28 +1662,51 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>>  	    (feature.flags & VFIO_DEVICE_FEATURE_GET))
>>  		return -EINVAL;
>>  
>> -	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
>> +	feature_cmd = feature.flags & VFIO_DEVICE_FEATURE_MASK;
>> +
>> +	/*
>> +	 * The VFIO_DEVICE_FEATURE_POWER_MANAGEMENT itself performs the runtime
>> +	 * power management entry and exit for the VFIO device, so the runtime
>> +	 * PM API's should not be called for this feature.
>> +	 */
>> +	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT) {
>> +		ret = vfio_device_pm_runtime_get(device);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	switch (feature_cmd) {
>>  	case VFIO_DEVICE_FEATURE_MIGRATION:
>> -		return vfio_ioctl_device_feature_migration(
>> +		ret = vfio_ioctl_device_feature_migration(
>>  			device, feature.flags, arg->data,
>>  			feature.argsz - minsz);
>> +		break;
>>  	case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE:
>> -		return vfio_ioctl_device_feature_mig_device_state(
>> +		ret = vfio_ioctl_device_feature_mig_device_state(
>>  			device, feature.flags, arg->data,
>>  			feature.argsz - minsz);
>> +		break;
>>  	default:
>>  		if (unlikely(!device->ops->device_feature))
>> -			return -EINVAL;
>> -		return device->ops->device_feature(device, feature.flags,
>> -						   arg->data,
>> -						   feature.argsz - minsz);
>> +			ret = -EINVAL;
>> +		else
>> +			ret = device->ops->device_feature(
>> +				device, feature.flags, arg->data,
>> +				feature.argsz - minsz);
>> +		break;
>>  	}
>> +
>> +	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT)
>> +		vfio_device_pm_runtime_put(device);
>> +
>> +	return ret;
>>  }
>>  
>>  static long vfio_device_fops_unl_ioctl(struct file *filep,
>>  				       unsigned int cmd, unsigned long arg)
>>  {
>>  	struct vfio_device *device = filep->private_data;
>> +	int ret;
>>  
>>  	switch (cmd) {
>>  	case VFIO_DEVICE_FEATURE:
>> @@ -1655,7 +1714,14 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
>>  	default:
>>  		if (unlikely(!device->ops->ioctl))
>>  			return -EINVAL;
>> -		return device->ops->ioctl(device, cmd, arg);
>> +
>> +		ret = vfio_device_pm_runtime_get(device);
>> +		if (ret)
>> +			return ret;
>> +
>> +		ret = device->ops->ioctl(device, cmd, arg);
>> +		vfio_device_pm_runtime_put(device);
>> +		return ret;
>>  	}
>>  }
>>  
> 



* Re: [PATCH v4 6/6] vfio/pci: Add support for virtual PME
  2022-07-06 15:40   ` Alex Williamson
@ 2022-07-08  9:45     ` Abhishek Sahu
  0 siblings, 0 replies; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-08  9:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/6/2022 9:10 PM, Alex Williamson wrote:
> On Fri, 1 Jul 2022 16:38:14 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> If the PCI device is in low power state and the device requires
>> wake-up, then it can generate PME (Power Management Events). Mostly
>> these PME events will be propagated to the root port and then the
>> root port will generate the system interrupt. Then the OS should
>> identify the device which generated the PME and should resume
>> the device.
>>
>> We can implement a similar virtual PME framework where if the device
>> already went into the runtime suspended state and then there is any
>> wake-up on the host side, then it will send the virtual PME
>> notification to the guest. This virtual PME will be helpful for the cases
>> where the device will not be suspended again if there is any wake-up
>> triggered by the host. Following is the overall approach regarding
>> the virtual PME.
>>
>> 1. Add one more event like VFIO_PCI_ERR_IRQ_INDEX named
>>    VFIO_PCI_PME_IRQ_INDEX and do the required code changes to get/set
>>    this new IRQ.
>>
>> 2. From the guest side, the guest needs to enable eventfd for the
>>    virtual PME notification.
>>
>> 3. In the vfio-pci driver, the PME support bits are currently
>>    virtualized and set to 0. We can set PME capability support for all
>>    the power states. This PME capability support is independent of the
>>    physical PME support.
>>
>> 4. The PME enable (PME_En bit in Power Management Control/Status
>>    Register) and PME status (PME_Status bit in Power Management
>>    Control/Status Register) are also virtualized currently.
>>    The write support for PME_En bit can be enabled.
>>
>> 5. The PME_Status bit is a write-1-clear bit where the write with
>>    zero value will have no effect and write with 1 value will clear the
>>    bit. The write for this bit will be trapped inside
>>    vfio_pm_config_write() similar to PCI_PM_CTRL write for PM_STATES.
>>
>> 6. When the host gets a request for resuming the device other than from
>>    low power exit feature IOCTL, then PME_Status bit will be set.
>>    According to [PCIe v5 7.5.2.2],
>>      "PME_Status - This bit is Set when the Function would normally
>>       generate a PME signal. The value of this bit is not affected by
>>       the value of the PME_En bit."
>>
>>    So even if PME_En bit is not set, we can set PME_Status bit.
>>
>> 7. If the guest has enabled PME_En and registered for PME events
>>    through eventfd, then the usage count will be incremented to prevent
>>    the device to go into the suspended state and notify the guest through
>>    eventfd trigger.
>>
>> The virtual PME can help in handling physical PME also. When
>> physical PME comes, then also the runtime resume will be called. If
>> the guest has registered for virtual PME, then it will be sent in this
>> case also.
>>
>> * Implementation for handling the virtual PME on the hypervisor:
>>
>> If we take the implementation in Linux OS, then during runtime suspend
>> time, then it calls __pci_enable_wake(). It internally enables PME
>> through pci_pme_active() and also enables the ACPI side wake-up
>> through platform_pci_set_wakeup(). To handle the PME, the hypervisor has
>> the following two options:
>>
>> 1. Create a virtual root port for the VFIO device and trigger
>>    interrupt when the PME comes. It will call pcie_pme_irq() which will
>>    resume the device.
>>
>> 2. Create a virtual ACPI _PRW resource and associate it with the device
>>    itself. In _PRW, any GPE (General Purpose Event) can be assigned for
>>    the wake-up. When PME comes, then GPE can be triggered by the
>>    hypervisor. GPE interrupt will call pci_acpi_wake_dev() function
>>    internally and it will resume the device.
> 
> Do we really need to implement PME emulation in the kernel or is it
> sufficient for userspace to simply register a one-shot eventfd when
> SET'ing the low power feature and QEMU can provide the PME emulation
> based on that signaling?  Thanks,
> 
> Alex
> 

 It seems this can be implemented in QEMU instead of the kernel.
 I will drop this patch.

 Thanks,
 Abhishek
 
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_config.c | 39 +++++++++++++++++++++------
>>  drivers/vfio/pci/vfio_pci_core.c   | 43 ++++++++++++++++++++++++------
>>  drivers/vfio/pci/vfio_pci_intrs.c  | 18 +++++++++++++
>>  include/linux/vfio_pci_core.h      |  2 ++
>>  include/uapi/linux/vfio.h          |  1 +
>>  5 files changed, 87 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>> index 21a4743d011f..a06375a03758 100644
>> --- a/drivers/vfio/pci/vfio_pci_config.c
>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>> @@ -719,6 +719,20 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
>>  	if (count < 0)
>>  		return count;
>>  
>> +	/*
>> +	 * PME_STATUS is write-1-clear bit. If PME_STATUS is 1, then clear the
>> +	 * bit in vconfig. The PME_STATUS is in the upper byte of the control
>> +	 * register and user can do single byte write also.
>> +	 */
>> +	if (offset <= PCI_PM_CTRL + 1 && offset + count > PCI_PM_CTRL + 1) {
>> +		if (le32_to_cpu(val) &
>> +		    (PCI_PM_CTRL_PME_STATUS >> (offset - PCI_PM_CTRL) * 8)) {
>> +			__le16 *ctrl = (__le16 *)&vdev->vconfig
>> +					[vdev->pm_cap_offset + PCI_PM_CTRL];
>> +			*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
>> +		}
>> +	}
>> +
>>  	if (offset == PCI_PM_CTRL) {
>>  		pci_power_t state;
>>  
>> @@ -771,14 +785,16 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
>>  	 * the user change power state, but we trap and initiate the
>>  	 * change ourselves, so the state bits are read-only.
>>  	 *
>> -	 * The guest can't process PME from D3cold so virtualize PME_Status
>> -	 * and PME_En bits. The vconfig bits will be cleared during device
>> -	 * capability initialization.
>> +	 * The guest can't process physical PME from D3cold so virtualize
>> +	 * PME_Status and PME_En bits. These bits will be used for the
>> +	 * virtual PME between host and guest. The vconfig bits will be
>> +	 * updated during device capability initialization. PME_Status is
>> +	 * write-1-clear bit, so it is read-only. We trap and update the
>> +	 * vconfig bit manually during write.
>>  	 */
>>  	p_setd(perm, PCI_PM_CTRL,
>>  	       PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
>> -	       ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
>> -		 PCI_PM_CTRL_STATE_MASK));
>> +	       ~(PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_PME_STATUS));
>>  
>>  	return 0;
>>  }
>> @@ -1454,8 +1470,13 @@ static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
>>  	__le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
>>  	__le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];
>>  
>> -	/* Clear vconfig PME_Support, PME_Status, and PME_En bits */
>> -	*pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
>> +	/*
>> +	 * Set the vconfig PME_Support bits. The PME_Status is being used for
>> +	 * virtual PME support and is not dependent upon the physical
>> +	 * PME support.
>> +	 */
>> +	*pmc |= cpu_to_le16(PCI_PM_CAP_PME_MASK);
>> +	/* Clear vconfig PME_Support and PME_En bits */
>>  	*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
>>  }
>>  
>> @@ -1582,8 +1603,10 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
>>  		if (ret)
>>  			return ret;
>>  
>> -		if (cap == PCI_CAP_ID_PM)
>> +		if (cap == PCI_CAP_ID_PM) {
>> +			vdev->pm_cap_offset = pos;
>>  			vfio_update_pm_vconfig_bytes(vdev, pos);
>> +		}
>>  
>>  		prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
>>  		pos = next;
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 1ddaaa6ccef5..6c1225bc2aeb 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -319,14 +319,35 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
>>  	 *   the low power state or closed the device.
>>  	 * - If there is device access on the host side.
>>  	 *
>> -	 * For the second case, check if re-entry to the low power state is
>> -	 * allowed. If not, then increment the usage count so that runtime PM
>> -	 * framework won't suspend the device and set the 'pm_runtime_resumed'
>> -	 * flag.
>> +	 * For the second case:
>> +	 * - The virtual PME_STATUS bit will be set. If PME_ENABLE bit is set
>> +	 *   and user has registered for virtual PME events, then send the PME
>> +	 *   virtual PME event.
>> +	 * - Check if re-entry to the low power state is not allowed.
>> +	 *
>> +	 * For the above conditions, increment the usage count so that
>> +	 * runtime PM framework won't suspend the device and set the
>> +	 * 'pm_runtime_resumed' flag.
>>  	 */
>> -	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
>> -		pm_runtime_get_noresume(dev);
>> -		vdev->pm_runtime_resumed = true;
>> +	if (vdev->pm_runtime_engaged) {
>> +		bool pme_triggered = false;
>> +		__le16 *ctrl = (__le16 *)&vdev->vconfig
>> +				[vdev->pm_cap_offset + PCI_PM_CTRL];
>> +
>> +		*ctrl |= cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
>> +		if (le16_to_cpu(*ctrl) & PCI_PM_CTRL_PME_ENABLE) {
>> +			mutex_lock(&vdev->igate);
>> +			if (vdev->pme_trigger) {
>> +				pme_triggered = true;
>> +				eventfd_signal(vdev->pme_trigger, 1);
>> +			}
>> +			mutex_unlock(&vdev->igate);
>> +		}
>> +
>> +		if (!vdev->pm_runtime_reentry_allowed || pme_triggered) {
>> +			pm_runtime_get_noresume(dev);
>> +			vdev->pm_runtime_resumed = true;
>> +		}
>>  	}
>>  	up_write(&vdev->memory_lock);
>>  
>> @@ -586,6 +607,10 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>>  		eventfd_ctx_put(vdev->req_trigger);
>>  		vdev->req_trigger = NULL;
>>  	}
>> +	if (vdev->pme_trigger) {
>> +		eventfd_ctx_put(vdev->pme_trigger);
>> +		vdev->pme_trigger = NULL;
>> +	}
>>  	mutex_unlock(&vdev->igate);
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
>> @@ -639,7 +664,8 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>>  	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
>>  		if (pci_is_pcie(vdev->pdev))
>>  			return 1;
>> -	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
>> +	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX ||
>> +		   irq_type == VFIO_PCI_PME_IRQ_INDEX) {
>>  		return 1;
>>  	}
>>  
>> @@ -985,6 +1011,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>  		switch (info.index) {
>>  		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
>>  		case VFIO_PCI_REQ_IRQ_INDEX:
>> +		case VFIO_PCI_PME_IRQ_INDEX:
>>  			break;
>>  		case VFIO_PCI_ERR_IRQ_INDEX:
>>  			if (pci_is_pcie(vdev->pdev))
>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
>> index 1a37db99df48..db4180687a74 100644
>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
>> @@ -639,6 +639,17 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_core_device *vdev,
>>  					       count, flags, data);
>>  }
>>  
>> +static int vfio_pci_set_pme_trigger(struct vfio_pci_core_device *vdev,
>> +				    unsigned index, unsigned start,
>> +				    unsigned count, uint32_t flags, void *data)
>> +{
>> +	if (index != VFIO_PCI_PME_IRQ_INDEX || start != 0 || count > 1)
>> +		return -EINVAL;
>> +
>> +	return vfio_pci_set_ctx_trigger_single(&vdev->pme_trigger,
>> +					       count, flags, data);
>> +}
>> +
>>  int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
>>  			    unsigned index, unsigned start, unsigned count,
>>  			    void *data)
>> @@ -688,6 +699,13 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
>>  			break;
>>  		}
>>  		break;
>> +	case VFIO_PCI_PME_IRQ_INDEX:
>> +		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
>> +		case VFIO_IRQ_SET_ACTION_TRIGGER:
>> +			func = vfio_pci_set_pme_trigger;
>> +			break;
>> +		}
>> +		break;
>>  	}
>>  
>>  	if (!func)
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index 18cc83b767b8..ee2646d820c2 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -102,6 +102,7 @@ struct vfio_pci_core_device {
>>  	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
>>  	u8			*pci_config_map;
>>  	u8			*vconfig;
>> +	u8			pm_cap_offset;
>>  	struct perm_bits	*msi_perm;
>>  	spinlock_t		irqlock;
>>  	struct mutex		igate;
>> @@ -133,6 +134,7 @@ struct vfio_pci_core_device {
>>  	int			ioeventfds_nr;
>>  	struct eventfd_ctx	*err_trigger;
>>  	struct eventfd_ctx	*req_trigger;
>> +	struct eventfd_ctx	*pme_trigger;
>>  	struct list_head	dummy_resources_list;
>>  	struct mutex		ioeventfds_lock;
>>  	struct list_head	ioeventfds_list;
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 7e00de5c21ea..08170950d655 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -621,6 +621,7 @@ enum {
>>  	VFIO_PCI_MSIX_IRQ_INDEX,
>>  	VFIO_PCI_ERR_IRQ_INDEX,
>>  	VFIO_PCI_REQ_IRQ_INDEX,
>> +	VFIO_PCI_PME_IRQ_INDEX,
>>  	VFIO_PCI_NUM_IRQS
>>  };
>>  
> 



* Re: [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend
  2022-07-08  9:21     ` Abhishek Sahu
@ 2022-07-08 15:45       ` Alex Williamson
  2022-07-11  9:18         ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-08 15:45 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 8 Jul 2022 14:51:30 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/6/2022 9:09 PM, Alex Williamson wrote:
> > On Fri, 1 Jul 2022 16:38:09 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> This patch adds INTx handling during runtime suspend/resume.
> >> All the suspend/resume related code for the user to put the device
> >> into the low power state will be added in subsequent patches.
> >>
> >> The INTx are shared among devices. Whenever any INTx interrupt comes  
> > 
> > "The INTx lines may be shared..."
> >   
> >> for the VFIO devices, then vfio_intx_handler() will be called for each
> >> device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx()  
> > 
> > "...device sharing the interrupt."
> >   
> >> and checks if the interrupt has been generated for the current device.
> >> Now, if the device is already in the D3cold state, then the config space
> >> can not be read. Attempt to read config space in D3cold state can
> >> cause system unresponsiveness in a few systems. To prevent this, mask
> >> INTx in runtime suspend callback and unmask the same in runtime resume
> >> callback. If INTx has been already masked, then no handling is needed
> >> in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and
> >> vfio_pci_intx_mask() has been updated to return true if INTx has been
> >> masked inside this function.
> >>
> >> For the runtime suspend which is triggered for the no user of VFIO
> >> device, the is_intx() will return false and these callbacks won't do
> >> anything.
> >>
> >> The MSI/MSI-X are not shared so similar handling should not be
> >> needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal()
> >> without doing any device-specific config access. When the user performs
> >> any config access or IOCTL after receiving the eventfd notification,
> >> then the device will be moved to the D0 state first before
> >> servicing any request.
> >>
> >> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >> ---
> >>  drivers/vfio/pci/vfio_pci_core.c  | 37 +++++++++++++++++++++++++++----
> >>  drivers/vfio/pci/vfio_pci_intrs.c |  6 ++++-
> >>  include/linux/vfio_pci_core.h     |  3 ++-
> >>  3 files changed, 40 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >> index a0d69ddaf90d..5948d930449b 100644
> >> --- a/drivers/vfio/pci/vfio_pci_core.c
> >> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >> @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
> >>  	return ret;
> >>  }
> >>  
> >> +#ifdef CONFIG_PM
> >> +static int vfio_pci_core_runtime_suspend(struct device *dev)
> >> +{
> >> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> >> +
> >> +	/*
> >> +	 * If INTx is enabled, then mask INTx before going into the runtime
> >> +	 * suspended state and unmask the same in the runtime resume.
> >> +	 * If INTx has already been masked by the user, then
> >> +	 * vfio_pci_intx_mask() will return false and in that case, INTx
> >> +	 * should not be unmasked in the runtime resume.
> >> +	 */
> >> +	vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev));
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static int vfio_pci_core_runtime_resume(struct device *dev)
> >> +{
> >> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> >> +
> >> +	if (vdev->pm_intx_masked)
> >> +		vfio_pci_intx_unmask(vdev);
> >> +
> >> +	return 0;
> >> +}
> >> +#endif /* CONFIG_PM */
> >> +
> >>  /*
> >> - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
> >> - * so use structure without any callbacks.
> >> - *
> >>   * The pci-driver core runtime PM routines always save the device state
> >>   * before going into suspended state. If the device is going into low power
> >>   * state with only with runtime PM ops, then no explicit handling is needed
> >>   * for the devices which have NoSoftRst-.
> >>   */
> >> -static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
> >> +static const struct dev_pm_ops vfio_pci_core_pm_ops = {
> >> +	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
> >> +			   vfio_pci_core_runtime_resume,
> >> +			   NULL)
> >> +};
> >>  
> >>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
> >>  {
> >> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> >> index 6069a11fb51a..1a37db99df48 100644
> >> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> >> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> >> @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
> >>  		eventfd_signal(vdev->ctx[0].trigger, 1);
> >>  }
> >>  
> >> -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
> >> +/* Returns true if INTx has been masked by this function. */
> >> +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
> >>  {
> >>  	struct pci_dev *pdev = vdev->pdev;
> >>  	unsigned long flags;
> >> +	bool intx_masked = false;
> >>  
> >>  	spin_lock_irqsave(&vdev->irqlock, flags);
> >>  
> >> @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
> >>  			disable_irq_nosync(pdev->irq);
> >>  
> >>  		vdev->ctx[0].masked = true;
> >> +		intx_masked = true;
> >>  	}
> >>  
> >>  	spin_unlock_irqrestore(&vdev->irqlock, flags);
> >> +	return intx_masked;
> >>  }  
> > 
> > 
> > There's certainly another path through this function that masks the
> > interrupt, which makes the definition of this return value a bit
> > confusing.  
> 
>  For our case, we should not hit that path. But we can return
>  intx_masked as true from that path as well, to align the return
>  value.
> 
> > Wouldn't it be simpler not to overload the masked flag on
> > the interrupt context like this and instead set a new flag on the vdev
> > under irqlock to indicate the device is unable to generate interrupts.
> > The irq handler would add a test of this flag before any tests that
> > would access the device.  Thanks,
> > 
> > Alex
> >    
> 
>  We will set this flag inside the runtime_suspend callback, but the
>  device can be in a non-D3cold state (for example, if the user has
>  disabled D3cold explicitly through sysfs, if D3cold is not supported
>  on the platform, etc.). Also, in the D3cold-supported case, the
>  device will be in D0 until the PCI core moves the device into
>  D3cold. In this case, there is a possibility that the device can
>  generate an interrupt. If we add a check in the IRQ handler, then we
>  won’t check and clear the IRQ status, but the interrupt line will
>  still be asserted, which can cause interrupt flooding.
> 
>  This was the reason for disabling the interrupt itself instead of
>  checking a flag in the IRQ handler.

Ok, maybe this is largely a clarification of the return value of
vfio_pci_intx_mask().  I think what you're looking for is whether the
context mask was changed, rather than whether the interrupt is masked,
which I think avoids the confusion regarding whether the first branch
should return true or false.  So the variable should be something like
"masked_changed" and the comment changed to "Returns true if the INTx
vfio_pci_irq_ctx.masked value is changed".

Testing is_intx() outside of the irqlock is potentially racy, so do we
need to add the pm-get/put wrappers on ioctls first to avoid the
possibility that pm-suspend can race a SET_IRQS ioctl?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/6] vfio: Add a new device feature for the power management
  2022-07-08  9:39     ` Abhishek Sahu
@ 2022-07-08 16:36       ` Alex Williamson
  2022-07-11  9:43         ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-08 16:36 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 8 Jul 2022 15:09:22 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/6/2022 9:09 PM, Alex Williamson wrote:
> > On Fri, 1 Jul 2022 16:38:10 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
> >> for the power management in the header file. The implementation for the
> >> same will be added in the subsequent patches.
> >>
> >> With the standard registers, all power states cannot be achieved. The
> >> platform-based power management needs to be involved to go into the
> >> lowest power state. For all the platform-based power management, this
> >> device feature can be used.
> >>
> >> This device feature uses flags to specify the different operations. In
> >> the future, if any more power management functionality is needed then
> >> a new flag can be added to it. It supports both GET and SET operations.
> >>
> >> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >> ---
> >>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 55 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 733a1cddde30..7e00de5c21ea 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
> >>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> >>  };
> >>  
> >> +/*
> >> + * Perform power management-related operations for the VFIO device.
> >> + *
> >> + * The low power feature uses platform-based power management to move the
> >> + * device into the low power state.  This low power state is device-specific.
> >> + *
> >> + * This device feature uses flags to specify the different operations.
> >> + * It supports both the GET and SET operations.
> >> + *
> >> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
> >> + *   state with platform-based power management.  This low power state will be
> >> + *   internal to the VFIO driver and the user will not come to know which power
> >> + *   state is chosen.  Once the user has moved the VFIO device into the low
> >> + *   power state, then the user should not do any device access without moving
> >> + *   the device out of the low power state.  
> > 
> > Except we're wrapping device accesses to make this possible.  This
> > should probably describe how any discrete access will wake the device
> > but ongoing access through mmaps will generate user faults.
> >   
> 
>  Sure. I will add that details also.
> 
> >> + *
> >> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
> >> + *    state.  This flag should only be set if the user has previously put the
> >> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.  
> > 
> > Indenting.
> >   
>  
>  I will fix this.
> 
> >> + *
> >> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
> >> + *
> >> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
> >> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
> >> + *   the host side, then the device will be moved out of the low power state
> >> + *   without the user's guest driver involvement.  Some devices require the
> >> + *   user's guest driver involvement for each low-power entry.  If this flag is
> >> + *   set, then the re-entry to the low power state will be disabled, and the
> >> + *   host kernel will not move the device again into the low power state.
> >> + *   The VFIO driver internally maintains a list of devices for which low
> >> + *   power re-entry is disabled by default and for those devices, the
> >> + *   re-entry will be disabled even if the user has not set this flag
> >> + *   explicitly.  
> > 
> > Wrong polarity.  The kernel should not maintain the policy.  By default
> > every wakeup, whether from host kernel accesses or via user accesses
> > that do a pm-get should signal a wakeup to userspace.  Userspace needs
> > to opt-out of that wakeup to let the kernel automatically re-enter low
> > power and userspace needs to maintain the policy for which devices it
> > wants that to occur.
> >   
>  
>  Okay. So that means, on the kernel side, we don't have to maintain
>  the list which currently contains the NVIDIA device IDs. Also, in our
>  updated approach, opting out of that wake-up means that the user
>  has not provided an eventfd in the feature SET ioctl. Correct?

Yes, I'm imagining that if the user hasn't provided a one-shot wake-up
eventfd, that's the opt-out for being notified of device wakes.  For
example, pm-resume would have something like:

	
	if (vdev->pm_wake_eventfd) {
		eventfd_signal(vdev->pm_wake_eventfd, 1);
		vdev->pm_wake_eventfd = NULL;
		pm_runtime_get_noresume(dev);
	}

(eventfd pseudo handling substantially simplified)

So w/o a wake-up eventfd, the user would need to call the pm feature
exit ioctl to elevate the pm reference to prevent it going back to low
power.  The pm feature exit ioctl would be optional if a wake eventfd is
provided, so some piece of the eventfd context would need to remain to
determine whether a pm-get is necessary.
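A rough userspace model of this one-shot scheme (the struct, field, and function names are illustrative stand-ins, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed one-shot wake eventfd: the first wake
 * signals the user and takes a pm reference so the device stays
 * resumed; a later feature-exit ioctl only needs its own reference
 * when no eventfd consumed the wake first. */
struct pm_model {
	bool wake_eventfd_armed;  /* user provided a one-shot eventfd */
	int usage_count;          /* stand-in for the runtime PM usage count */
	int eventfd_signals;      /* how many times the user was notified */
};

/* What pm-resume would do, per the pseudo code above. */
static void model_pm_resume(struct pm_model *m)
{
	if (m->wake_eventfd_armed) {
		m->eventfd_signals++;            /* eventfd_signal(...) */
		m->wake_eventfd_armed = false;   /* one-shot: disarm */
		m->usage_count++;                /* pm_runtime_get_noresume() */
	}
}

/* What the pm feature-exit ioctl would do: take a reference only if
 * the wake path has not already taken one. */
static void model_feature_exit(struct pm_model *m)
{
	if (m->wake_eventfd_armed) {
		m->wake_eventfd_armed = false;
		m->usage_count++;  /* balance the pm-put from feature enter */
	}
}
```

Either path leaves exactly one reference held, which is the piece of eventfd context that must survive to decide whether the exit ioctl's pm-get is necessary.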

> >> + *
> >> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
> >> + *
> >> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
> >> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
> >> + *
> >> + * - If the device is in a normal power state currently, then
> >> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
> >> + *   power re-entry is disabled by default.  If the device is in the low power
> >> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
> >> + *   according to the current transition.  
> > 
> > Very confusing semantics.
> > 
> > What if the feature SET ioctl took an eventfd and that eventfd was one
> > time use.  Calling the ioctl would setup the eventfd to notify the user
> > on wakeup and call pm-put.  Any access to the device via host, ioctl,
> > or region would be wrapped in pm-get/put and the pm-resume handler
> > would perform the matching pm-get to balance the feature SET and signal
> > the eventfd.   
> 
>  This seems like a better option. It will help make the ioctl
>  simpler, and we won't have to add the separate index for PME which I
>  added in patch 6.
> 
> > If the user opts-out by not providing a wakeup eventfd,
> > then the pm-resume handler does not perform a pm-get. Possibly we
> > could even allow mmap access if a wake-up eventfd is provided.  
> 
>  Sorry, I am not clear on the mmap part. We currently invalidate the
>  mappings before going into runtime suspend. Now, if the user tries to
>  access the mmap, do we need some extra handling in the fault handler?
>  I need your help in understanding this part.

The option that I'm thinking about is if the mmap fault handler is
wrapped in a pm-get/put then we could actually populate the mmap.  In
the case where the pm-get triggers the wake-eventfd in pm-resume, the
device doesn't return to low power when the mmap fault handler calls
pm-put.  This possibly allows that we could actually invalidate mmaps on
pm-suspend rather than in the pm feature enter ioctl, essentially the
same as we're doing for intx.  I wonder though if this allows the
possibility that we just bounce between mmap fault and pm-suspend.  So
long as some work can be done, for instance the pm-suspend occurs
asynchronously to the pm-put, this might be ok.
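Sketched as a userspace model (all names illustrative, not the real fault handler), the wrapping would look roughly like:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of wrapping the mmap fault handler in pm-get/put.  With a
 * one-shot wake eventfd armed, the pm-get in the fault path triggers a
 * resume that leaves an extra reference behind, so the pm-put at the
 * end of the fault does not drop the device back to low power. */
struct fault_model {
	bool wake_armed;
	int usage;           /* runtime PM usage count stand-in */
	bool suspended;
	bool mmap_populated;
};

static void fm_get(struct fault_model *m)
{
	m->usage++;
	if (m->suspended) {          /* pm-resume path */
		m->suspended = false;
		if (m->wake_armed) { /* one-shot wake: keep a reference */
			m->wake_armed = false;
			m->usage++;  /* pm_runtime_get_noresume() */
		}
	}
}

static void fm_put(struct fault_model *m)
{
	if (--m->usage == 0)
		m->suspended = true; /* idle: autosuspend */
}

static void fm_fault(struct fault_model *m)
{
	fm_get(m);
	m->mmap_populated = true;    /* safe: device is in D0 here */
	fm_put(m);
}
```

The key property is that the pm-put at the end of the fault does not drop the last reference when a wake eventfd was armed, so the fault/suspend bounce only arises in the opt-out case.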

> > The
> > feature GET ioctl would be used to exit low power behavior and would be
> > a no-op if the wakeup eventfd had already been signaled.  Thanks,
> >  
>  
>  I will use the GET ioctl for low power exit instead of returning the
>  current status.

Note that Yishai is proposing a device DMA dirty logging feature where
the stop and start are exposed via SET on separate features, rather
than SET/GET.  We should probably maintain some consistency between
these use cases.  Possibly we might even want two separate pm enter
ioctls, one with the wake eventfd and one without.  I think this is the
sort of thing Jason is describing for future expansion of the dirty
tracking uAPI.  Thanks,

Alex



* Re: [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call
  2022-07-08  9:43     ` Abhishek Sahu
@ 2022-07-08 16:49       ` Alex Williamson
  2022-07-11  9:50         ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-08 16:49 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 8 Jul 2022 15:13:16 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/6/2022 9:10 PM, Alex Williamson wrote:
> > On Fri, 1 Jul 2022 16:38:11 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> The vfio-pci based driver will have runtime power management
> >> support where the user can put the device into the low power state
> >> and then PCI devices can go into the D3cold state. If the device is
> >> in the low power state and the user issues any IOCTL, then the
> >> device should be moved out of the low power state first. Once
> >> the IOCTL is serviced, then it can go into the low power state again.
> >> The runtime PM framework manages this with help of usage count.
> >>
> >> One option was to add the runtime PM related API's inside vfio-pci
> >> driver but some IOCTL (like VFIO_DEVICE_FEATURE) can follow a
> >> different path and more IOCTL can be added in the future. Also, the
> >> runtime PM will be added for vfio-pci based drivers variant currently,
> >> but the other VFIO based drivers can use the same in the
> >> future. So, this patch adds the runtime calls runtime-related API in
> >> the top-level IOCTL function itself.
> >>
> >> For the VFIO drivers which do not have runtime power management
> >> support currently, the runtime PM API's won't be invoked. Only for
> >> vfio-pci based drivers currently, the runtime PM API's will be invoked
> >> to increment and decrement the usage count.  
> > 
> > Variant drivers can easily opt-out of runtime pm support by performing
> > a gratuitous pm-get in their device-open function.
> >    
> 
>  Do I need to add this line in the commit message?

Maybe I misinterpreted, but my initial read was that there was some
sort of opt-in, which there is by providing pm-runtime support in the
driver, which vfio-pci-core does for all vfio-pci variant drivers.  But
there's also an opt-out, where a vfio-pci variant driver might not want
pm-runtime support and could accomplish that by bumping the
pm reference count on device-open such that the user cannot trigger a
pm-suspend.
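In usage-count terms, the opt-out amounts to something like this toy model (names are illustrative, not the variant-driver API):

```c
#include <assert.h>

/* Toy model of the opt-out: a variant driver that does a gratuitous
 * pm-get in its open path keeps the usage count elevated, so the
 * device can never be idled into low power while open. */
struct dev_model { int usage; };

static void model_open(struct dev_model *d, int opt_out)
{
	if (opt_out)
		d->usage++;  /* gratuitous pm_runtime_get() in device-open */
}

/* Low-power entry only succeeds once no references are held. */
static int model_low_power_enter(struct dev_model *d)
{
	return d->usage == 0;  /* 1 = device is allowed to suspend */
}
```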

> >> Taking this usage count incremented while servicing IOCTL will make
> >> sure that the user won't put the device into low power state when any
> >> other IOCTL is being serviced in parallel. Let's consider the
> >> following scenario:
> >>
> >>  1. Some other IOCTL is called.
> >>  2. The user has opened another device instance and called the power
> >>     management IOCTL for the low power entry.
> >>  3. The power management IOCTL moves the device into the low power state.
> >>  4. The other IOCTL finishes.
> >>
> >> If we don't keep the usage count incremented then the device
> >> access will happen between step 3 and 4 while the device has already
> >> gone into the low power state.
> >>
> >> The runtime PM API's should not be invoked for
> >> VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
> >> the runtime power management entry and exit for the VFIO device.  
> > 
> > I think the one-shot interface I proposed in the previous patch avoids
> > the need for special handling for these feature ioctls.  Thanks,
> >   
> 
>  Okay. So for the low power exit case (i.e., the feature GET ioctl in
>  the updated approach) also, we will trigger the eventfd. Correct?

If all ioctls are wrapped in pm-get/put, then the pm feature exit ioctl
would wakeup and signal the eventfd via the pm-get.  I'm not sure if
it's worthwhile to try to suppress this eventfd.  Do you foresee any
issues?  Thanks,

Alex
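The four-step race in the quoted commit message can be reduced to a usage-count model (illustrative names only):

```c
#include <assert.h>

/* Toy model of the race: while any ioctl holds the runtime PM usage
 * count, a concurrent low-power entry from another device instance
 * cannot suspend the device underneath it. */
struct ioctl_model {
	int usage;     /* stand-in for the runtime PM usage count */
	int suspended;
};

static void ioctl_enter(struct ioctl_model *m) { m->usage++; } /* pm-get */
static void ioctl_exit(struct ioctl_model *m)  { m->usage--; } /* pm-put */

/* Low-power entry only idles the device once no ioctl is in flight. */
static int try_low_power(struct ioctl_model *m)
{
	if (m->usage > 0)
		return 0;  /* blocked until the in-flight ioctl completes */
	m->suspended = 1;
	return 1;
}
```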



* Re: [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend
  2022-07-08 15:45       ` Alex Williamson
@ 2022-07-11  9:18         ` Abhishek Sahu
  2022-07-11 12:57           ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-11  9:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/8/2022 9:15 PM, Alex Williamson wrote:
> On Fri, 8 Jul 2022 14:51:30 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 7/6/2022 9:09 PM, Alex Williamson wrote:
>>> On Fri, 1 Jul 2022 16:38:09 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> This patch adds INTx handling during runtime suspend/resume.
>>>> All the suspend/resume related code for the user to put the device
>>>> into the low power state will be added in subsequent patches.
>>>>
>>>> The INTx are shared among devices. Whenever any INTx interrupt comes  
>>>
>>> "The INTx lines may be shared..."
>>>   
>>>> for the VFIO devices, then vfio_intx_handler() will be called for each
>>>> device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx()  
>>>
>>> "...device sharing the interrupt."
>>>   
>>>> and checks if the interrupt has been generated for the current device.
>>>> Now, if the device is already in the D3cold state, then the config space
>>>> can not be read. Attempt to read config space in D3cold state can
>>>> cause system unresponsiveness in a few systems. To prevent this, mask
>>>> INTx in runtime suspend callback and unmask the same in runtime resume
>>>> callback. If INTx has been already masked, then no handling is needed
>>>> in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and
>>>> vfio_pci_intx_mask() has been updated to return true if INTx has been
>>>> masked inside this function.
>>>>
>>>> For the runtime suspend which is triggered when there is no user of
>>>> the VFIO device, is_intx() will return false and these callbacks
>>>> won't do anything.
>>>>
>>>> The MSI/MSI-X are not shared so similar handling should not be
>>>> needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal()
>>>> without doing any device-specific config access. When the user performs
>>>> any config access or IOCTL after receiving the eventfd notification,
>>>> then the device will be moved to the D0 state first before
>>>> servicing any request.
>>>>
>>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>>>> ---
>>>>  drivers/vfio/pci/vfio_pci_core.c  | 37 +++++++++++++++++++++++++++----
>>>>  drivers/vfio/pci/vfio_pci_intrs.c |  6 ++++-
>>>>  include/linux/vfio_pci_core.h     |  3 ++-
>>>>  3 files changed, 40 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>>>> index a0d69ddaf90d..5948d930449b 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_core.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>>> @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +#ifdef CONFIG_PM
>>>> +static int vfio_pci_core_runtime_suspend(struct device *dev)
>>>> +{
>>>> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>>>> +
>>>> +	/*
>>>> +	 * If INTx is enabled, then mask INTx before going into the runtime
>>>> +	 * suspended state and unmask the same in the runtime resume.
>>>> +	 * If INTx has already been masked by the user, then
>>>> +	 * vfio_pci_intx_mask() will return false and in that case, INTx
>>>> +	 * should not be unmasked in the runtime resume.
>>>> +	 */
>>>> +	vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev));
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static int vfio_pci_core_runtime_resume(struct device *dev)
>>>> +{
>>>> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>>>> +
>>>> +	if (vdev->pm_intx_masked)
>>>> +		vfio_pci_intx_unmask(vdev);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +#endif /* CONFIG_PM */
>>>> +
>>>>  /*
>>>> - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
>>>> - * so use structure without any callbacks.
>>>> - *
>>>>   * The pci-driver core runtime PM routines always save the device state
>>>>   * before going into suspended state. If the device is going into low power
>>>>   * state with only with runtime PM ops, then no explicit handling is needed
>>>>   * for the devices which have NoSoftRst-.
>>>>   */
>>>> -static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
>>>> +static const struct dev_pm_ops vfio_pci_core_pm_ops = {
>>>> +	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
>>>> +			   vfio_pci_core_runtime_resume,
>>>> +			   NULL)
>>>> +};
>>>>  
>>>>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>>>>  {
>>>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
>>>> index 6069a11fb51a..1a37db99df48 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
>>>> @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
>>>>  		eventfd_signal(vdev->ctx[0].trigger, 1);
>>>>  }
>>>>  
>>>> -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>>>> +/* Returns true if INTx has been masked by this function. */
>>>> +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>>>>  {
>>>>  	struct pci_dev *pdev = vdev->pdev;
>>>>  	unsigned long flags;
>>>> +	bool intx_masked = false;
>>>>  
>>>>  	spin_lock_irqsave(&vdev->irqlock, flags);
>>>>  
>>>> @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>>>>  			disable_irq_nosync(pdev->irq);
>>>>  
>>>>  		vdev->ctx[0].masked = true;
>>>> +		intx_masked = true;
>>>>  	}
>>>>  
>>>>  	spin_unlock_irqrestore(&vdev->irqlock, flags);
>>>> +	return intx_masked;
>>>>  }  
>>>
>>>
>>> There's certainly another path through this function that masks the
>>> interrupt, which makes the definition of this return value a bit
>>> confusing.  
>>
>>  In our case we should not hit that path, but we can return
>>  intx_masked as true from that path as well to align the return value.
>>
>>> Wouldn't it be simpler not to overload the masked flag on
>>> the interrupt context like this and instead set a new flag on the vdev
>>> under irqlock to indicate the device is unable to generate interrupts.
>>> The irq handler would add a test of this flag before any tests that
>>> would access the device.  Thanks,
>>>
>>> Alex
>>>    
>>
>>  We will set this flag inside the runtime_suspend callback, but the
>>  device can be in a non-D3cold state (for example, if the user has
>>  disabled D3cold explicitly via sysfs, if D3cold is not supported on
>>  the platform, etc.). Also, in the D3cold supported case, the device
>>  will be in D0 until the PCI core moves the device into D3cold. In
>>  this case, there is a possibility that the device can generate an
>>  interrupt. If we add the check in the IRQ handler, then we won't
>>  check and clear the IRQ status, but the interrupt line will still be
>>  asserted, which can cause interrupt flooding.
>>
>>  This was the reason for disabling the interrupt itself instead of
>>  checking a flag in the IRQ handler.
> 
> Ok, maybe this is largely a clarification of the return value of
> vfio_pci_intx_mask().  I think what you're looking for is whether the
> context mask was changed, rather than whether the interrupt is masked,
> which I think avoids the confusion regarding whether the first branch
> should return true or false.  So the variable should be something like
> "masked_changed" and the comment changed to "Returns true if the INTx
> vfio_pci_irq_ctx.masked value is changed".
> 

 Thanks Alex.
 I will rename the variable and update the comment.
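For reference, a minimal userspace model of the agreed semantics — the helper reports whether it changed the masked value, not whether INTx ended up masked (locking and the real disable path omitted):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the rename: return whether this call changed
 * vfio_pci_irq_ctx.masked, so the caller knows whether to unmask
 * on resume.  Names and locking are simplified stand-ins. */
struct intx_ctx { bool masked; };

/* Returns true if the INTx masked value is changed by this call. */
static bool model_intx_mask(struct intx_ctx *ctx)
{
	bool masked_changed = false;

	if (!ctx->masked) {  /* e.g. disable_irq_nosync() would go here */
		ctx->masked = true;
		masked_changed = true;
	}
	return masked_changed;
}
```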

> Testing is_intx() outside of the irqlock is potentially racy, so do we
> need to add the pm-get/put wrappers on ioctls first to avoid the
> possibility that pm-suspend can race a SET_IRQS ioctl?  Thanks,
> 
> Alex
> 
 
 Even after adding this patch, runtime suspend will not be supported
 for devices with users. It will be supported only after patch 4 in
 this patch series. So with this patch, pm-suspend can be called only
 in the cases where the vfio device has no user, and there we should
 not see the race condition.

 But I can move the patch related to the pm-get/put wrappers on
 ioctls before this patch since both are independent.

 Regards,
 Abhishek


* Re: [PATCH v4 2/6] vfio: Add a new device feature for the power management
  2022-07-08 16:36       ` Alex Williamson
@ 2022-07-11  9:43         ` Abhishek Sahu
  2022-07-11 13:04           ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-11  9:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/8/2022 10:06 PM, Alex Williamson wrote:
> On Fri, 8 Jul 2022 15:09:22 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 7/6/2022 9:09 PM, Alex Williamson wrote:
>>> On Fri, 1 Jul 2022 16:38:10 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
>>>> for the power management in the header file. The implementation for the
>>>> same will be added in the subsequent patches.
>>>>
>>>> With the standard registers, all power states cannot be achieved. The
>>>> platform-based power management needs to be involved to go into the
>>>> lowest power state. For all the platform-based power management, this
>>>> device feature can be used.
>>>>
>>>> This device feature uses flags to specify the different operations. In
>>>> the future, if any more power management functionality is needed then
>>>> a new flag can be added to it. It supports both GET and SET operations.
>>>>
>>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>>>> ---
>>>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 55 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index 733a1cddde30..7e00de5c21ea 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>>>  };
>>>>  
>>>> +/*
>>>> + * Perform power management-related operations for the VFIO device.
>>>> + *
>>>> + * The low power feature uses platform-based power management to move the
>>>> + * device into the low power state.  This low power state is device-specific.
>>>> + *
>>>> + * This device feature uses flags to specify the different operations.
>>>> + * It supports both the GET and SET operations.
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
>>>> + *   state with platform-based power management.  This low power state will be
>>>> + *   internal to the VFIO driver and the user will not come to know which power
>>>> + *   state is chosen.  Once the user has moved the VFIO device into the low
>>>> + *   power state, then the user should not do any device access without moving
>>>> + *   the device out of the low power state.  
>>>
>>> Except we're wrapping device accesses to make this possible.  This
>>> should probably describe how any discrete access will wake the device
>>> but ongoing access through mmaps will generate user faults.
>>>   
>>
>>  Sure. I will add that details also.
>>
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
>>>> + *    state.  This flag should only be set if the user has previously put the
>>>> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.  
>>>
>>> Indenting.
>>>   
>>  
>>  I will fix this.
>>
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
>>>> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
>>>> + *   the host side, then the device will be moved out of the low power state
>>>> + *   without the user's guest driver involvement.  Some devices require the
>>>> + *   user's guest driver involvement for each low-power entry.  If this flag is
>>>> + *   set, then the re-entry to the low power state will be disabled, and the
>>>> + *   host kernel will not move the device again into the low power state.
>>>> + *   The VFIO driver internally maintains a list of devices for which low
>>>> + *   power re-entry is disabled by default and for those devices, the
>>>> + *   re-entry will be disabled even if the user has not set this flag
>>>> + *   explicitly.  
>>>
>>> Wrong polarity.  The kernel should not maintain the policy.  By default
>>> every wakeup, whether from host kernel accesses or via user accesses
>>> that do a pm-get should signal a wakeup to userspace.  Userspace needs
>>> to opt-out of that wakeup to let the kernel automatically re-enter low
>>> power and userspace needs to maintain the policy for which devices it
>>> wants that to occur.
>>>   
>>  
>>  Okay. So that means, on the kernel side, we don't have to maintain
>>  the list which currently contains the NVIDIA device IDs. Also, in
>>  our updated approach, opting out of that wake-up means that the user
>>  has not provided an eventfd in the feature SET ioctl. Correct?
> 
> Yes, I'm imagining that if the user hasn't provided a one-shot wake-up
> eventfd, that's the opt-out for being notified of device wakes.  For
> example, pm-resume would have something like:
> 
> 	
> 	if (vdev->pm_wake_eventfd) {
> 		eventfd_signal(vdev->pm_wake_eventfd, 1);
> 		vdev->pm_wake_eventfd = NULL;
> 		pm_runtime_get_noresume(dev);
> 	}
> 
> (eventfd pseudo handling substantially simplified)
> 
> So w/o a wake-up eventfd, the user would need to call the pm feature
> exit ioctl to elevate the pm reference to prevent it going back to low
> power.  The pm feature exit ioctl would be optional if a wake eventfd is
> provided, so some piece of the eventfd context would need to remain to
> determine whether a pm-get is necessary.
> 
>>>> + *
>>>> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
>>>> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
>>>> + *
>>>> + * - If the device is in a normal power state currently, then
>>>> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
>>>> + *   power re-entry is disabled by default.  If the device is in the low power
>>>> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
>>>> + *   according to the current transition.  
>>>
>>> Very confusing semantics.
>>>
>>> What if the feature SET ioctl took an eventfd and that eventfd was one
>>> time use.  Calling the ioctl would setup the eventfd to notify the user
>>> on wakeup and call pm-put.  Any access to the device via host, ioctl,
>>> or region would be wrapped in pm-get/put and the pm-resume handler
>>> would perform the matching pm-get to balance the feature SET and signal
>>> the eventfd.   
>>
>>  This seems like a better option. It will help make the ioctl
>>  simpler, and we won't have to add the separate index for PME which I
>>  added in patch 6.
>>
>>> If the user opts-out by not providing a wakeup eventfd,
>>> then the pm-resume handler does not perform a pm-get. Possibly we
>>> could even allow mmap access if a wake-up eventfd is provided.  
>>
>>  Sorry, I am not clear on the mmap part. We currently invalidate the
>>  mappings before going into runtime suspend. Now, if the user tries
>>  to access the mmap, do we need some extra handling in the fault
>>  handler?  I need your help in understanding this part.
> 
> The option that I'm thinking about is if the mmap fault handler is
> wrapped in a pm-get/put then we could actually populate the mmap.  In
> the case where the pm-get triggers the wake-eventfd in pm-resume, the
> device doesn't return to low power when the mmap fault handler calls
> pm-put.  This possibly allows that we could actually invalidate mmaps on
> pm-suspend rather than in the pm feature enter ioctl, essentially the
> same as we're doing for intx.  I wonder though if this allows the
> possibility that we just bounce between mmap fault and pm-suspend.  So
> long as some work can be done, for instance the pm-suspend occurs
> asynchronously to the pm-put, this might be ok.
> 

 We can do this. But in the normal use case, the situation should
 never arise where the user accesses any mmapped region after having
 already put the device into D3 (D3hot or D3cold). This can only happen
 if there is some bug in the guest driver or the user is performing a
 wrong sequence. Do we need to add handling to officially support this
 case?

 pm-get can take more than a second to resume some devices; will doing
 this in the fault handler be safe?

 Also, we would add this support only when a wake-eventfd is provided,
 so in the case without a wake-eventfd, mmap access would still
 generate a fault. So we would have different behavior. Will that be
 acceptable?

>>> The
>>> feature GET ioctl would be used to exit low power behavior and would be
>>> a no-op if the wakeup eventfd had already been signaled.  Thanks,
>>>  
>>  
>>  I will use the GET ioctl for low power exit instead of returning the
>>  current status.
> 
> Note that Yishai is proposing a device DMA dirty logging feature where
> the stop and start are exposed via SET on separate features, rather
> than SET/GET.  We should probably maintain some consistency between
> these use cases.  Possibly we might even want two separate pm enter
> ioctls, one with the wake eventfd and one without.  I think this is the
> sort of thing Jason is describing for future expansion of the dirty
> tracking uAPI.  Thanks,
> 
> Alex
> 

 Okay. So, we need to add 3 device features in total.

 VFIO_DEVICE_FEATURE_PM_ENTRY
 VFIO_DEVICE_FEATURE_PM_ENTRY_WITH_WAKEUP
 VFIO_DEVICE_FEATURE_PM_EXIT

 And only the second one needs a structure, which will have a single
 field for the eventfd, and we need to return an error if the wakeup
 eventfd is not provided for the second feature?

 Do we need to support the GET operation for these as well?
 We can skip the GET operation since that won't be very useful.

 Regards,
 Abhishek
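
 For illustration, the three features listed above could look roughly
 like the following; the feature values and the struct layout are
 hypothetical placeholders taken only from this discussion (they were
 still subject to change), not a merged uAPI:

```c
/* Hypothetical sketch of the three proposed device features; names
 * come from the discussion above, values are placeholders. */
#include <assert.h>
#include <stdint.h>

#define VFIO_DEVICE_FEATURE_PM_ENTRY			3
#define VFIO_DEVICE_FEATURE_PM_ENTRY_WITH_WAKEUP	4
#define VFIO_DEVICE_FEATURE_PM_EXIT			5

/* Only the _WITH_WAKEUP variant carries a payload: a single eventfd. */
struct vfio_device_pm_entry_with_wakeup {
	int32_t wakeup_eventfd;
};

/* Models the argument check being asked about: the wakeup variant
 * must be given a plausible eventfd descriptor, otherwise fail. */
static int check_wakeup_arg(int32_t wakeup_eventfd)
{
	if (wakeup_eventfd < 0)
		return -22;	/* -EINVAL */
	return 0;
}
```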

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call
  2022-07-08 16:49       ` Alex Williamson
@ 2022-07-11  9:50         ` Abhishek Sahu
  0 siblings, 0 replies; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-11  9:50 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/8/2022 10:19 PM, Alex Williamson wrote:
> On Fri, 8 Jul 2022 15:13:16 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 7/6/2022 9:10 PM, Alex Williamson wrote:
>>> On Fri, 1 Jul 2022 16:38:11 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> The vfio-pci based driver will have runtime power management
>>>> support where the user can put the device into the low power state
>>>> and then PCI devices can go into the D3cold state. If the device is
>>>> in the low power state and the user issues any IOCTL, then the
>>>> device should be moved out of the low power state first. Once
>>>> the IOCTL is serviced, then it can go into the low power state again.
>>>> The runtime PM framework manages this with help of usage count.
>>>>
>>>> One option was to add the runtime PM related API's inside vfio-pci
>>>> driver but some IOCTL (like VFIO_DEVICE_FEATURE) can follow a
>>>> different path and more IOCTL can be added in the future. Also, the
>>>> runtime PM will be added for vfio-pci based variant drivers currently,
>>>> but the other VFIO based drivers can use the same in the
>>>> future. So, this patch adds the runtime PM related API calls in
>>>> the top-level IOCTL function itself.
>>>>
>>>> For the VFIO drivers which do not have runtime power management
>>>> support currently, the runtime PM API's won't be invoked. Only for
>>>> vfio-pci based drivers currently, the runtime PM API's will be invoked
>>>> to increment and decrement the usage count.  
>>>
>>> Variant drivers can easily opt-out of runtime pm support by performing
>>> a gratuitous pm-get in their device-open function.
>>>    
>>
>>  Do I need to add this line in the commit message?
> 
> Maybe I misinterpreted, but my initial read was that there was some
> sort of opt-in, which there is by providing pm-runtime support in the
> driver, which vfio-pci-core does for all vfio-pci variant drivers.  But
> there's also an opt-out, where a vfio-pci variant driver might not want
> to support pm-runtime support and could accomplish that by bumping the
> pm reference count on device-open such that the user cannot trigger a
> pm-suspend.
> 
>>>> Taking this usage count incremented while servicing IOCTL will make
>>>> sure that the user won't put the device into low power state when any
>>>> other IOCTL is being serviced in parallel. Let's consider the
>>>> following scenario:
>>>>
>>>>  1. Some other IOCTL is called.
>>>>  2. The user has opened another device instance and called the power
>>>>     management IOCTL for the low power entry.
>>>>  3. The power management IOCTL moves the device into the low power state.
>>>>  4. The other IOCTL finishes.
>>>>
>>>> If we don't keep the usage count incremented then the device
>>>> access will happen between step 3 and 4 while the device has already
>>>> gone into the low power state.
>>>>
>>>> The runtime PM API's should not be invoked for
>>>> VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
>>>> the runtime power management entry and exit for the VFIO device.  
>>>
>>> I think the one-shot interface I proposed in the previous patch avoids
>>> the need for special handling for these feature ioctls.  Thanks,
>>>   
>>
>>  Okay. So, for low power exit case (means feature GET ioctl in the
>>  updated case) also, we will trigger eventfd. Correct?
> 
> If all ioctls are wrapped in pm-get/put, then the pm feature exit ioctl
> would wakeup and signal the eventfd via the pm-get.  I'm not sure if
> it's worthwhile to try to suppress this eventfd.  Do you foresee any
> issues?  Thanks,
> 
> Alex
> 

 I think it should be fine. It requires some extra handling on the
 hypervisor or QEMU side, which needs to skip the virtual PME or
 similar notification handling for the cases where the guest has
 actually triggered the low power exit, but that can be easily done.
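
 The usage-count wrapping that this patch is about can be sketched in
 a few lines; the fake_* helpers below are hypothetical stand-ins for
 pm_runtime_resume_and_get()/pm_runtime_put(), used only to show the
 bookkeeping, not the real vfio code:

```c
/* Model of wrapping a top-level ioctl in a runtime-PM get/put pair.
 * The counter plays the runtime PM usage count: while it is elevated,
 * the device cannot runtime-suspend underneath a concurrent ioctl. */
#include <assert.h>

static int usage_count;

static int fake_pm_get(void)
{
	usage_count++;		/* device cannot runtime-suspend now */
	return 0;		/* pm_runtime_resume_and_get() result */
}

static void fake_pm_put(void)
{
	usage_count--;		/* device may runtime-suspend again */
}

/* Top-level ioctl wrapper: wake the device, service the request,
 * then drop the reference. */
static long fake_device_ioctl(long cmd)
{
	long ret;

	if (fake_pm_get())
		return -5;	/* -EIO */
	ret = cmd;		/* stands in for the real ioctl body */
	fake_pm_put();
	return ret;
}
```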
 
 Regards,
 Abhishek

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend
  2022-07-11  9:18         ` Abhishek Sahu
@ 2022-07-11 12:57           ` Alex Williamson
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Williamson @ 2022-07-11 12:57 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Mon, 11 Jul 2022 14:48:34 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/8/2022 9:15 PM, Alex Williamson wrote:
> > On Fri, 8 Jul 2022 14:51:30 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> On 7/6/2022 9:09 PM, Alex Williamson wrote:  
> >>> On Fri, 1 Jul 2022 16:38:09 +0530
> >>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >>>     
> >>>> This patch adds INTx handling during runtime suspend/resume.
> >>>> All the suspend/resume related code for the user to put the device
> >>>> into the low power state will be added in subsequent patches.
> >>>>
> >>>> The INTx are shared among devices. Whenever any INTx interrupt comes    
> >>>
> >>> "The INTx lines may be shared..."
> >>>     
> >>>> for the VFIO devices, then vfio_intx_handler() will be called for each
> >>>> device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx()    
> >>>
> >>> "...device sharing the interrupt."
> >>>     
> >>>> and checks if the interrupt has been generated for the current device.
> >>>> Now, if the device is already in the D3cold state, then the config space
> >>>> can not be read. Attempt to read config space in D3cold state can
> >>>> cause system unresponsiveness in a few systems. To prevent this, mask
> >>>> INTx in runtime suspend callback and unmask the same in runtime resume
> >>>> callback. If INTx has been already masked, then no handling is needed
> >>>> in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and
> >>>> vfio_pci_intx_mask() has been updated to return true if INTx has been
> >>>> masked inside this function.
> >>>>
> >>>> For the runtime suspend which is triggered when there is no user of
> >>>> the VFIO device, is_intx() will return false and these callbacks won't
> >>>> anything.
> >>>>
> >>>> The MSI/MSI-X are not shared so similar handling should not be
> >>>> needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal()
> >>>> without doing any device-specific config access. When the user performs
> >>>> any config access or IOCTL after receiving the eventfd notification,
> >>>> then the device will be moved to the D0 state first before
> >>>> servicing any request.
> >>>>
> >>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >>>> ---
> >>>>  drivers/vfio/pci/vfio_pci_core.c  | 37 +++++++++++++++++++++++++++----
> >>>>  drivers/vfio/pci/vfio_pci_intrs.c |  6 ++++-
> >>>>  include/linux/vfio_pci_core.h     |  3 ++-
> >>>>  3 files changed, 40 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >>>> index a0d69ddaf90d..5948d930449b 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci_core.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >>>> @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
> >>>>  	return ret;
> >>>>  }
> >>>>  
> >>>> +#ifdef CONFIG_PM
> >>>> +static int vfio_pci_core_runtime_suspend(struct device *dev)
> >>>> +{
> >>>> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> >>>> +
> >>>> +	/*
> >>>> +	 * If INTx is enabled, then mask INTx before going into the runtime
> >>>> +	 * suspended state and unmask the same in the runtime resume.
> >>>> +	 * If INTx has already been masked by the user, then
> >>>> +	 * vfio_pci_intx_mask() will return false and in that case, INTx
> >>>> +	 * should not be unmasked in the runtime resume.
> >>>> +	 */
> >>>> +	vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev));
> >>>> +
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>> +static int vfio_pci_core_runtime_resume(struct device *dev)
> >>>> +{
> >>>> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> >>>> +
> >>>> +	if (vdev->pm_intx_masked)
> >>>> +		vfio_pci_intx_unmask(vdev);
> >>>> +
> >>>> +	return 0;
> >>>> +}
> >>>> +#endif /* CONFIG_PM */
> >>>> +
> >>>>  /*
> >>>> - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
> >>>> - * so use structure without any callbacks.
> >>>> - *
> >>>>   * The pci-driver core runtime PM routines always save the device state
> >>>>   * before going into suspended state. If the device is going into low power
> >>>>   * state with only with runtime PM ops, then no explicit handling is needed
> >>>>   * for the devices which have NoSoftRst-.
> >>>>   */
> >>>> -static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
> >>>> +static const struct dev_pm_ops vfio_pci_core_pm_ops = {
> >>>> +	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
> >>>> +			   vfio_pci_core_runtime_resume,
> >>>> +			   NULL)
> >>>> +};
> >>>>  
> >>>>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
> >>>>  {
> >>>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> >>>> index 6069a11fb51a..1a37db99df48 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> >>>> @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
> >>>>  		eventfd_signal(vdev->ctx[0].trigger, 1);
> >>>>  }
> >>>>  
> >>>> -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
> >>>> +/* Returns true if INTx has been masked by this function. */
> >>>> +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
> >>>>  {
> >>>>  	struct pci_dev *pdev = vdev->pdev;
> >>>>  	unsigned long flags;
> >>>> +	bool intx_masked = false;
> >>>>  
> >>>>  	spin_lock_irqsave(&vdev->irqlock, flags);
> >>>>  
> >>>> @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
> >>>>  			disable_irq_nosync(pdev->irq);
> >>>>  
> >>>>  		vdev->ctx[0].masked = true;
> >>>> +		intx_masked = true;
> >>>>  	}
> >>>>  
> >>>>  	spin_unlock_irqrestore(&vdev->irqlock, flags);
> >>>> +	return intx_masked;
> >>>>  }    
> >>>
> >>>
> >>> There's certainly another path through this function that masks the
> >>> interrupt, which makes the definition of this return value a bit
> >>> confusing.    
> >>
> >>  For our case we should not hit that path. But we can return the
> >>  intx_masked true from that path as well to align return value.
> >>  
> >>> Wouldn't it be simpler not to overload the masked flag on
> >>> the interrupt context like this and instead set a new flag on the vdev
> >>> under irqlock to indicate the device is unable to generate interrupts.
> >>> The irq handler would add a test of this flag before any tests that
> >>> would access the device.  Thanks,
> >>>
> >>> Alex
> >>>      
> >>
> >>  We will set this flag inside runtime_suspend callback but the
> >>  device can be in non-D3cold state (For example, if user has
> >>  disabled d3cold explicitly by sysfs, the D3cold is not supported in
> >>  the platform, etc.). Also, in D3cold supported case, the device will
> >>  be in D0 till the PCI core moves the device into D3cold. In this case,
> >>  there is possibility that the device can generate an interrupt.
> >>  If we add check in the IRQ handler, then we won’t check and clear
> >>  the IRQ status, but the interrupt line will still be asserted
> >>  which can cause interrupt flooding.
> >>
> >>  This was the reason for disabling interrupt itself instead of
> >>  checking flag in the IRQ handler.  
> > 
> > Ok, maybe this is largely a clarification of the return value of
> > vfio_pci_intx_mask().  I think what you're looking for is whether the
> > context mask was changed, rather than whether the interrupt is masked,
> > which I think avoids the confusion regarding whether the first branch
> > should return true or false.  So the variable should be something like
> > "masked_changed" and the comment changed to "Returns true if the INTx
> > vfio_pci_irq_ctx.masked value is changed".
> >   
> 
>  Thanks Alex.
>  I will rename the variable and update the comment.
> 
> > Testing is_intx() outside of the irqlock is potentially racy, so do we
> > need to add the pm-get/put wrappers on ioctls first to avoid the
> > possibility that pm-suspend can race a SET_IRQS ioctl?  Thanks,
> > 
> > Alex
> >   
>  
>  Even after adding this patch, the runtime suspend will not be supported
>  for the device with users. It will be supported only after patch 4 in
>  this patch series. So with this patch, the pm-suspend can be called only
>  for the cases where vfio device has no user and there we should not see
>  the race condition.

We should also not see is_intx() true for unused devices.  Thanks,

Alex
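
The suspend/resume pairing in the quoted patch, with the
"masked_changed" return-value semantics agreed above, can be modeled
as follows; the fake_* names are illustrative stand-ins for
is_intx()/vfio_pci_intx_mask()/vfio_pci_intx_unmask(), not the driver
code itself:

```c
/* Sketch of the INTx mask/unmask pairing across runtime suspend and
 * resume.  A user-initiated mask must survive a suspend/resume cycle,
 * so resume only unmasks when suspend was the one that masked. */
#include <assert.h>
#include <stdbool.h>

struct fake_pci_dev {
	bool intx_enabled;	/* is_intx() */
	bool ctx_masked;	/* vfio_pci_irq_ctx.masked */
	bool pm_intx_masked;	/* set only by runtime suspend */
};

/* Per the rename discussed above: returns true only if the masked
 * state was changed by this call. */
static bool fake_intx_mask(struct fake_pci_dev *v)
{
	if (v->ctx_masked)
		return false;	/* already masked by the user */
	v->ctx_masked = true;
	return true;
}

static void fake_runtime_suspend(struct fake_pci_dev *v)
{
	v->pm_intx_masked = v->intx_enabled && fake_intx_mask(v);
}

static void fake_runtime_resume(struct fake_pci_dev *v)
{
	if (v->pm_intx_masked)
		v->ctx_masked = false;	/* vfio_pci_intx_unmask() */
}
```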


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/6] vfio: Add a new device feature for the power management
  2022-07-11  9:43         ` Abhishek Sahu
@ 2022-07-11 13:04           ` Alex Williamson
  2022-07-11 17:30             ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-11 13:04 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Mon, 11 Jul 2022 15:13:13 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/8/2022 10:06 PM, Alex Williamson wrote:
> > On Fri, 8 Jul 2022 15:09:22 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> On 7/6/2022 9:09 PM, Alex Williamson wrote:  
> >>> On Fri, 1 Jul 2022 16:38:10 +0530
> >>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >>>     
> >>>> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
> >>>> for the power management in the header file. The implementation for the
> >>>> same will be added in the subsequent patches.
> >>>>
> >>>> With the standard registers, all power states cannot be achieved. The
> >>>> platform-based power management needs to be involved to go into the
> >>>> lowest power state. For all the platform-based power management, this
> >>>> device feature can be used.
> >>>>
> >>>> This device feature uses flags to specify the different operations. In
> >>>> the future, if any more power management functionality is needed then
> >>>> a new flag can be added to it. It supports both GET and SET operations.
> >>>>
> >>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >>>> ---
> >>>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
> >>>>  1 file changed, 55 insertions(+)
> >>>>
> >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>>> index 733a1cddde30..7e00de5c21ea 100644
> >>>> --- a/include/uapi/linux/vfio.h
> >>>> +++ b/include/uapi/linux/vfio.h
> >>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
> >>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> >>>>  };
> >>>>  
> >>>> +/*
> >>>> + * Perform power management-related operations for the VFIO device.
> >>>> + *
> >>>> + * The low power feature uses platform-based power management to move the
> >>>> + * device into the low power state.  This low power state is device-specific.
> >>>> + *
> >>>> + * This device feature uses flags to specify the different operations.
> >>>> + * It supports both the GET and SET operations.
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
> >>>> + *   state with platform-based power management.  This low power state will be
> >>>> + *   internal to the VFIO driver and the user will not come to know which power
> >>>> + *   state is chosen.  Once the user has moved the VFIO device into the low
> >>>> + *   power state, then the user should not do any device access without moving
> >>>> + *   the device out of the low power state.    
> >>>
> >>> Except we're wrapping device accesses to make this possible.  This
> >>> should probably describe how any discrete access will wake the device
> >>> but ongoing access through mmaps will generate user faults.
> >>>     
> >>
> >>  Sure. I will add that details also.
> >>  
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
> >>>> + *    state.  This flag should only be set if the user has previously put the
> >>>> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.    
> >>>
> >>> Indenting.
> >>>     
> >>  
> >>  I will fix this.
> >>  
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
> >>>> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
> >>>> + *   the host side, then the device will be moved out of the low power state
> >>>> + *   without the user's guest driver involvement.  Some devices require the
> >>>> + *   user's guest driver involvement for each low-power entry.  If this flag is
> >>>> + *   set, then the re-entry to the low power state will be disabled, and the
> >>>> + *   host kernel will not move the device again into the low power state.
> >>>> + *   The VFIO driver internally maintains a list of devices for which low
> >>>> + *   power re-entry is disabled by default and for those devices, the
> >>>> + *   re-entry will be disabled even if the user has not set this flag
> >>>> + *   explicitly.    
> >>>
> >>> Wrong polarity.  The kernel should not maintain the policy.  By default
> >>> every wakeup, whether from host kernel accesses or via user accesses
> >>> that do a pm-get should signal a wakeup to userspace.  Userspace needs
> >>> to opt-out of that wakeup to let the kernel automatically re-enter low
> >>> power and userspace needs to maintain the policy for which devices it
> >>> wants that to occur.
> >>>     
> >>  
> >>  Okay. So that means, in the kernel side, we don’t have to maintain
> >>  the list which currently contains NVIDIA device ID. Also, in our
> >>  updated approach, this opt-out of that wake-up means that user
> >>  has not provided eventfd in the feature SET ioctl. Correct ?  
> > 
> > Yes, I'm imagining that if the user hasn't provided a one-shot wake-up
> > eventfd, that's the opt-out for being notified of device wakes.  For
> > example, pm-resume would have something like:
> > 
> > 	
> > 	if (vdev->pm_wake_eventfd) {
> > 		eventfd_signal(vdev->pm_wake_eventfd, 1);
> > 		vdev->pm_wake_eventfd = NULL;
> > 		pm_runtime_get_noresume(dev);
> > 	}
> > 
> > (eventfd pseudo handling substantially simplified)
> > 
> > So w/o a wake-up eventfd, the user would need to call the pm feature
> > exit ioctl to elevate the pm reference to prevent it going back to low
> > power.  The pm feature exit ioctl would be optional if a wake eventfd is
> > provided, so some piece of the eventfd context would need to remain to
> > determine whether a pm-get is necessary.
> >   
> >>>> + *
> >>>> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
> >>>> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
> >>>> + *
> >>>> + * - If the device is in a normal power state currently, then
> >>>> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
> >>>> + *   power re-entry is disabled by default.  If the device is in the low power
> >>>> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
> >>>> + *   according to the current transition.    
> >>>
> >>> Very confusing semantics.
> >>>
> >>> What if the feature SET ioctl took an eventfd and that eventfd was one
> >>> time use.  Calling the ioctl would setup the eventfd to notify the user
> >>> on wakeup and call pm-put.  Any access to the device via host, ioctl,
> >>> or region would be wrapped in pm-get/put and the pm-resume handler
> >>> would perform the matching pm-get to balance the feature SET and signal
> >>> the eventfd.     
> >>
> >>  This seems a better option. It will help in making the ioctl simpler
> >>  and we don’t have to add a separate index for PME which I added in
> >>  patch 6. 
> >>  
> >>> If the user opts-out by not providing a wakeup eventfd,
> >>> then the pm-resume handler does not perform a pm-get. Possibly we
> >>> could even allow mmap access if a wake-up eventfd is provided.    
> >>
> >>  Sorry. I am not clear on this mmap part. We currently invalidates
> >>  mapping before going into runtime-suspend. Now, if use tries do
> >>  mmap then do we need some extra handling in the fault handler ?
> >>  Need your help in understanding this part.  
> > 
> > The option that I'm thinking about is if the mmap fault handler is
> > wrapped in a pm-get/put then we could actually populate the mmap.  In
> > the case where the pm-get triggers the wake-eventfd in pm-resume, the
> > device doesn't return to low power when the mmap fault handler calls
> > pm-put.  This possibly allows that we could actually invalidate mmaps on
> > pm-suspend rather than in the pm feature enter ioctl, essentially the
> > same as we're doing for intx.  I wonder though if this allows the
> > possibility that we just bounce between mmap fault and pm-suspend.  So
> > long as some work can be done, for instance the pm-suspend occurs
> > asynchronously to the pm-put, this might be ok.
> >   
> 
>  We can do this. But in the normal use case, the situation should
>  never arise where user should access any mmaped region when user has
>  already put the device into D3 (D3hot or D3cold). This can only happen
>  if there is some bug in the guest driver or user is doing wrong
>  sequence. Do we need to add handling to officially support this part ?

We cannot rely on userspace drivers to be bug free or non-malicious,
but if we want to impose that an mmap access while low power is
enabled always triggers a fault, that's ok.

>  pm-get can take more than a second for resume for some devices and
>  will doing this in fault handler be safe ?
> 
>  Also, we will add this support only when wake-eventfd is provided so
>  still w/o wake-eventfd case, the mmap access will still generate fault.
>  So, we will have different behavior. Will that be acceptable ?

Let's keep it simple, generate a fault for all cases.
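
The always-fault rule settled on here reduces to a one-line check in
the fault path; fake_mmap_fault() and the pm_runtime_engaged flag are
illustrative stand-ins for the eventual driver logic:

```c
/* Model of the conclusion above: while user-initiated low power is
 * engaged, an mmap access faults rather than waking the device;
 * otherwise the mapping is populated normally. */
#include <assert.h>
#include <stdbool.h>

enum fake_fault { FAKE_FAULT_OK = 0, FAKE_FAULT_SIGBUS = 1 };

static enum fake_fault fake_mmap_fault(bool pm_runtime_engaged)
{
	if (pm_runtime_engaged)
		return FAKE_FAULT_SIGBUS;	/* no implicit wakeup */
	return FAKE_FAULT_OK;			/* populate the mapping */
}
```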

> >>> The
> >>> feature GET ioctl would be used to exit low power behavior and would be
> >>> a no-op if the wakeup eventfd had already been signaled.  Thanks,
> >>>    
> >>  
> >>  I will use the GET ioctl for low power exit instead of returning the
> >>  current status.  
> > 
> > Note that Yishai is proposing a device DMA dirty logging feature where
> > the stop and start are exposed via SET on separate features, rather
> > than SET/GET.  We should probably maintain some consistency between
> > these use cases.  Possibly we might even want two separate pm enter
> > ioctls, one with the wake eventfd and one without.  I think this is the
> > sort of thing Jason is describing for future expansion of the dirty
> > tracking uAPI.  Thanks,
> > 
> > Alex
> >   
> 
>  Okay. So, we need to add 3 device features in total.
> 
>  VFIO_DEVICE_FEATURE_PM_ENTRY
>  VFIO_DEVICE_FEATURE_PM_ENTRY_WITH_WAKEUP
>  VFIO_DEVICE_FEATURE_PM_EXIT
> 
>  And only the second one need structure which will have only one field
>  for eventfd and we need to return error if wakeup-eventfd is not
>  provided in the second feature ?

Yes, we'd use eventfd_ctx and fail on a bad fileget.

>  Do we need to support GET operation also for these ?
>  We can skip GET operation since that won’t be very useful.

What would they do?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/6] vfio: Add a new device feature for the power management
  2022-07-11 13:04           ` Alex Williamson
@ 2022-07-11 17:30             ` Abhishek Sahu
  0 siblings, 0 replies; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-11 17:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/11/2022 6:34 PM, Alex Williamson wrote:
> On Mon, 11 Jul 2022 15:13:13 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 7/8/2022 10:06 PM, Alex Williamson wrote:
>>> On Fri, 8 Jul 2022 15:09:22 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> On 7/6/2022 9:09 PM, Alex Williamson wrote:  
>>>>> On Fri, 1 Jul 2022 16:38:10 +0530
>>>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>>>     
>>>>>> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
>>>>>> for the power management in the header file. The implementation for the
>>>>>> same will be added in the subsequent patches.
>>>>>>
>>>>>> With the standard registers, all power states cannot be achieved. The
>>>>>> platform-based power management needs to be involved to go into the
>>>>>> lowest power state. For all the platform-based power management, this
>>>>>> device feature can be used.
>>>>>>
>>>>>> This device feature uses flags to specify the different operations. In
>>>>>> the future, if any more power management functionality is needed then
>>>>>> a new flag can be added to it. It supports both GET and SET operations.
>>>>>>
>>>>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>>>>>> ---
>>>>>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>>>>>>  1 file changed, 55 insertions(+)
>>>>>>
>>>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>>> index 733a1cddde30..7e00de5c21ea 100644
>>>>>> --- a/include/uapi/linux/vfio.h
>>>>>> +++ b/include/uapi/linux/vfio.h
>>>>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>>>>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>>>>>  };
>>>>>>  
>>>>>> +/*
>>>>>> + * Perform power management-related operations for the VFIO device.
>>>>>> + *
>>>>>> + * The low power feature uses platform-based power management to move the
>>>>>> + * device into the low power state.  This low power state is device-specific.
>>>>>> + *
>>>>>> + * This device feature uses flags to specify the different operations.
>>>>>> + * It supports both the GET and SET operations.
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
>>>>>> + *   state with platform-based power management.  This low power state will be
>>>>>> + *   internal to the VFIO driver and the user will not come to know which power
>>>>>> + *   state is chosen.  Once the user has moved the VFIO device into the low
>>>>>> + *   power state, then the user should not do any device access without moving
>>>>>> + *   the device out of the low power state.    
>>>>>
>>>>> Except we're wrapping device accesses to make this possible.  This
>>>>> should probably describe how any discrete access will wake the device
>>>>> but ongoing access through mmaps will generate user faults.
>>>>>     
>>>>
>>>>  Sure. I will add that details also.
>>>>  
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
>>>>>> + *    state.  This flag should only be set if the user has previously put the
>>>>>> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.    
>>>>>
>>>>> Indenting.
>>>>>     
>>>>  
>>>>  I will fix this.
>>>>  
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
>>>>>> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
>>>>>> + *   the host side, then the device will be moved out of the low power state
>>>>>> + *   without the user's guest driver involvement.  Some devices require the
>>>>>> + *   user's guest driver involvement for each low-power entry.  If this flag is
>>>>>> + *   set, then the re-entry to the low power state will be disabled, and the
>>>>>> + *   host kernel will not move the device again into the low power state.
>>>>>> + *   The VFIO driver internally maintains a list of devices for which low
>>>>>> + *   power re-entry is disabled by default and for those devices, the
>>>>>> + *   re-entry will be disabled even if the user has not set this flag
>>>>>> + *   explicitly.    
>>>>>
>>>>> Wrong polarity.  The kernel should not maintain the policy.  By default
>>>>> every wakeup, whether from host kernel accesses or via user accesses
>>>>> that do a pm-get should signal a wakeup to userspace.  Userspace needs
>>>>> to opt-out of that wakeup to let the kernel automatically re-enter low
>>>>> power and userspace needs to maintain the policy for which devices it
>>>>> wants that to occur.
>>>>>     
>>>>  
>>>>  Okay. So that means, in the kernel side, we don’t have to maintain
>>>>  the list which currently contains NVIDIA device ID. Also, in our
>>>>  updated approach, this opt-out of that wake-up means that user
>>>>  has not provided eventfd in the feature SET ioctl. Correct ?  
>>>
>>> Yes, I'm imagining that if the user hasn't provided a one-shot wake-up
>>> eventfd, that's the opt-out for being notified of device wakes.  For
>>> example, pm-resume would have something like:
>>>
>>> 	
>>> 	if (vdev->pm_wake_eventfd) {
>>> 		eventfd_signal(vdev->pm_wake_eventfd, 1);
>>> 		vdev->pm_wake_eventfd = NULL;
>>> 		pm_runtime_get_noresume(dev);
>>> 	}
>>>
>>> (eventfd pseudo handling substantially simplified)
>>>
>>> So w/o a wake-up eventfd, the user would need to call the pm feature
>>> exit ioctl to elevate the pm reference to prevent it going back to low
>>> power.  The pm feature exit ioctl would be optional if a wake eventfd is
>>> provided, so some piece of the eventfd context would need to remain to
>>> determine whether a pm-get is necessary.
>>>   
>>>>>> + *
>>>>>> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
>>>>>> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
>>>>>> + *
>>>>>> + * - If the device is in a normal power state currently, then
>>>>>> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
>>>>>> + *   power re-entry is disabled by default.  If the device is in the low power
>>>>>> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
>>>>>> + *   according to the current transition.    
>>>>>
>>>>> Very confusing semantics.
>>>>>
>>>>> What if the feature SET ioctl took an eventfd and that eventfd was one
>>>>> time use.  Calling the ioctl would setup the eventfd to notify the user
>>>>> on wakeup and call pm-put.  Any access to the device via host, ioctl,
>>>>> or region would be wrapped in pm-get/put and the pm-resume handler
>>>>> would perform the matching pm-get to balance the feature SET and signal
>>>>> the eventfd.     
>>>>
>>>>  This seems a better option. It will help in making the ioctl simpler
>>>>  and we don’t have to add a separate index for PME which I added in
>>>>  patch 6. 
>>>>  
>>>>> If the user opts-out by not providing a wakeup eventfd,
>>>>> then the pm-resume handler does not perform a pm-get. Possibly we
>>>>> could even allow mmap access if a wake-up eventfd is provided.    
>>>>
>>>>  Sorry, I am not clear on this mmap part. We currently invalidate
>>>>  the mappings before going into runtime suspend. Now, if the user
>>>>  tries to do mmap, do we need some extra handling in the fault
>>>>  handler? I need your help in understanding this part.  
>>>
>>> The option that I'm thinking about is if the mmap fault handler is
>>> wrapped in a pm-get/put then we could actually populate the mmap.  In
>>> the case where the pm-get triggers the wake-eventfd in pm-resume, the
>>> device doesn't return to low power when the mmap fault handler calls
>>> pm-put.  This possibly allows that we could actually invalidate mmaps on
>>> pm-suspend rather than in the pm feature enter ioctl, essentially the
>>> same as we're doing for intx.  I wonder though if this allows the
>>> possibility that we just bounce between mmap fault and pm-suspend.  So
>>> long as some work can be done, for instance the pm-suspend occurs
>>> asynchronously to the pm-put, this might be ok.
>>>   
>>
>>  We can do this. But in the normal use case, the situation should
>>  never arise where the user accesses any mmapped region after having
>>  already put the device into D3 (D3hot or D3cold). This can only
>>  happen if there is some bug in the guest driver or the user is
>>  performing a wrong sequence. Do we need to add handling to
>>  officially support this part?
> 
> We cannot rely on userspace drivers to be bug free or non-malicious,
> but if we want to impose that an mmap access while low power is
> enabled always triggers a fault, that's ok.
> 
>>  pm-get can take more than a second to resume some devices; will
>>  doing this in the fault handler be safe?
>>
>>  Also, we will add this support only when a wake-eventfd is provided,
>>  so in the case w/o a wake-eventfd, mmap access will still generate a
>>  fault. So we will have different behaviors. Will that be acceptable?
> 
> Let's keep it simple, generate a fault for all cases.
> 

 Thanks, Alex, for the confirmation.

>>>>> The
>>>>> feature GET ioctl would be used to exit low power behavior and would be
>>>>> a no-op if the wakeup eventfd had already been signaled.  Thanks,
>>>>>    
>>>>  
>>>>  I will use the GET ioctl for low power exit instead of returning the
>>>>  current status.  
>>>
>>> Note that Yishai is proposing a device DMA dirty logging feature where
>>> the stop and start are exposed via SET on separate features, rather
>>> than SET/GET.  We should probably maintain some consistency between
>>> these use cases.  Possibly we might even want two separate pm enter
>>> ioctls, one with the wake eventfd and one without.  I think this is the
>>> sort of thing Jason is describing for future expansion of the dirty
>>> tracking uAPI.  Thanks,
>>>
>>> Alex
>>>   
>>
>>  Okay. So, we need to add 3 device features in total.
>>
>>  VFIO_DEVICE_FEATURE_PM_ENTRY
>>  VFIO_DEVICE_FEATURE_PM_ENTRY_WITH_WAKEUP
>>  VFIO_DEVICE_FEATURE_PM_EXIT
>>
>>  And only the second one needs a structure, which will have only one
>>  field for the eventfd, and we need to return an error if the wakeup
>>  eventfd is not provided for the second feature?
> 
> Yes, we'd use eventfd_ctx and fail on a bad fileget.
> 
>>  Do we need to support GET operation also for these ?
>>  We can skip GET operation since that won’t be very useful.
> 
> What would they do?  Thanks,
> 
> Alex
> 

 If we implement the GET operation, it can return the current status.
 For example, VFIO_DEVICE_FEATURE_PM_ENTRY could return whether the
 user has previously put the device into low power. But this
 information is not very useful as such, and it requires adding a
 structure where this information would be filled in. Also, the GET
 would again cause the device to wake up. So, for these device
 features, we can support only the SET operation.

 I checked Yishai's DMA logging patches; there, start and stop seem to
 support only the SET operation, and a separate feature supports only
 the GET operation.

 Regards,
 Abhishek

Thread overview: 26+ messages
2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
2022-07-06 15:39   ` Alex Williamson
2022-07-08  9:21     ` Abhishek Sahu
2022-07-08 15:45       ` Alex Williamson
2022-07-11  9:18         ` Abhishek Sahu
2022-07-11 12:57           ` Alex Williamson
2022-07-01 11:08 ` [PATCH v4 2/6] vfio: Add a new device feature for the power management Abhishek Sahu
2022-07-06 15:39   ` Alex Williamson
2022-07-08  9:39     ` Abhishek Sahu
2022-07-08 16:36       ` Alex Williamson
2022-07-11  9:43         ` Abhishek Sahu
2022-07-11 13:04           ` Alex Williamson
2022-07-11 17:30             ` Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson
2022-07-08  9:43     ` Abhishek Sahu
2022-07-08 16:49       ` Alex Williamson
2022-07-11  9:50         ` Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson
2022-07-01 11:08 ` [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson
2022-07-01 11:08 ` [PATCH v4 6/6] vfio/pci: Add support for virtual PME Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson
2022-07-08  9:45     ` Abhishek Sahu
