linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/5] vfio/pci: power management changes
@ 2022-07-19 12:15 Abhishek Sahu
  2022-07-19 12:15 ` [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit Abhishek Sahu
                   ` (4 more replies)
  0 siblings, 5 replies; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-19 12:15 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

This is part 2 for the vfio-pci driver power management support.
Part 1 of this patch series was related to adding D3cold support
when there is no user of the VFIO device and has already merged in the
mainline kernel. If we enable the runtime power management for
vfio-pci device in the guest OS, then the device is being runtime
suspended (for linux guest OS) and the PCI device will be put into
D3hot state (in function vfio_pm_config_write()). If the D3cold
state can be used instead of D3hot, then it will help in saving
maximum power. The D3cold state can't be possible with native
PCI PM. It requires interaction with platform firmware which is
system-specific. To go into low power states (Including D3cold),
the runtime PM framework can be used which internally interacts
with PCI and platform firmware and puts the device into the
lowest possible D-States.

This patch series adds the support to engage runtime power management
initiated by the user. Since D3cold state can't be achieved by writing
PCI standard PM config registers, so new device features have been
added in DEVICE_FEATURE IOCTL for low power entry and exit related
handling. For the PCI device, this low power state will be D3cold
(if the platform supports the D3cold state). The hypervisors can implement
virtual ACPI methods to make the integration with guest OS.
For example, in guest Linux OS if PCI device ACPI node has
_PR3 and _PR0 power resources with _ON/_OFF method, then guest
Linux OS makes the _OFF call during D3cold transition and
then _ON during D0 transition. The hypervisor can tap these virtual
ACPI calls and then do the low power related IOCTL.

The entry device feature has two variants. These two variants are mainly
to support the different behaviour for the low power entry.
If there is any access for the VFIO device on the host side, then the
device will be moved out of the low power state without the user's
guest driver involvement. Some devices (for example NVIDIA VGA or
3D controller) require the user's guest driver involvement for
each low-power entry. In the first variant, the host can move the
device into low power without any guest driver involvement while
in the second variant, the host will send a notification to user
through eventfd and then user guest driver needs to move the device
into low power. The hypervisor can implement the virtual PME
support to notify the guest OS. Please refer
https://lore.kernel.org/lkml/20220701110814.7310-7-abhsahu@nvidia.com/
where initially this virtual PME was implemented in the vfio-pci driver
itself, but later-on, it has been decided that hypervisor can implement
this.

* Changes in v5

- Rebased patches on https://github.com/awilliam/linux-vfio/tree/next.
- Implemented 3 separate device features for the low power entry and exit.
- Dropped virtual PME patch.
- Removed the special handling for power management related device feature
  and now all the ioctls will be wrapped under pm_runtime_get/put.
- Refactored code around low power entry and exit.
- Removed all the policy related code and same can be implemented in the
  userspace.
- Renamed 'intx_masked' to 'masked_changed' and updated function comment.
- Changed the order of patches.

* Changes in v4
  (https://lore.kernel.org/lkml/20220701110814.7310-1-abhsahu@nvidia.com/)

- Rebased patches on v5.19-rc4.
- Added virtual PME support.
- Used flags for low power entry and exit instead of explicit variable.
- Add the support to keep NVIDIA display related controllers in active
  state if there is any activity on the host side.
- Add a flag that can be set by the user to keep the device in the active
  state if there is any activity on the host side.
- Split the D3cold patch into smaller patches.
- Kept the runtime PM usage count incremented for all the IOCTL
  (except power management IOCTL) and all the PCI region access.
- Masked the runtime errors behind -EIO.
- Refactored logic in runtime suspend/resume routine and for power
  management device feature IOCTL.
- Add helper function for pm_runtime_put() also in the
  drivers/vfio/vfio.c and use the 'struct vfio_device' for the
  function parameter.
- Removed the requirement to move the device into D3hot before calling
  low power entry.
- Renamed power management related new members in the structure.
- Used 'pm_runtime_engaged' check in __vfio_pci_memory_enabled().

* Changes in v3
  (https://lore.kernel.org/lkml/20220425092615.10133-1-abhsahu@nvidia.com)

- Rebased patches on v5.18-rc3.
- Marked this series as PATCH instead of RFC.
- Addressed the review comments given in v2.
- Removed the limitation to keep device in D0 state if there is any
  access from host side. This is specific to NVIDIA use case and
  will be handled separately.
- Used the existing DEVICE_FEATURE IOCTL itself instead of adding new
  IOCTL for power management.
- Removed all custom code related with power management in runtime
  suspend/resume callbacks and IOCTL handling. Now, the callbacks
  contain code related with INTx handling and few other stuffs and
  all the PCI state and platform PM handling will be done by PCI core
  functions itself.
- Add the support of wake-up in main vfio layer itself since now we have
  more vfio/pci based drivers.
- Instead of assigning the 'struct dev_pm_ops' in individual parent
  driver, now the vfio_pci_core tself assigns the 'struct dev_pm_ops'. 
- Added handling of power management around SR-IOV handling.
- Moved the setting of drvdata in a separate patch.
- Masked INTx before during runtime suspended state.
- Changed the order of patches so that Fix related things are at beginning
  of this patch series.
- Removed storing the power state locally and used one new boolean to
  track the d3 (D3cold and D3hot) power state 
- Removed check for IO access in D3 power state.
- Used another helper function vfio_lock_and_set_power_state() instead
  of touching vfio_pci_set_power_state().
- Considered the fixes made in
  https://lore.kernel.org/lkml/20220217122107.22434-1-abhsahu@nvidia.com
  and updated the patches accordingly.

* Changes in v2
  (https://lore.kernel.org/lkml/20220124181726.19174-1-abhsahu@nvidia.com)

- Rebased patches on v5.17-rc1.
- Included the patch to handle BAR access in D3cold.
- Included the patch to fix memory leak.
- Made a separate IOCTL that can be used to change the power state from
  D3hot to D3cold and D3cold to D0.
- Addressed the review comments given in v1.

* v1
  (https://lore.kernel.org/lkml/20211115133640.2231-1-abhsahu@nvidia.com)

Abhishek Sahu (5):
  vfio: Add the device features for the low power entry and exit
  vfio: Increment the runtime PM usage count during IOCTL call
  vfio/pci: Mask INTx during runtime suspend
  vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY/EXIT
  vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP

 drivers/vfio/pci/vfio_pci_core.c  | 256 ++++++++++++++++++++++++++++--
 drivers/vfio/pci/vfio_pci_intrs.c |   6 +-
 drivers/vfio/vfio.c               |  52 +++++-
 include/linux/vfio_pci_core.h     |   5 +-
 include/uapi/linux/vfio.h         |  55 +++++++
 5 files changed, 357 insertions(+), 17 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-07-19 12:15 [PATCH v5 0/5] vfio/pci: power management changes Abhishek Sahu
@ 2022-07-19 12:15 ` Abhishek Sahu
  2022-07-21 22:34   ` Alex Williamson
  2022-07-19 12:15 ` [PATCH v5 2/5] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-19 12:15 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

This patch adds the following new device features for the low
power entry and exit in the header file. The implementation for the
same will be added in the subsequent patches.

- VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
- VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
- VFIO_DEVICE_FEATURE_LOW_POWER_EXIT

With the standard registers, all power states cannot be achieved. The
platform-based power management needs to be involved to go into the
lowest power state. For doing low power entry and exit with
platform-based power management, these device features can be used.

The entry device feature has two variants. These two variants are mainly
to support the different behaviour for the low power entry.
If there is any access for the VFIO device on the host side, then the
device will be moved out of the low power state without the user's
guest driver involvement. Some devices (for example NVIDIA VGA or
3D controller) require the user's guest driver involvement for
each low-power entry. In the first variant, the host can move the
device into low power without any guest driver involvement while
in the second variant, the host will send a notification to the user
through eventfd and then the users guest driver needs to move
the device into low power.

These device features only support VFIO_DEVICE_FEATURE_SET operation.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 733a1cddde30..08fd3482d22b 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -986,6 +986,61 @@ enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
+ * with the platform-based power management.  This low power state will be
+ * internal to the VFIO driver and the user will not come to know which power
+ * state is chosen.  If any device access happens (either from the host or
+ * the guest) when the device is in the low power state, then the host will
+ * move the device out of the low power state first.  Once the access has been
+ * finished, then the host will move the device into the low power state again.
+ * If the user wants that the device should not go into the low power state
+ * again in this case, then the user should use the
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature for the
+ * low power entry.  The mmap'ed region access is not allowed till the low power
+ * exit happens through VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will
+ * generate the access fault.
+ */
+#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
+ * with the platform-based power management and provide support for the wake-up
+ * notifications through eventfd.  This low power state will be internal to the
+ * VFIO driver and the user will not come to know which power state is chosen.
+ * If any device access happens (either from the host or the guest) when the
+ * device is in the low power state, then the host will move the device out of
+ * the low power state first and a notification will be sent to the guest
+ * through eventfd.  Once the access is finished, the host will not move back
+ * the device into the low power state.  The guest should move the device into
+ * the low power state again upon receiving the wakeup notification.  The
+ * notification will be generated only if the device physically went into the
+ * low power state.  If the low power entry has been disabled from the host
+ * side, then the device will not go into the low power state even after
+ * calling this device feature and then the device access does not require
+ * wake-up.  The mmap'ed region access is not allowed till the low power exit
+ * happens.  The low power exit can happen either through
+ * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
+ * wake-up notification has been generated).
+ */
+struct vfio_device_low_power_entry_with_wakeup {
+	__s32 wakeup_eventfd;
+	__u32 reserved;
+};
+
+#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
+
+/*
+ * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device out of the low power
+ * state.  This device feature should be called only if the user has previously
+ * put the device into the low power state either with
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
+ * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature.  If the
+ * device is not in the low power state currently, this device feature will
+ * return early with the success status.
+ */
+#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 2/5] vfio: Increment the runtime PM usage count during IOCTL call
  2022-07-19 12:15 [PATCH v5 0/5] vfio/pci: power management changes Abhishek Sahu
  2022-07-19 12:15 ` [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit Abhishek Sahu
@ 2022-07-19 12:15 ` Abhishek Sahu
  2022-07-21 22:34   ` Alex Williamson
  2022-07-19 12:15 ` [PATCH v5 3/5] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-19 12:15 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

The vfio-pci based drivers will have runtime power management
support where the user can put the device into the low power state
and then PCI devices can go into the D3cold state. If the device is
in the low power state and the user issues any IOCTL, then the
device should be moved out of the low power state first. Once
the IOCTL is serviced, then it can go into the low power state again.
The runtime PM framework manages this with help of usage count.

One option was to add the runtime PM related API's inside vfio-pci
driver but some IOCTL (like VFIO_DEVICE_FEATURE) can follow a
different path and more IOCTL can be added in the future. Also, the
runtime PM will be added for vfio-pci based drivers variant currently,
but the other VFIO based drivers can use the same in the
future. So, this patch adds the runtime calls runtime-related API in
the top-level IOCTL function itself.

For the VFIO drivers which do not have runtime power management
support currently, the runtime PM API's won't be invoked. Only for
vfio-pci based drivers currently, the runtime PM API's will be invoked
to increment and decrement the usage count. In the vfio-pci drivers also,
the variant drivers can opt-out by incrementing the usage count during
device-open. The pm_runtime_resume_and_get() checks the device
current status and will return early if the device is already in the
ACTIVE state.

Taking this usage count incremented while servicing IOCTL will make
sure that the user won't put the device into the low power state when any
other IOCTL is being serviced in parallel. Let's consider the
following scenario:

 1. Some other IOCTL is called.
 2. The user has opened another device instance and called the IOCTL for
    low power entry.
 3. The low power entry IOCTL moves the device into the low power state.
 4. The other IOCTL finishes.

If we don't keep the usage count incremented then the device
access will happen between step 3 and 4 while the device has already
gone into the low power state.

The pm_runtime_resume_and_get() will be the first call so its error
should not be propagated to user space directly. For example, if
pm_runtime_resume_and_get() can return -EINVAL for the cases where the
user has passed the correct argument. So the
pm_runtime_resume_and_get() errors have been masked behind -EIO.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/vfio.c | 52 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 49 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index bd84ca7c5e35..1d005a0a9d3d 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -32,6 +32,7 @@
 #include <linux/vfio.h>
 #include <linux/wait.h>
 #include <linux/sched/signal.h>
+#include <linux/pm_runtime.h>
 #include "vfio.h"
 
 #define DRIVER_VERSION	"0.3"
@@ -1335,6 +1336,39 @@ static const struct file_operations vfio_group_fops = {
 	.release	= vfio_group_fops_release,
 };
 
+/*
+ * Wrapper around pm_runtime_resume_and_get().
+ * Return error code on failure or 0 on success.
+ */
+static inline int vfio_device_pm_runtime_get(struct vfio_device *device)
+{
+	struct device *dev = device->dev;
+
+	if (dev->driver && dev->driver->pm) {
+		int ret;
+
+		ret = pm_runtime_resume_and_get(dev);
+		if (ret < 0) {
+			dev_info_ratelimited(dev,
+				"vfio: runtime resume failed %d\n", ret);
+			return -EIO;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Wrapper around pm_runtime_put().
+ */
+static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
+{
+	struct device *dev = device->dev;
+
+	if (dev->driver && dev->driver->pm)
+		pm_runtime_put(dev);
+}
+
 /*
  * VFIO Device fd
  */
@@ -1649,15 +1683,27 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
 				       unsigned int cmd, unsigned long arg)
 {
 	struct vfio_device *device = filep->private_data;
+	int ret;
+
+	ret = vfio_device_pm_runtime_get(device);
+	if (ret)
+		return ret;
 
 	switch (cmd) {
 	case VFIO_DEVICE_FEATURE:
-		return vfio_ioctl_device_feature(device, (void __user *)arg);
+		ret = vfio_ioctl_device_feature(device, (void __user *)arg);
+		break;
+
 	default:
 		if (unlikely(!device->ops->ioctl))
-			return -EINVAL;
-		return device->ops->ioctl(device, cmd, arg);
+			ret = -EINVAL;
+		else
+			ret = device->ops->ioctl(device, cmd, arg);
+		break;
 	}
+
+	vfio_device_pm_runtime_put(device);
+	return ret;
 }
 
 static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 3/5] vfio/pci: Mask INTx during runtime suspend
  2022-07-19 12:15 [PATCH v5 0/5] vfio/pci: power management changes Abhishek Sahu
  2022-07-19 12:15 ` [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit Abhishek Sahu
  2022-07-19 12:15 ` [PATCH v5 2/5] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
@ 2022-07-19 12:15 ` Abhishek Sahu
  2022-07-19 12:15 ` [PATCH v5 4/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY/EXIT Abhishek Sahu
  2022-07-19 12:15 ` [PATCH v5 5/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP Abhishek Sahu
  4 siblings, 0 replies; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-19 12:15 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

This patch adds INTx handling during runtime suspend/resume.
All the suspend/resume related code for the user to put the device
into the low power state will be added in subsequent patches.

The INTx lines may be shared among devices. Whenever any INTx
interrupt comes for the VFIO devices, then vfio_intx_handler() will be
called for each device sharing the interrupt. Inside vfio_intx_handler(),
it calls pci_check_and_mask_intx() and checks if the interrupt has
been generated for the current device. Now, if the device is already
in the D3cold state, then the config space can not be read. Attempt
to read config space in D3cold state can cause system unresponsiveness
in a few systems. To prevent this, mask INTx in runtime suspend callback,
and unmask the same in runtime resume callback. If INTx has been already
masked, then no handling is needed in runtime suspend/resume callbacks.
'pm_intx_masked' tracks this, and vfio_pci_intx_mask() has been updated
to return true if the INTx vfio_pci_irq_ctx.masked value is changed
inside this function.

For the runtime suspend which is triggered for the no user of VFIO
device, the is_intx() will return false and these callbacks won't do
anything.

The MSI/MSI-X are not shared so similar handling should not be
needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal()
without doing any device-specific config access. When the user performs
any config access or IOCTL after receiving the eventfd notification,
then the device will be moved to the D0 state first before
servicing any request.

Another option was to check this flag 'pm_intx_masked' inside
vfio_intx_handler() instead of masking the interrupts. This flag
is being set inside the runtime_suspend callback but the device
can be in non-D3cold state (for example, if the user has disabled D3cold
explicitly by sysfs, the D3cold is not supported in the platform, etc.).
Also, in D3cold supported case, the device will be in D0 till the
PCI core moves the device into D3cold. In this case, there is
a possibility that the device can generate an interrupt. Adding check
in the IRQ handler will not clear the IRQ status and the interrupt
line will still be asserted. This can cause interrupt flooding.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c  | 37 +++++++++++++++++++++++++++----
 drivers/vfio/pci/vfio_pci_intrs.c |  6 ++++-
 include/linux/vfio_pci_core.h     |  3 ++-
 3 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 2efa06b1fafa..9517645acfa6 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	return ret;
 }
 
+#ifdef CONFIG_PM
+static int vfio_pci_core_runtime_suspend(struct device *dev)
+{
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
+
+	/*
+	 * If INTx is enabled, then mask INTx before going into the runtime
+	 * suspended state and unmask the same in the runtime resume.
+	 * If INTx has already been masked by the user, then
+	 * vfio_pci_intx_mask() will return false and in that case, INTx
+	 * should not be unmasked in the runtime resume.
+	 */
+	vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev));
+
+	return 0;
+}
+
+static int vfio_pci_core_runtime_resume(struct device *dev)
+{
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
+
+	if (vdev->pm_intx_masked)
+		vfio_pci_intx_unmask(vdev);
+
+	return 0;
+}
+#endif /* CONFIG_PM */
+
 /*
- * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
- * so use structure without any callbacks.
- *
  * The pci-driver core runtime PM routines always save the device state
  * before going into suspended state. If the device is going into low power
  * state with only with runtime PM ops, then no explicit handling is needed
  * for the devices which have NoSoftRst-.
  */
-static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
+static const struct dev_pm_ops vfio_pci_core_pm_ops = {
+	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
+			   vfio_pci_core_runtime_resume,
+			   NULL)
+};
 
 int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 {
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 6069a11fb51a..8b805d5d19e1 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
 		eventfd_signal(vdev->ctx[0].trigger, 1);
 }
 
-void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
+/* Returns true if the INTx vfio_pci_irq_ctx.masked value is changed. */
+bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	unsigned long flags;
+	bool masked_changed = false;
 
 	spin_lock_irqsave(&vdev->irqlock, flags);
 
@@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 			disable_irq_nosync(pdev->irq);
 
 		vdev->ctx[0].masked = true;
+		masked_changed = true;
 	}
 
 	spin_unlock_irqrestore(&vdev->irqlock, flags);
+	return masked_changed;
 }
 
 /*
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 22de2bce6394..e96cc3081236 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -124,6 +124,7 @@ struct vfio_pci_core_device {
 	bool			needs_reset;
 	bool			nointx;
 	bool			needs_pm_restore;
+	bool			pm_intx_masked;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
@@ -147,7 +148,7 @@ struct vfio_pci_core_device {
 #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
 #define irq_is(vdev, type) (vdev->irq_type == type)
 
-void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
+bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
 void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
 
 int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 4/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY/EXIT
  2022-07-19 12:15 [PATCH v5 0/5] vfio/pci: power management changes Abhishek Sahu
                   ` (2 preceding siblings ...)
  2022-07-19 12:15 ` [PATCH v5 3/5] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
@ 2022-07-19 12:15 ` Abhishek Sahu
  2022-07-21 22:34   ` Alex Williamson
  2022-07-19 12:15 ` [PATCH v5 5/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP Abhishek Sahu
  4 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-19 12:15 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

Currently, if the runtime power management is enabled for vfio-pci
based devices in the guest OS, then the guest OS will do the register
write for PCI_PM_CTRL register. This write request will be handled in
vfio_pm_config_write() where it will do the actual register write of
PCI_PM_CTRL register. With this, the maximum D3hot state can be
achieved for low power. If we can use the runtime PM framework, then
we can achieve the D3cold state (on the supported systems) which will
help in saving maximum power.

1. D3cold state can't be achieved by writing PCI standard
   PM config registers. This patch implements the following
   newly added low power related device features:
    - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
    - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT

   The VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY will move the device into
   the low power state, and the VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
   will move the device out of the low power state.

2. The vfio-pci driver uses runtime PM framework for low power entry and
   exit. On the platforms where D3cold state is supported, the runtime
   PM framework will put the device into D3cold otherwise, D3hot or some
   other power state will be used. If the user has explicitly disabled
   runtime PM for the device, then the device will be in the power state
   configured by the guest OS through PCI_PM_CTRL.

3. The hypervisors can implement virtual ACPI methods. For example,
   in guest linux OS if PCI device ACPI node has _PR3 and _PR0 power
   resources with _ON/_OFF method, then guest linux OS invokes
   the _OFF method during D3cold transition and then _ON during D0
   transition. The hypervisor can tap these virtual ACPI calls and then
   call the low power device feature IOCTL.

4. The 'pm_runtime_engaged' flag tracks the entry and exit to
   runtime PM. This flag is protected with 'memory_lock' semaphore.

5. All the config and other region access are wrapped under
   pm_runtime_resume_and_get() and pm_runtime_put(). So, if any
   device access happens while the device is in the runtime suspended
   state, then the device will be resumed first before access. Once the
   access has been finished, then the device will again go into the
   runtime suspended state.

6. The memory region access through mmap will not be allowed in the low
   power state. Since __vfio_pci_memory_enabled() is a common function,
   so check for 'pm_runtime_engaged' has been added explicitly in
   vfio_pci_mmap_fault() to block only mmap'ed access.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 151 +++++++++++++++++++++++++++++--
 include/linux/vfio_pci_core.h    |   1 +
 2 files changed, 144 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 9517645acfa6..726a6f282496 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -259,11 +259,98 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	return ret;
 }
 
+static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
+{
+	/*
+	 * The vdev power related flags are protected with 'memory_lock'
+	 * semaphore.
+	 */
+	vfio_pci_zap_and_down_write_memory_lock(vdev);
+	if (vdev->pm_runtime_engaged) {
+		up_write(&vdev->memory_lock);
+		return -EINVAL;
+	}
+
+	vdev->pm_runtime_engaged = true;
+	pm_runtime_put_noidle(&vdev->pdev->dev);
+	up_write(&vdev->memory_lock);
+
+	return 0;
+}
+
+static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
+				  void __user *arg, size_t argsz)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	int ret;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
+	if (ret != 1)
+		return ret;
+
+	/*
+	 * Inside vfio_pci_runtime_pm_entry(), only the runtime PM usage count
+	 * will be decremented. The pm_runtime_put() will be invoked again
+	 * while returning from the ioctl and then the device can go into
+	 * runtime suspended state.
+	 */
+	return vfio_pci_runtime_pm_entry(vdev);
+}
+
+static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
+{
+	/*
+	 * The vdev power related flags are protected with 'memory_lock'
+	 * semaphore.
+	 */
+	down_write(&vdev->memory_lock);
+	if (vdev->pm_runtime_engaged) {
+		vdev->pm_runtime_engaged = false;
+		pm_runtime_get_noresume(&vdev->pdev->dev);
+	}
+
+	up_write(&vdev->memory_lock);
+}
+
+static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
+				 void __user *arg, size_t argsz)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	int ret;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
+	if (ret != 1)
+		return ret;
+
+	/*
+	 * The device should already be resumed by the vfio core layer.
+	 * vfio_pci_runtime_pm_exit() will internally increment the usage
+	 * count corresponding to pm_runtime_put() called during low power
+	 * feature entry.
+	 */
+	vfio_pci_runtime_pm_exit(vdev);
+	return 0;
+}
+
 #ifdef CONFIG_PM
 static int vfio_pci_core_runtime_suspend(struct device *dev)
 {
 	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
 
+	down_write(&vdev->memory_lock);
+	/*
+	 * The user can move the device into D3hot state before invoking
+	 * power management IOCTL. Move the device into D0 state here and then
+	 * the pci-driver core runtime PM suspend function will move the device
+	 * into the low power state. Also, for the devices which have
+	 * NoSoftRst-, it will help in restoring the original state
+	 * (saved locally in 'vdev->pm_save').
+	 */
+	vfio_pci_set_power_state(vdev, PCI_D0);
+	up_write(&vdev->memory_lock);
+
 	/*
 	 * If INTx is enabled, then mask INTx before going into the runtime
 	 * suspended state and unmask the same in the runtime resume.
@@ -393,6 +480,18 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 
 	/*
 	 * This function can be invoked while the power state is non-D0.
+	 * This non-D0 power state can be with or without runtime PM.
+	 * vfio_pci_runtime_pm_exit() will internally increment the usage
+	 * count corresponding to pm_runtime_put() called during low power
+	 * feature entry and then pm_runtime_resume() will wake up the device,
+	 * if the device has already gone into the suspended state. Otherwise,
+	 * the vfio_pci_set_power_state() will change the device power state
+	 * to D0.
+	 */
+	vfio_pci_runtime_pm_exit(vdev);
+	pm_runtime_resume(&pdev->dev);
+
+	/*
 	 * This function calls __pci_reset_function_locked() which internally
 	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
 	 * fail if the power state is non-D0. Also, for the devices which
@@ -1224,6 +1323,10 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
 		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
+		return vfio_pci_core_pm_entry(device, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
+		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
@@ -1234,31 +1337,47 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite)
 {
 	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	int ret;
 
 	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
 		return -EINVAL;
 
+	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
+	if (ret < 0) {
+		pci_info_ratelimited(vdev->pdev, "runtime resume failed %d\n",
+				     ret);
+		return -EIO;
+	}
+
 	switch (index) {
 	case VFIO_PCI_CONFIG_REGION_INDEX:
-		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
+		ret = vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
+		break;
 
 	case VFIO_PCI_ROM_REGION_INDEX:
 		if (iswrite)
-			return -EINVAL;
-		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
+			ret = -EINVAL;
+		else
+			ret = vfio_pci_bar_rw(vdev, buf, count, ppos, false);
+		break;
 
 	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
-		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
+		ret = vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
+		break;
 
 	case VFIO_PCI_VGA_REGION_INDEX:
-		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
+		ret = vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
+		break;
+
 	default:
 		index -= VFIO_PCI_NUM_REGIONS;
-		return vdev->region[index].ops->rw(vdev, buf,
+		ret = vdev->region[index].ops->rw(vdev, buf,
 						   count, ppos, iswrite);
+		break;
 	}
 
-	return -EINVAL;
+	pm_runtime_put(&vdev->pdev->dev);
+	return ret;
 }
 
 ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
@@ -1453,7 +1572,11 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 	mutex_lock(&vdev->vma_lock);
 	down_read(&vdev->memory_lock);
 
-	if (!__vfio_pci_memory_enabled(vdev)) {
+	/*
+	 * Memory region cannot be accessed if the low power feature is engaged
+	 * or memory access is disabled.
+	 */
+	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev)) {
 		ret = VM_FAULT_SIGBUS;
 		goto up_out;
 	}
@@ -2164,6 +2287,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		goto err_unlock;
 	}
 
+	/*
+	 * Some of the devices in the dev_set can be in the runtime suspended
+	 * state. Increment the usage count for all the devices in the dev_set
+	 * before reset and decrement the same after reset.
+	 */
+	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
+	if (ret)
+		goto err_unlock;
+
 	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
 		/*
 		 * Test whether all the affected devices are contained by the
@@ -2219,6 +2351,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		else
 			mutex_unlock(&cur->vma_lock);
 	}
+
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
+		pm_runtime_put(&cur->pdev->dev);
 err_unlock:
 	mutex_unlock(&dev_set->lock);
 	return ret;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index e96cc3081236..7ec81271bd05 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -125,6 +125,7 @@ struct vfio_pci_core_device {
 	bool			nointx;
 	bool			needs_pm_restore;
 	bool			pm_intx_masked;
+	bool			pm_runtime_engaged;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 5/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
  2022-07-19 12:15 [PATCH v5 0/5] vfio/pci: power management changes Abhishek Sahu
                   ` (3 preceding siblings ...)
  2022-07-19 12:15 ` [PATCH v5 4/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY/EXIT Abhishek Sahu
@ 2022-07-19 12:15 ` Abhishek Sahu
  2022-07-21 22:34   ` Alex Williamson
  4 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-19 12:15 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

This patch implements VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
device feature. In the VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY, if there is
any access for the VFIO device on the host side, then the device will
be moved out of the low power state without the user's guest driver
involvement. Once the device access has been finished, then the device
will be moved again into low power state. With the low power
entry happened through VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP,
the device will not be moved back into the low power state and
a notification will be sent to the user by triggering wakeup eventfd.

vfio_pci_core_pm_entry() will be called for both the variants of low
power feature entry so add an extra argument for wakeup eventfd context
and store locally in 'struct vfio_pci_core_device'.

For the entry happened without wakeup eventfd, all the exit related
handling will be done by the LOW_POWER_EXIT device feature only.
When the LOW_POWER_EXIT will be called, then the vfio core layer
vfio_device_pm_runtime_get() will increment the usage count and will
resume the device. In the driver runtime_resume callback,
the 'pm_wake_eventfd_ctx' will be NULL so the vfio_pci_runtime_pm_exit()
will return early. Then vfio_pci_core_pm_exit() will again call
vfio_pci_runtime_pm_exit() and now the exit related handling will be done.

For the entry happened with wakeup eventfd, in the driver resume
callback, eventfd will be triggered and all the exit related handling will
be done. When vfio_pci_runtime_pm_exit() will be called by
vfio_pci_core_pm_exit(), then it will return early. But if the user has
disabled the runtime PM on the host side, the device will never go
runtime suspended state and in this case, all the exit related handling
will be done during vfio_pci_core_pm_exit() only. Also, the eventfd will
not be triggered since the device power state has not been changed by the
host driver.

For vfio_pci_core_disable() also, all the exit related handling
needs to be done if user has closed the device after putting into
low power. In this case eventfd will not be triggered since
the device close has been initiated by the user only.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 78 ++++++++++++++++++++++++++++++--
 include/linux/vfio_pci_core.h    |  1 +
 2 files changed, 74 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 726a6f282496..dbe942bcaa67 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -259,7 +259,8 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	return ret;
 }
 
-static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
+static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
+				     struct eventfd_ctx *efdctx)
 {
 	/*
 	 * The vdev power related flags are protected with 'memory_lock'
@@ -272,6 +273,7 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
 	}
 
 	vdev->pm_runtime_engaged = true;
+	vdev->pm_wake_eventfd_ctx = efdctx;
 	pm_runtime_put_noidle(&vdev->pdev->dev);
 	up_write(&vdev->memory_lock);
 
@@ -295,21 +297,67 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
 	 * while returning from the ioctl and then the device can go into
 	 * runtime suspended state.
 	 */
-	return vfio_pci_runtime_pm_entry(vdev);
+	return vfio_pci_runtime_pm_entry(vdev, NULL);
 }
 
-static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
+static int
+vfio_pci_core_pm_entry_with_wakeup(struct vfio_device *device, u32 flags,
+				   void __user *arg, size_t argsz)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	struct vfio_device_low_power_entry_with_wakeup entry;
+	struct eventfd_ctx *efdctx;
+	int ret;
+
+	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
+				 sizeof(entry));
+	if (ret != 1)
+		return ret;
+
+	if (copy_from_user(&entry, arg, sizeof(entry)))
+		return -EFAULT;
+
+	if (entry.wakeup_eventfd < 0)
+		return -EINVAL;
+
+	efdctx = eventfd_ctx_fdget(entry.wakeup_eventfd);
+	if (IS_ERR(efdctx))
+		return PTR_ERR(efdctx);
+
+	ret = vfio_pci_runtime_pm_entry(vdev, efdctx);
+	if (ret)
+		eventfd_ctx_put(efdctx);
+
+	return ret;
+}
+
+static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev,
+				     bool resume_callback)
 {
 	/*
 	 * The vdev power related flags are protected with 'memory_lock'
 	 * semaphore.
 	 */
 	down_write(&vdev->memory_lock);
+	if (resume_callback && !vdev->pm_wake_eventfd_ctx) {
+		up_write(&vdev->memory_lock);
+		return;
+	}
+
 	if (vdev->pm_runtime_engaged) {
 		vdev->pm_runtime_engaged = false;
 		pm_runtime_get_noresume(&vdev->pdev->dev);
 	}
 
+	if (vdev->pm_wake_eventfd_ctx) {
+		if (resume_callback)
+			eventfd_signal(vdev->pm_wake_eventfd_ctx, 1);
+
+		eventfd_ctx_put(vdev->pm_wake_eventfd_ctx);
+		vdev->pm_wake_eventfd_ctx = NULL;
+	}
+
 	up_write(&vdev->memory_lock);
 }
 
@@ -329,8 +377,18 @@ static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
 	 * vfio_pci_runtime_pm_exit() will internally increment the usage
 	 * count corresponding to pm_runtime_put() called during low power
 	 * feature entry.
+	 *
+	 * For the low power entry happened with wakeup eventfd, there will
+	 * be two cases:
+	 *
+	 * 1. The device has gone into runtime suspended state. In this case,
+	 *    the runtime resume by the vfio core layer should already have
+	 *    performed all exit related handling and the
+	 *    vfio_pci_runtime_pm_exit() will return early.
+	 * 2. The device was in runtime active state. In this case, the
+	 *    vfio_pci_runtime_pm_exit() will do all the required handling.
 	 */
-	vfio_pci_runtime_pm_exit(vdev);
+	vfio_pci_runtime_pm_exit(vdev, false);
 	return 0;
 }
 
@@ -370,6 +428,13 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
 	if (vdev->pm_intx_masked)
 		vfio_pci_intx_unmask(vdev);
 
+	/*
+	 * Only for the low power entry happened with wakeup eventfd,
+	 * the vfio_pci_runtime_pm_exit() will perform exit related handling
+	 * and will trigger eventfd. For the other cases, it will return early.
+	 */
+	vfio_pci_runtime_pm_exit(vdev, true);
+
 	return 0;
 }
 #endif /* CONFIG_PM */
@@ -488,7 +553,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	 * the vfio_pci_set_power_state() will change the device power state
 	 * to D0.
 	 */
-	vfio_pci_runtime_pm_exit(vdev);
+	vfio_pci_runtime_pm_exit(vdev, false);
 	pm_runtime_resume(&pdev->dev);
 
 	/*
@@ -1325,6 +1390,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 		return vfio_pci_core_feature_token(device, flags, arg, argsz);
 	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
 		return vfio_pci_core_pm_entry(device, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP:
+		return vfio_pci_core_pm_entry_with_wakeup(device, flags,
+							  arg, argsz);
 	case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
 		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
 	default:
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 7ec81271bd05..fb25214e85c8 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -131,6 +131,7 @@ struct vfio_pci_core_device {
 	int			ioeventfds_nr;
 	struct eventfd_ctx	*err_trigger;
 	struct eventfd_ctx	*req_trigger;
+	struct eventfd_ctx	*pm_wake_eventfd_ctx;
 	struct list_head	dummy_resources_list;
 	struct mutex		ioeventfds_lock;
 	struct list_head	ioeventfds_list;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 5/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
  2022-07-19 12:15 ` [PATCH v5 5/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP Abhishek Sahu
@ 2022-07-21 22:34   ` Alex Williamson
  2022-07-25 15:04     ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-21 22:34 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, 19 Jul 2022 17:45:23 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> This patch implements VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
> device feature. In the VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY, if there is
> any access for the VFIO device on the host side, then the device will
> be moved out of the low power state without the user's guest driver
> involvement. Once the device access has been finished, then the device
> will be moved again into low power state. With the low power
> entry happened through VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP,
> the device will not be moved back into the low power state and
> a notification will be sent to the user by triggering wakeup eventfd.
> 
> vfio_pci_core_pm_entry() will be called for both the variants of low
> power feature entry so add an extra argument for wakeup eventfd context
> and store locally in 'struct vfio_pci_core_device'.
> 
> For the entry happened without wakeup eventfd, all the exit related
> handling will be done by the LOW_POWER_EXIT device feature only.
> When the LOW_POWER_EXIT will be called, then the vfio core layer
> vfio_device_pm_runtime_get() will increment the usage count and will
> resume the device. In the driver runtime_resume callback,
> the 'pm_wake_eventfd_ctx' will be NULL so the vfio_pci_runtime_pm_exit()
> will return early. Then vfio_pci_core_pm_exit() will again call
> vfio_pci_runtime_pm_exit() and now the exit related handling will be done.
> 
> For the entry happened with wakeup eventfd, in the driver resume
> callback, eventfd will be triggered and all the exit related handling will
> be done. When vfio_pci_runtime_pm_exit() will be called by
> vfio_pci_core_pm_exit(), then it will return early. But if the user has
> disabled the runtime PM on the host side, the device will never go
> runtime suspended state and in this case, all the exit related handling
> will be done during vfio_pci_core_pm_exit() only. Also, the eventfd will
> not be triggered since the device power state has not been changed by the
> host driver.
> 
> For vfio_pci_core_disable() also, all the exit related handling
> needs to be done if user has closed the device after putting into
> low power. In this case eventfd will not be triggered since
> the device close has been initiated by the user only.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 78 ++++++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h    |  1 +
>  2 files changed, 74 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 726a6f282496..dbe942bcaa67 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -259,7 +259,8 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	return ret;
>  }
>  
> -static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
> +static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
> +				     struct eventfd_ctx *efdctx)
>  {
>  	/*
>  	 * The vdev power related flags are protected with 'memory_lock'
> @@ -272,6 +273,7 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
>  	}
>  
>  	vdev->pm_runtime_engaged = true;
> +	vdev->pm_wake_eventfd_ctx = efdctx;
>  	pm_runtime_put_noidle(&vdev->pdev->dev);
>  	up_write(&vdev->memory_lock);
>  
> @@ -295,21 +297,67 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
>  	 * while returning from the ioctl and then the device can go into
>  	 * runtime suspended state.
>  	 */
> -	return vfio_pci_runtime_pm_entry(vdev);
> +	return vfio_pci_runtime_pm_entry(vdev, NULL);
>  }
>  
> -static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
> +static int
> +vfio_pci_core_pm_entry_with_wakeup(struct vfio_device *device, u32 flags,
> +				   void __user *arg, size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct vfio_device_low_power_entry_with_wakeup entry;
> +	struct eventfd_ctx *efdctx;
> +	int ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
> +				 sizeof(entry));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (copy_from_user(&entry, arg, sizeof(entry)))
> +		return -EFAULT;
> +
> +	if (entry.wakeup_eventfd < 0)
> +		return -EINVAL;
> +
> +	efdctx = eventfd_ctx_fdget(entry.wakeup_eventfd);
> +	if (IS_ERR(efdctx))
> +		return PTR_ERR(efdctx);
> +
> +	ret = vfio_pci_runtime_pm_entry(vdev, efdctx);
> +	if (ret)
> +		eventfd_ctx_put(efdctx);
> +
> +	return ret;
> +}
> +
> +static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev,
> +				     bool resume_callback)
>  {
>  	/*
>  	 * The vdev power related flags are protected with 'memory_lock'
>  	 * semaphore.
>  	 */
>  	down_write(&vdev->memory_lock);
> +	if (resume_callback && !vdev->pm_wake_eventfd_ctx) {
> +		up_write(&vdev->memory_lock);
> +		return;
> +	}
> +
>  	if (vdev->pm_runtime_engaged) {
>  		vdev->pm_runtime_engaged = false;
>  		pm_runtime_get_noresume(&vdev->pdev->dev);
>  	}
>  
> +	if (vdev->pm_wake_eventfd_ctx) {
> +		if (resume_callback)
> +			eventfd_signal(vdev->pm_wake_eventfd_ctx, 1);
> +
> +		eventfd_ctx_put(vdev->pm_wake_eventfd_ctx);
> +		vdev->pm_wake_eventfd_ctx = NULL;
> +	}
> +
>  	up_write(&vdev->memory_lock);
>  }
>  

I find the pm_exit handling here confusing.  We only have one caller
that can signal the eventfd, so it seems cleaner to me to have that
caller do the eventfd signal.  We can then remove the arg to pm_exit
and pull the core of it out to a pre-locked function for that call
path.  Sometime like below (applies on top of this patch).  Also moved
the intx unmasking until after the eventfd signaling.  What do you
think?  Thanks,

Alex

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index dbe942bcaa67..93169b7d6da2 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -332,32 +332,27 @@ vfio_pci_core_pm_entry_with_wakeup(struct vfio_device *device, u32 flags,
 	return ret;
 }
 
-static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev,
-				     bool resume_callback)
+static void __vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
 {
-	/*
-	 * The vdev power related flags are protected with 'memory_lock'
-	 * semaphore.
-	 */
-	down_write(&vdev->memory_lock);
-	if (resume_callback && !vdev->pm_wake_eventfd_ctx) {
-		up_write(&vdev->memory_lock);
-		return;
-	}
-
 	if (vdev->pm_runtime_engaged) {
 		vdev->pm_runtime_engaged = false;
 		pm_runtime_get_noresume(&vdev->pdev->dev);
-	}
-
-	if (vdev->pm_wake_eventfd_ctx) {
-		if (resume_callback)
-			eventfd_signal(vdev->pm_wake_eventfd_ctx, 1);
 
-		eventfd_ctx_put(vdev->pm_wake_eventfd_ctx);
-		vdev->pm_wake_eventfd_ctx = NULL;
+		if (vdev->pm_wake_eventfd_ctx) {
+			eventfd_ctx_put(vdev->pm_wake_eventfd_ctx);
+			vdev->pm_wake_eventfd_ctx = NULL;
+		}
 	}
+}
 
+static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
+{
+	/*
+	 * The vdev power related flags are protected with 'memory_lock'
+	 * semaphore.
+	 */
+	down_write(&vdev->memory_lock);
+	__vfio_pci_runtime_pm_exit(vdev);
 	up_write(&vdev->memory_lock);
 }
 
@@ -373,22 +368,13 @@ static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
 		return ret;
 
 	/*
-	 * The device should already be resumed by the vfio core layer.
-	 * vfio_pci_runtime_pm_exit() will internally increment the usage
-	 * count corresponding to pm_runtime_put() called during low power
-	 * feature entry.
-	 *
-	 * For the low power entry happened with wakeup eventfd, there will
-	 * be two cases:
-	 *
-	 * 1. The device has gone into runtime suspended state. In this case,
-	 *    the runtime resume by the vfio core layer should already have
-	 *    performed all exit related handling and the
-	 *    vfio_pci_runtime_pm_exit() will return early.
-	 * 2. The device was in runtime active state. In this case, the
-	 *    vfio_pci_runtime_pm_exit() will do all the required handling.
+	 * The device is always in the active state here due to pm wrappers
+	 * around ioctls.  If the device had entered a low power state and
+	 * pm_wake_eventfd_ctx is valid, vfio_pci_core_runtime_resume() has 
+	 * already signaled the eventfd and exited low power mode itself.
+	 * pm_runtime_engaged protects the redundant call here.
 	 */
-	vfio_pci_runtime_pm_exit(vdev, false);
+	vfio_pci_runtime_pm_exit(vdev);
 	return 0;
 }
 
@@ -425,15 +411,19 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
 {
 	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
 
-	if (vdev->pm_intx_masked)
-		vfio_pci_intx_unmask(vdev);
-
 	/*
-	 * Only for the low power entry happened with wakeup eventfd,
-	 * the vfio_pci_runtime_pm_exit() will perform exit related handling
-	 * and will trigger eventfd. For the other cases, it will return early.
+	 * Resume with a pm_wake_eventfd_ctx signals the eventfd and exits
+	 * low power mode.
 	 */
-	vfio_pci_runtime_pm_exit(vdev, true);
+	down_write(&vdev->memory_lock);
+	if (vdev->pm_wake_eventfd_ctx) {
+		eventfd_signal(vdev->pm_wake_eventfd_ctx, 1);
+		__vfio_pci_runtime_pm_exit(vdev);
+	}
+	up_write(&vdev->memory_lock);
+
+	if (vdev->pm_intx_masked)
+		vfio_pci_intx_unmask(vdev);
 
 	return 0;
 }
@@ -553,7 +543,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	 * the vfio_pci_set_power_state() will change the device power state
 	 * to D0.
 	 */
-	vfio_pci_runtime_pm_exit(vdev, false);
+	vfio_pci_runtime_pm_exit(vdev);
 	pm_runtime_resume(&pdev->dev);
 
 	/*


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-07-19 12:15 ` [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit Abhishek Sahu
@ 2022-07-21 22:34   ` Alex Williamson
  2022-07-25 14:40     ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-21 22:34 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, 19 Jul 2022 17:45:19 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> This patch adds the following new device features for the low
> power entry and exit in the header file. The implementation for the
> same will be added in the subsequent patches.
> 
> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
> - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
> 
> With the standard registers, all power states cannot be achieved. The

We're talking about standard PCI PM registers here, let's make that
clear since we're adding a device agnostic interface here.

> platform-based power management needs to be involved to go into the
> lowest power state. For doing low power entry and exit with
> platform-based power management, these device features can be used.
> 
> The entry device feature has two variants. These two variants are mainly
> to support the different behaviour for the low power entry.
> If there is any access for the VFIO device on the host side, then the
> device will be moved out of the low power state without the user's
> guest driver involvement. Some devices (for example NVIDIA VGA or
> 3D controller) require the user's guest driver involvement for
> each low-power entry. In the first variant, the host can move the
> device into low power without any guest driver involvement while

Perhaps, "In the first variant, the host can return the device to low
power automatically.  The device will continue to attempt to reach low
power until the low power exit feature is called."

> in the second variant, the host will send a notification to the user
> through eventfd and then the users guest driver needs to move
> the device into low power.

"In the second variant, if the device exits low power due to an access,
the host kernel will signal the user via the provided eventfd and will
not return the device to low power without a subsequent call to one of
the low power entry features.  A call to the low power exit feature is
optional if the user provided eventfd is signaled."
 
> These device features only support VFIO_DEVICE_FEATURE_SET operation.

And PROBE.

> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 733a1cddde30..08fd3482d22b 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>  };
>  
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
> + * with the platform-based power management.  This low power state will be

This is really "allow the device to be moved into a low power state"
rather than actually "move the device into" such a state though, right?

> + * internal to the VFIO driver and the user will not come to know which power
> + * state is chosen.  If any device access happens (either from the host or
> + * the guest) when the device is in the low power state, then the host will
> + * move the device out of the low power state first.  Once the access has been
> + * finished, then the host will move the device into the low power state again.
> + * If the user wants that the device should not go into the low power state
> + * again in this case, then the user should use the
> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature for the

This should probably just read "For single shot low power support with
wake-up notification, see
VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below."

> + * low power entry.  The mmap'ed region access is not allowed till the low power
> + * exit happens through VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will
> + * generate the access fault.
> + */
> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3

Note that Yishai's DMA logging set is competing for the same feature
entries.  We'll need to coordinate.

> +
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
> + * with the platform-based power management and provide support for the wake-up
> + * notifications through eventfd.  This low power state will be internal to the
> + * VFIO driver and the user will not come to know which power state is chosen.
> + * If any device access happens (either from the host or the guest) when the
> + * device is in the low power state, then the host will move the device out of
> + * the low power state first and a notification will be sent to the guest
> + * through eventfd.  Once the access is finished, the host will not move back
> + * the device into the low power state.  The guest should move the device into
> + * the low power state again upon receiving the wakeup notification.  The
> + * notification will be generated only if the device physically went into the
> + * low power state.  If the low power entry has been disabled from the host
> + * side, then the device will not go into the low power state even after
> + * calling this device feature and then the device access does not require
> + * wake-up.  The mmap'ed region access is not allowed till the low power exit
> + * happens.  The low power exit can happen either through
> + * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
> + * wake-up notification has been generated).

Seems this could leverage a lot more from the previous, simply stating
that this has the same behavior as VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
with the exception that the user provides an eventfd for notification
when the device resumes from low power and will not try to re-enter a
low power state without a subsequent user call to one of the low power
entry feature ioctls.  It might also be worth covering the fact that
device accesses by the user, including region and ioctl access, will
also trigger the eventfd if that access triggers a resume.

As I'm thinking about this, that latter clause is somewhat subtle.
AIUI a user can call the low power entry with wakeup feature and
proceed to do various ioctl and region (not mmap) accesses that could
perpetually keep the device awake, or there may be dependent devices
such that the device may never go to low power.  It needs to be very
clear that only if the wakeup eventfd has the device entered into and
exited a low power state making the low power exit ioctl optional.

> + */
> +struct vfio_device_low_power_entry_with_wakeup {
> +	__s32 wakeup_eventfd;
> +	__u32 reserved;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
> +
> +/*
> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device out of the low power
> + * state.

Any ioctl effectively does that, the key here is that the low power
state may not be re-entered after this ioctl.

>  This device feature should be called only if the user has previously
> + * put the device into the low power state either with
> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature.  If the

Doesn't really seem worth mentioning, we need to protect against misuse
regardless.

> + * device is not in the low power state currently, this device feature will
> + * return early with the success status.

This is an implementation detail, it doesn't need to be part of the
uAPI.  Thanks,

Alex

> + */
> +#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 2/5] vfio: Increment the runtime PM usage count during IOCTL call
  2022-07-19 12:15 ` [PATCH v5 2/5] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
@ 2022-07-21 22:34   ` Alex Williamson
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Williamson @ 2022-07-21 22:34 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, 19 Jul 2022 17:45:20 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> The vfio-pci based drivers will have runtime power management
> support where the user can put the device into the low power state
> and then PCI devices can go into the D3cold state. If the device is
> in the low power state and the user issues any IOCTL, then the
> device should be moved out of the low power state first. Once
> the IOCTL is serviced, then it can go into the low power state again.
> The runtime PM framework manages this with help of usage count.
> 
> One option was to add the runtime PM related API's inside vfio-pci
> driver but some IOCTL (like VFIO_DEVICE_FEATURE) can follow a
> different path and more IOCTL can be added in the future. Also, the
> runtime PM will be added for vfio-pci based drivers variant currently,
> but the other VFIO based drivers can use the same in the
> future. So, this patch adds the runtime calls runtime-related API in
> the top-level IOCTL function itself.
> 
> For the VFIO drivers which do not have runtime power management
> support currently, the runtime PM API's won't be invoked. Only for
> vfio-pci based drivers currently, the runtime PM API's will be invoked
> to increment and decrement the usage count. In the vfio-pci drivers also,
> the variant drivers can opt-out by incrementing the usage count during
> device-open. The pm_runtime_resume_and_get() checks the device
> current status and will return early if the device is already in the
> ACTIVE state.
> 
> Taking this usage count incremented while servicing IOCTL will make
> sure that the user won't put the device into the low power state when any
> other IOCTL is being serviced in parallel. Let's consider the
> following scenario:
> 
>  1. Some other IOCTL is called.
>  2. The user has opened another device instance and called the IOCTL for
>     low power entry.
>  3. The low power entry IOCTL moves the device into the low power state.
>  4. The other IOCTL finishes.
> 
> If we don't keep the usage count incremented then the device
> access will happen between step 3 and 4 while the device has already
> gone into the low power state.
> 
> The pm_runtime_resume_and_get() will be the first call so its error
> should not be propagated to user space directly. For example, if
> pm_runtime_resume_and_get() can return -EINVAL for the cases where the
> user has passed the correct argument. So the
> pm_runtime_resume_and_get() errors have been masked behind -EIO.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/vfio.c | 52 ++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 49 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index bd84ca7c5e35..1d005a0a9d3d 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
>  #include <linux/vfio.h>
>  #include <linux/wait.h>
>  #include <linux/sched/signal.h>
> +#include <linux/pm_runtime.h>
>  #include "vfio.h"
>  
>  #define DRIVER_VERSION	"0.3"
> @@ -1335,6 +1336,39 @@ static const struct file_operations vfio_group_fops = {
>  	.release	= vfio_group_fops_release,
>  };
>  
> +/*
> + * Wrapper around pm_runtime_resume_and_get().
> + * Return error code on failure or 0 on success.
> + */
> +static inline int vfio_device_pm_runtime_get(struct vfio_device *device)
> +{
> +	struct device *dev = device->dev;
> +
> +	if (dev->driver && dev->driver->pm) {
> +		int ret;
> +
> +		ret = pm_runtime_resume_and_get(dev);
> +		if (ret < 0) {

Nit, pm_runtime_resume_and_get() cannot return a positive value, it's
either zero or -errno, so we could just test (ret).  Thanks,

Alex

> +			dev_info_ratelimited(dev,
> +				"vfio: runtime resume failed %d\n", ret);
> +			return -EIO;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Wrapper around pm_runtime_put().
> + */
> +static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
> +{
> +	struct device *dev = device->dev;
> +
> +	if (dev->driver && dev->driver->pm)
> +		pm_runtime_put(dev);
> +}
> +
>  /*
>   * VFIO Device fd
>   */
> @@ -1649,15 +1683,27 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
>  				       unsigned int cmd, unsigned long arg)
>  {
>  	struct vfio_device *device = filep->private_data;
> +	int ret;
> +
> +	ret = vfio_device_pm_runtime_get(device);
> +	if (ret)
> +		return ret;
>  
>  	switch (cmd) {
>  	case VFIO_DEVICE_FEATURE:
> -		return vfio_ioctl_device_feature(device, (void __user *)arg);
> +		ret = vfio_ioctl_device_feature(device, (void __user *)arg);
> +		break;
> +
>  	default:
>  		if (unlikely(!device->ops->ioctl))
> -			return -EINVAL;
> -		return device->ops->ioctl(device, cmd, arg);
> +			ret = -EINVAL;
> +		else
> +			ret = device->ops->ioctl(device, cmd, arg);
> +		break;
>  	}
> +
> +	vfio_device_pm_runtime_put(device);
> +	return ret;
>  }
>  
>  static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 4/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY/EXIT
  2022-07-19 12:15 ` [PATCH v5 4/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY/EXIT Abhishek Sahu
@ 2022-07-21 22:34   ` Alex Williamson
  2022-07-25 14:48     ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-21 22:34 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, 19 Jul 2022 17:45:22 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> Currently, if the runtime power management is enabled for vfio-pci
> based devices in the guest OS, then the guest OS will do the register
> write for PCI_PM_CTRL register. This write request will be handled in
> vfio_pm_config_write() where it will do the actual register write of
> PCI_PM_CTRL register. With this, the maximum D3hot state can be
> achieved for low power. If we can use the runtime PM framework, then
> we can achieve the D3cold state (on the supported systems) which will
> help in saving maximum power.
> 
> 1. D3cold state can't be achieved by writing PCI standard
>    PM config registers. This patch implements the following
>    newly added low power related device features:
>     - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
>     - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
> 
>    The VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY will move the device into
>    the low power state, and the VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
>    will move the device out of the low power state.

Isn't this really:

	The VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY feature will allow the
	device to make use of low power platform states on the host
	while the VFIO_DEVICE_FEATURE_LOW_POWER_EXIT will prevent
	further use of those power states.

ie. we can't make the device move to low power and every ioctl/access
will make it exit, it's more about allowing/preventing use of those
platform provided low power states.

> 
> 2. The vfio-pci driver uses runtime PM framework for low power entry and
>    exit. On the platforms where D3cold state is supported, the runtime
>    PM framework will put the device into D3cold otherwise, D3hot or some
>    other power state will be used. If the user has explicitly disabled
>    runtime PM for the device, then the device will be in the power state
>    configured by the guest OS through PCI_PM_CTRL.

This is talking about disabling runtime PM support for a device on the
host precluding this interface from allowing the device to enter
platform defined low power states, right?
 
> 3. The hypervisors can implement virtual ACPI methods. For example,
>    in guest linux OS if PCI device ACPI node has _PR3 and _PR0 power
>    resources with _ON/_OFF method, then guest linux OS invokes
>    the _OFF method during D3cold transition and then _ON during D0
>    transition. The hypervisor can tap these virtual ACPI calls and then
>    call the low power device feature IOCTL.
> 
> 4. The 'pm_runtime_engaged' flag tracks the entry and exit to
>    runtime PM. This flag is protected with 'memory_lock' semaphore.
> 
> 5. All the config and other region access are wrapped under
>    pm_runtime_resume_and_get() and pm_runtime_put(). So, if any
>    device access happens while the device is in the runtime suspended
>    state, then the device will be resumed first before access. Once the
>    access has been finished, then the device will again go into the
>    runtime suspended state.
> 
> 6. The memory region access through mmap will not be allowed in the low
>    power state. Since __vfio_pci_memory_enabled() is a common function,
>    so check for 'pm_runtime_engaged' has been added explicitly in
>    vfio_pci_mmap_fault() to block only mmap'ed access.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 151 +++++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h    |   1 +
>  2 files changed, 144 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 9517645acfa6..726a6f282496 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -259,11 +259,98 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	return ret;
>  }
>  
> +static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
> +{
> +	/*
> +	 * The vdev power related flags are protected with 'memory_lock'
> +	 * semaphore.
> +	 */
> +	vfio_pci_zap_and_down_write_memory_lock(vdev);
> +	if (vdev->pm_runtime_engaged) {
> +		up_write(&vdev->memory_lock);
> +		return -EINVAL;
> +	}

Awkward that we zap memory for the error path here, but optimizing
performance for a user that can't remember they've already activated
low power for a device doesn't seem like a priority ;)

> +
> +	vdev->pm_runtime_engaged = true;
> +	pm_runtime_put_noidle(&vdev->pdev->dev);
> +	up_write(&vdev->memory_lock);
> +
> +	return 0;
> +}
> +
> +static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
> +				  void __user *arg, size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	int ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
> +	if (ret != 1)
> +		return ret;
> +
> +	/*
> +	 * Inside vfio_pci_runtime_pm_entry(), only the runtime PM usage count
> +	 * will be decremented. The pm_runtime_put() will be invoked again
> +	 * while returning from the ioctl and then the device can go into
> +	 * runtime suspended state.
> +	 */
> +	return vfio_pci_runtime_pm_entry(vdev);
> +}
> +
> +static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
> +{
> +	/*
> +	 * The vdev power related flags are protected with 'memory_lock'
> +	 * semaphore.
> +	 */
> +	down_write(&vdev->memory_lock);
> +	if (vdev->pm_runtime_engaged) {
> +		vdev->pm_runtime_engaged = false;
> +		pm_runtime_get_noresume(&vdev->pdev->dev);
> +	}
> +
> +	up_write(&vdev->memory_lock);
> +}
> +
> +static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
> +				 void __user *arg, size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	int ret;
> +
> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
> +	if (ret != 1)
> +		return ret;
> +
> +	/*
> +	 * The device should already be resumed by the vfio core layer.
> +	 * vfio_pci_runtime_pm_exit() will internally increment the usage
> +	 * count corresponding to pm_runtime_put() called during low power
> +	 * feature entry.
> +	 */
> +	vfio_pci_runtime_pm_exit(vdev);
> +	return 0;
> +}
> +
>  #ifdef CONFIG_PM
>  static int vfio_pci_core_runtime_suspend(struct device *dev)
>  {
>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>  
> +	down_write(&vdev->memory_lock);
> +	/*
> +	 * The user can move the device into D3hot state before invoking
> +	 * power management IOCTL. Move the device into D0 state here and then
> +	 * the pci-driver core runtime PM suspend function will move the device
> +	 * into the low power state. Also, for the devices which have
> +	 * NoSoftRst-, it will help in restoring the original state
> +	 * (saved locally in 'vdev->pm_save').
> +	 */
> +	vfio_pci_set_power_state(vdev, PCI_D0);
> +	up_write(&vdev->memory_lock);
> +
>  	/*
>  	 * If INTx is enabled, then mask INTx before going into the runtime
>  	 * suspended state and unmask the same in the runtime resume.
> @@ -393,6 +480,18 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  
>  	/*
>  	 * This function can be invoked while the power state is non-D0.
> +	 * This non-D0 power state can be with or without runtime PM.
> +	 * vfio_pci_runtime_pm_exit() will internally increment the usage
> +	 * count corresponding to pm_runtime_put() called during low power
> +	 * feature entry and then pm_runtime_resume() will wake up the device,
> +	 * if the device has already gone into the suspended state. Otherwise,
> +	 * the vfio_pci_set_power_state() will change the device power state
> +	 * to D0.
> +	 */
> +	vfio_pci_runtime_pm_exit(vdev);
> +	pm_runtime_resume(&pdev->dev);
> +
> +	/*
>  	 * This function calls __pci_reset_function_locked() which internally
>  	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
>  	 * fail if the power state is non-D0. Also, for the devices which
> @@ -1224,6 +1323,10 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
> +		return vfio_pci_core_pm_entry(device, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
> +		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> @@ -1234,31 +1337,47 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>  			   size_t count, loff_t *ppos, bool iswrite)
>  {
>  	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	int ret;
>  
>  	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
>  		return -EINVAL;
>  
> +	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
> +	if (ret < 0) {

if (ret) {

Thanks,
Alex

> +		pci_info_ratelimited(vdev->pdev, "runtime resume failed %d\n",
> +				     ret);
> +		return -EIO;
> +	}
> +
>  	switch (index) {
>  	case VFIO_PCI_CONFIG_REGION_INDEX:
> -		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> +		break;
>  
>  	case VFIO_PCI_ROM_REGION_INDEX:
>  		if (iswrite)
> -			return -EINVAL;
> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
> +			ret = -EINVAL;
> +		else
> +			ret = vfio_pci_bar_rw(vdev, buf, count, ppos, false);
> +		break;
>  
>  	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
> +		break;
>  
>  	case VFIO_PCI_VGA_REGION_INDEX:
> -		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
> +		break;
> +
>  	default:
>  		index -= VFIO_PCI_NUM_REGIONS;
> -		return vdev->region[index].ops->rw(vdev, buf,
> +		ret = vdev->region[index].ops->rw(vdev, buf,
>  						   count, ppos, iswrite);
> +		break;
>  	}
>  
> -	return -EINVAL;
> +	pm_runtime_put(&vdev->pdev->dev);
> +	return ret;
>  }
>  
>  ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> @@ -1453,7 +1572,11 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>  	mutex_lock(&vdev->vma_lock);
>  	down_read(&vdev->memory_lock);
>  
> -	if (!__vfio_pci_memory_enabled(vdev)) {
> +	/*
> +	 * Memory region cannot be accessed if the low power feature is engaged
> +	 * or memory access is disabled.
> +	 */
> +	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev)) {
>  		ret = VM_FAULT_SIGBUS;
>  		goto up_out;
>  	}
> @@ -2164,6 +2287,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		goto err_unlock;
>  	}
>  
> +	/*
> +	 * Some of the devices in the dev_set can be in the runtime suspended
> +	 * state. Increment the usage count for all the devices in the dev_set
> +	 * before reset and decrement the same after reset.
> +	 */
> +	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
> +	if (ret)
> +		goto err_unlock;
> +
>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>  		/*
>  		 * Test whether all the affected devices are contained by the
> @@ -2219,6 +2351,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		else
>  			mutex_unlock(&cur->vma_lock);
>  	}
> +
> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> +		pm_runtime_put(&cur->pdev->dev);
>  err_unlock:
>  	mutex_unlock(&dev_set->lock);
>  	return ret;
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index e96cc3081236..7ec81271bd05 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -125,6 +125,7 @@ struct vfio_pci_core_device {
>  	bool			nointx;
>  	bool			needs_pm_restore;
>  	bool			pm_intx_masked;
> +	bool			pm_runtime_engaged;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-07-21 22:34   ` Alex Williamson
@ 2022-07-25 14:40     ` Abhishek Sahu
  2022-07-25 22:09       ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-25 14:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/22/2022 4:04 AM, Alex Williamson wrote:
> On Tue, 19 Jul 2022 17:45:19 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> This patch adds the following new device features for the low
>> power entry and exit in the header file. The implementation for the
>> same will be added in the subsequent patches.
>>
>> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
>> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
>> - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
>>
>> With the standard registers, all power states cannot be achieved. The
> 
> We're talking about standard PCI PM registers here, let's make that
> clear since we're adding a device agnostic interface here.
> 
>> platform-based power management needs to be involved to go into the
>> lowest power state. For doing low power entry and exit with
>> platform-based power management, these device features can be used.
>>
>> The entry device feature has two variants. These two variants are mainly
>> to support the different behaviour for the low power entry.
>> If there is any access for the VFIO device on the host side, then the
>> device will be moved out of the low power state without the user's
>> guest driver involvement. Some devices (for example NVIDIA VGA or
>> 3D controller) require the user's guest driver involvement for
>> each low-power entry. In the first variant, the host can move the
>> device into low power without any guest driver involvement while
> 
> Perhaps, "In the first variant, the host can return the device to low
> power automatically.  The device will continue to attempt to reach low
> power until the low power exit feature is called."
> 
>> in the second variant, the host will send a notification to the user
>> through eventfd and then the users guest driver needs to move
>> the device into low power.
> 
> "In the second variant, if the device exits low power due to an access,
> the host kernel will signal the user via the provided eventfd and will
> not return the device to low power without a subsequent call to one of
> the low power entry features.  A call to the low power exit feature is
> optional if the user provided eventfd is signaled."
>  
>> These device features only support VFIO_DEVICE_FEATURE_SET operation.
> 
> And PROBE.
> 

 Thanks Alex.
 I will make the above changes in the commit message.

>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 55 insertions(+)
>>
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 733a1cddde30..08fd3482d22b 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>  };
>>  
>> +/*
>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
>> + * with the platform-based power management.  This low power state will be
> 
> This is really "allow the device to be moved into a low power state"
> rather than actually "move the device into" such a state though, right?
> 
 
 Yes. It will just allow the device to be moved into a low power state.
 I have addressed all your suggestions in the uAPI description and
 added the updated description in the last.

 Can you please check that once and check if it looks okay.
 
>> + * internal to the VFIO driver and the user will not come to know which power
>> + * state is chosen.  If any device access happens (either from the host or
>> + * the guest) when the device is in the low power state, then the host will
>> + * move the device out of the low power state first.  Once the access has been
>> + * finished, then the host will move the device into the low power state again.
>> + * If the user wants that the device should not go into the low power state
>> + * again in this case, then the user should use the
>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature for the
> 
> This should probably just read "For single shot low power support with
> wake-up notification, see
> VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below."
> 
>> + * low power entry.  The mmap'ed region access is not allowed till the low power
>> + * exit happens through VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will
>> + * generate the access fault.
>> + */
>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
> 
> Note that Yishai's DMA logging set is competing for the same feature
> entries.  We'll need to coordinate.
> 

 Since this is last week of rc, so will it possible to consider the updated
 patches for next kernel (I can try to send them after complete testing the
 by the end of this week). Otherwise, I can wait for next kernel and then
 I can rebase my patches.

>> +
>> +/*
>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
>> + * with the platform-based power management and provide support for the wake-up
>> + * notifications through eventfd.  This low power state will be internal to the
>> + * VFIO driver and the user will not come to know which power state is chosen.
>> + * If any device access happens (either from the host or the guest) when the
>> + * device is in the low power state, then the host will move the device out of
>> + * the low power state first and a notification will be sent to the guest
>> + * through eventfd.  Once the access is finished, the host will not move back
>> + * the device into the low power state.  The guest should move the device into
>> + * the low power state again upon receiving the wakeup notification.  The
>> + * notification will be generated only if the device physically went into the
>> + * low power state.  If the low power entry has been disabled from the host
>> + * side, then the device will not go into the low power state even after
>> + * calling this device feature and then the device access does not require
>> + * wake-up.  The mmap'ed region access is not allowed till the low power exit
>> + * happens.  The low power exit can happen either through
>> + * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
>> + * wake-up notification has been generated).
> 
> Seems this could leverage a lot more from the previous, simply stating
> that this has the same behavior as VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
> with the exception that the user provides an eventfd for notification
> when the device resumes from low power and will not try to re-enter a
> low power state without a subsequent user call to one of the low power
> entry feature ioctls.  It might also be worth covering the fact that
> device accesses by the user, including region and ioctl access, will
> also trigger the eventfd if that access triggers a resume.
> 
> As I'm thinking about this, that latter clause is somewhat subtle.
> AIUI a user can call the low power entry with wakeup feature and
> proceed to do various ioctl and region (not mmap) accesses that could
> perpetually keep the device awake, or there may be dependent devices
> such that the device may never go to low power.  It needs to be very
> clear that only if the wakeup eventfd has the device entered into and
> exited a low power state making the low power exit ioctl optional.
> 

 Yes. In my updated description, I have added more details.
 Can you please check if that helps.

>> + */
>> +struct vfio_device_low_power_entry_with_wakeup {
>> +	__s32 wakeup_eventfd;
>> +	__u32 reserved;
>> +};
>> +
>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
>> +
>> +/*
>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device out of the low power
>> + * state.
> 
> Any ioctl effectively does that, the key here is that the low power
> state may not be re-entered after this ioctl.
> 
>>  This device feature should be called only if the user has previously
>> + * put the device into the low power state either with
>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature.  If the
> 
> Doesn't really seem worth mentioning, we need to protect against misuse
> regardless.
> 
>> + * device is not in the low power state currently, this device feature will
>> + * return early with the success status.
> 
> This is an implementation detail, it doesn't need to be part of the
> uAPI.  Thanks,
> 
> Alex
> 
>> + */
>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
>> +
>>  /* -------- API for Type1 VFIO IOMMU -------- */
>>  
>>  /**
> 

 /*
  * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
  * state with the platform-based power management.  This low power state will
  * be internal to the VFIO driver and the user will not come to know which
  * power state is chosen.  If any device access happens (either from the host
  * or the guest) when the device is in the low power state, then the host will
  * move the device out of the low power state first.  Once the access has been
  * finished, then the host will move the device into the low power state
  * again.  For single shot low power support with wake-up notification, see
  * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  The mmap'ed region
  * access is not allowed till the low power exit happens through
  * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will generate the access fault.
  */
 #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3

 /*
  * This device feature has the same behavior as
  * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
  * provides an eventfd for wake-up notification.  When the device moves out of
  * the low power state for the wake-up, the host will not try to re-enter a
  * low power state without a subsequent user call to one of the low power
  * entry device feature IOCTLs.  The low power exit can happen either through
  * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
  * wake-up notification has been generated).
  *
  * The notification through eventfd will be generated only if the device has
  * gone into the low power state after calling this device feature IOCTL.
  * There are various cases where the device will not go into the low power
  * state after calling this device feature IOCTL (for example, the low power
  * entry has been disabled from the host side, the user keeps the device busy
  * after calling this device feature IOCTL, there are dependent devices which
  * block the device low power entry, etc.) and in such cases, the device access
  * does not require wake-up.  Also, the low power exit through
  * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT is mandatory for the cases where the
  * wake-up notification has not been generated.
  */
 struct vfio_device_low_power_entry_with_wakeup {
	 __s32 wakeup_eventfd;
	 __u32 reserved;
 };

 #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4

 /*
  * Upon VFIO_DEVICE_FEATURE_SET, prevent the VFIO device low power state entry
  * which has been previously allowed with VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
  * or VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
  */
 #define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5

 Thanks,
 Abhishek

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 4/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY/EXIT
  2022-07-21 22:34   ` Alex Williamson
@ 2022-07-25 14:48     ` Abhishek Sahu
  0 siblings, 0 replies; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-25 14:48 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/22/2022 4:04 AM, Alex Williamson wrote:
> On Tue, 19 Jul 2022 17:45:22 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> Currently, if the runtime power management is enabled for vfio-pci
>> based devices in the guest OS, then the guest OS will do the register
>> write for PCI_PM_CTRL register. This write request will be handled in
>> vfio_pm_config_write() where it will do the actual register write of
>> PCI_PM_CTRL register. With this, the maximum D3hot state can be
>> achieved for low power. If we can use the runtime PM framework, then
>> we can achieve the D3cold state (on the supported systems) which will
>> help in saving maximum power.
>>
>> 1. D3cold state can't be achieved by writing PCI standard
>>    PM config registers. This patch implements the following
>>    newly added low power related device features:
>>     - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
>>     - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
>>
>>    The VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY will move the device into
>>    the low power state, and the VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
>>    will move the device out of the low power state.
> 
> Isn't this really:
> 
> 	The VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY feature will allow the
> 	device to make use of low power platform states on the host
> 	while the VFIO_DEVICE_FEATURE_LOW_POWER_EXIT will prevent
> 	further use of those power states.
> 
> ie. we can't make the device move to low power and every ioctl/access
> will make it exit, it's more about allowing/preventing use of those
> platform provided low power states.
> 

 Thanks Alex.
 Yes. I will update this.

>>
>> 2. The vfio-pci driver uses runtime PM framework for low power entry and
>>    exit. On the platforms where D3cold state is supported, the runtime
>>    PM framework will put the device into D3cold otherwise, D3hot or some
>>    other power state will be used. If the user has explicitly disabled
>>    runtime PM for the device, then the device will be in the power state
>>    configured by the guest OS through PCI_PM_CTRL.
> 
> This is talking about disabling runtime PM support for a device on the
> host precluding this interface from allowing the device to enter
> platform defined low power states, right?
> 
 
 Yes. It is on the host side.
 I will update this to include the host term to make this clear.
 Also, there are more cases where device won't go into runtime
 suspended state which you mentioned in the first patch (the dependent
 devices preventing the runtime suspend or user keeps the device busy).
 I will add that as well. In all these cases, the device will be in
 active state.

>> 3. The hypervisors can implement virtual ACPI methods. For example,
>>    in guest linux OS if PCI device ACPI node has _PR3 and _PR0 power
>>    resources with _ON/_OFF method, then guest linux OS invokes
>>    the _OFF method during D3cold transition and then _ON during D0
>>    transition. The hypervisor can tap these virtual ACPI calls and then
>>    call the low power device feature IOCTL.
>>
>> 4. The 'pm_runtime_engaged' flag tracks the entry and exit to
>>    runtime PM. This flag is protected with 'memory_lock' semaphore.
>>
>> 5. All the config and other region access are wrapped under
>>    pm_runtime_resume_and_get() and pm_runtime_put(). So, if any
>>    device access happens while the device is in the runtime suspended
>>    state, then the device will be resumed first before access. Once the
>>    access has been finished, then the device will again go into the
>>    runtime suspended state.
>>
>> 6. The memory region access through mmap will not be allowed in the low
>>    power state. Since __vfio_pci_memory_enabled() is a common function,
>>    so check for 'pm_runtime_engaged' has been added explicitly in
>>    vfio_pci_mmap_fault() to block only mmap'ed access.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_core.c | 151 +++++++++++++++++++++++++++++--
>>  include/linux/vfio_pci_core.h    |   1 +
>>  2 files changed, 144 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 9517645acfa6..726a6f282496 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -259,11 +259,98 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>  	return ret;
>>  }
>>  
>> +static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
>> +{
>> +	/*
>> +	 * The vdev power related flags are protected with 'memory_lock'
>> +	 * semaphore.
>> +	 */
>> +	vfio_pci_zap_and_down_write_memory_lock(vdev);
>> +	if (vdev->pm_runtime_engaged) {
>> +		up_write(&vdev->memory_lock);
>> +		return -EINVAL;
>> +	}
> 
> Awkward that we zap memory for the error path here, but optimizing
> performance for a user that can't remember they've already activated
> low power for a device doesn't seem like a priority ;)
> 
>> +
>> +	vdev->pm_runtime_engaged = true;
>> +	pm_runtime_put_noidle(&vdev->pdev->dev);
>> +	up_write(&vdev->memory_lock);
>> +
>> +	return 0;
>> +}
>> +
>> +static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
>> +				  void __user *arg, size_t argsz)
>> +{
>> +	struct vfio_pci_core_device *vdev =
>> +		container_of(device, struct vfio_pci_core_device, vdev);
>> +	int ret;
>> +
>> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
>> +	if (ret != 1)
>> +		return ret;
>> +
>> +	/*
>> +	 * Inside vfio_pci_runtime_pm_entry(), only the runtime PM usage count
>> +	 * will be decremented. The pm_runtime_put() will be invoked again
>> +	 * while returning from the ioctl and then the device can go into
>> +	 * runtime suspended state.
>> +	 */
>> +	return vfio_pci_runtime_pm_entry(vdev);
>> +}
>> +
>> +static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
>> +{
>> +	/*
>> +	 * The vdev power related flags are protected with 'memory_lock'
>> +	 * semaphore.
>> +	 */
>> +	down_write(&vdev->memory_lock);
>> +	if (vdev->pm_runtime_engaged) {
>> +		vdev->pm_runtime_engaged = false;
>> +		pm_runtime_get_noresume(&vdev->pdev->dev);
>> +	}
>> +
>> +	up_write(&vdev->memory_lock);
>> +}
>> +
>> +static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
>> +				 void __user *arg, size_t argsz)
>> +{
>> +	struct vfio_pci_core_device *vdev =
>> +		container_of(device, struct vfio_pci_core_device, vdev);
>> +	int ret;
>> +
>> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
>> +	if (ret != 1)
>> +		return ret;
>> +
>> +	/*
>> +	 * The device should already be resumed by the vfio core layer.
>> +	 * vfio_pci_runtime_pm_exit() will internally increment the usage
>> +	 * count corresponding to pm_runtime_put() called during low power
>> +	 * feature entry.
>> +	 */
>> +	vfio_pci_runtime_pm_exit(vdev);
>> +	return 0;
>> +}
>> +
>>  #ifdef CONFIG_PM
>>  static int vfio_pci_core_runtime_suspend(struct device *dev)
>>  {
>>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>>  
>> +	down_write(&vdev->memory_lock);
>> +	/*
>> +	 * The user can move the device into D3hot state before invoking
>> +	 * power management IOCTL. Move the device into D0 state here and then
>> +	 * the pci-driver core runtime PM suspend function will move the device
>> +	 * into the low power state. Also, for the devices which have
>> +	 * NoSoftRst-, it will help in restoring the original state
>> +	 * (saved locally in 'vdev->pm_save').
>> +	 */
>> +	vfio_pci_set_power_state(vdev, PCI_D0);
>> +	up_write(&vdev->memory_lock);
>> +
>>  	/*
>>  	 * If INTx is enabled, then mask INTx before going into the runtime
>>  	 * suspended state and unmask the same in the runtime resume.
>> @@ -393,6 +480,18 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>  
>>  	/*
>>  	 * This function can be invoked while the power state is non-D0.
>> +	 * This non-D0 power state can be with or without runtime PM.
>> +	 * vfio_pci_runtime_pm_exit() will internally increment the usage
>> +	 * count corresponding to pm_runtime_put() called during low power
>> +	 * feature entry and then pm_runtime_resume() will wake up the device,
>> +	 * if the device has already gone into the suspended state. Otherwise,
>> +	 * the vfio_pci_set_power_state() will change the device power state
>> +	 * to D0.
>> +	 */
>> +	vfio_pci_runtime_pm_exit(vdev);
>> +	pm_runtime_resume(&pdev->dev);
>> +
>> +	/*
>>  	 * This function calls __pci_reset_function_locked() which internally
>>  	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
>>  	 * fail if the power state is non-D0. Also, for the devices which
>> @@ -1224,6 +1323,10 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
>> +	case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
>> +		return vfio_pci_core_pm_entry(device, flags, arg, argsz);
>> +	case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
>> +		return vfio_pci_core_pm_exit(device, flags, arg, argsz);
>>  	default:
>>  		return -ENOTTY;
>>  	}
>> @@ -1234,31 +1337,47 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>  			   size_t count, loff_t *ppos, bool iswrite)
>>  {
>>  	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> +	int ret;
>>  
>>  	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
>>  		return -EINVAL;
>>  
>> +	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
>> +	if (ret < 0) {
> 
> if (ret) {
> 
> Thanks,
> Alex
> 
 
 I will fix this.
 
 Thanks,
 Abhishek

>> +		pci_info_ratelimited(vdev->pdev, "runtime resume failed %d\n",
>> +				     ret);
>> +		return -EIO;
>> +	}
>> +
>>  	switch (index) {
>>  	case VFIO_PCI_CONFIG_REGION_INDEX:
>> -		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
>> +		ret = vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
>> +		break;
>>  
>>  	case VFIO_PCI_ROM_REGION_INDEX:
>>  		if (iswrite)
>> -			return -EINVAL;
>> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
>> +			ret = -EINVAL;
>> +		else
>> +			ret = vfio_pci_bar_rw(vdev, buf, count, ppos, false);
>> +		break;
>>  
>>  	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
>> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
>> +		ret = vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
>> +		break;
>>  
>>  	case VFIO_PCI_VGA_REGION_INDEX:
>> -		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
>> +		ret = vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
>> +		break;
>> +
>>  	default:
>>  		index -= VFIO_PCI_NUM_REGIONS;
>> -		return vdev->region[index].ops->rw(vdev, buf,
>> +		ret = vdev->region[index].ops->rw(vdev, buf,
>>  						   count, ppos, iswrite);
>> +		break;
>>  	}
>>  
>> -	return -EINVAL;
>> +	pm_runtime_put(&vdev->pdev->dev);
>> +	return ret;
>>  }
>>  
>>  ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
>> @@ -1453,7 +1572,11 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
>>  	mutex_lock(&vdev->vma_lock);
>>  	down_read(&vdev->memory_lock);
>>  
>> -	if (!__vfio_pci_memory_enabled(vdev)) {
>> +	/*
>> +	 * Memory region cannot be accessed if the low power feature is engaged
>> +	 * or memory access is disabled.
>> +	 */
>> +	if (vdev->pm_runtime_engaged || !__vfio_pci_memory_enabled(vdev)) {
>>  		ret = VM_FAULT_SIGBUS;
>>  		goto up_out;
>>  	}
>> @@ -2164,6 +2287,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>  		goto err_unlock;
>>  	}
>>  
>> +	/*
>> +	 * Some of the devices in the dev_set can be in the runtime suspended
>> +	 * state. Increment the usage count for all the devices in the dev_set
>> +	 * before reset and decrement the same after reset.
>> +	 */
>> +	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
>> +	if (ret)
>> +		goto err_unlock;
>> +
>>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>>  		/*
>>  		 * Test whether all the affected devices are contained by the
>> @@ -2219,6 +2351,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>  		else
>>  			mutex_unlock(&cur->vma_lock);
>>  	}
>> +
>> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
>> +		pm_runtime_put(&cur->pdev->dev);
>>  err_unlock:
>>  	mutex_unlock(&dev_set->lock);
>>  	return ret;
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index e96cc3081236..7ec81271bd05 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -125,6 +125,7 @@ struct vfio_pci_core_device {
>>  	bool			nointx;
>>  	bool			needs_pm_restore;
>>  	bool			pm_intx_masked;
>> +	bool			pm_runtime_engaged;
>>  	struct pci_saved_state	*pci_saved_state;
>>  	struct pci_saved_state	*pm_save;
>>  	int			ioeventfds_nr;
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 5/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
  2022-07-21 22:34   ` Alex Williamson
@ 2022-07-25 15:04     ` Abhishek Sahu
  0 siblings, 0 replies; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-25 15:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/22/2022 4:04 AM, Alex Williamson wrote:
> On Tue, 19 Jul 2022 17:45:23 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> This patch implements VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
>> device feature. In the VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY, if there is
>> any access for the VFIO device on the host side, then the device will
>> be moved out of the low power state without the user's guest driver
>> involvement. Once the device access has been finished, then the device
>> will be moved again into low power state. With the low power
>> entry happened through VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP,
>> the device will not be moved back into the low power state and
>> a notification will be sent to the user by triggering wakeup eventfd.
>>
>> vfio_pci_core_pm_entry() will be called for both the variants of low
>> power feature entry so add an extra argument for wakeup eventfd context
>> and store locally in 'struct vfio_pci_core_device'.
>>
>> For the entry happened without wakeup eventfd, all the exit related
>> handling will be done by the LOW_POWER_EXIT device feature only.
>> When the LOW_POWER_EXIT will be called, then the vfio core layer
>> vfio_device_pm_runtime_get() will increment the usage count and will
>> resume the device. In the driver runtime_resume callback,
>> the 'pm_wake_eventfd_ctx' will be NULL so the vfio_pci_runtime_pm_exit()
>> will return early. Then vfio_pci_core_pm_exit() will again call
>> vfio_pci_runtime_pm_exit() and now the exit related handling will be done.
>>
>> For the entry happened with wakeup eventfd, in the driver resume
>> callback, eventfd will be triggered and all the exit related handling will
>> be done. When vfio_pci_runtime_pm_exit() will be called by
>> vfio_pci_core_pm_exit(), then it will return early. But if the user has
>> disabled the runtime PM on the host side, the device will never go
>> runtime suspended state and in this case, all the exit related handling
>> will be done during vfio_pci_core_pm_exit() only. Also, the eventfd will
>> not be triggered since the device power state has not been changed by the
>> host driver.
>>
>> For vfio_pci_core_disable() also, all the exit related handling
>> needs to be done if user has closed the device after putting into
>> low power. In this case eventfd will not be triggered since
>> the device close has been initiated by the user only.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_core.c | 78 ++++++++++++++++++++++++++++++--
>>  include/linux/vfio_pci_core.h    |  1 +
>>  2 files changed, 74 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 726a6f282496..dbe942bcaa67 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -259,7 +259,8 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>  	return ret;
>>  }
>>  
>> -static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
>> +static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
>> +				     struct eventfd_ctx *efdctx)
>>  {
>>  	/*
>>  	 * The vdev power related flags are protected with 'memory_lock'
>> @@ -272,6 +273,7 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev)
>>  	}
>>  
>>  	vdev->pm_runtime_engaged = true;
>> +	vdev->pm_wake_eventfd_ctx = efdctx;
>>  	pm_runtime_put_noidle(&vdev->pdev->dev);
>>  	up_write(&vdev->memory_lock);
>>  
>> @@ -295,21 +297,67 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
>>  	 * while returning from the ioctl and then the device can go into
>>  	 * runtime suspended state.
>>  	 */
>> -	return vfio_pci_runtime_pm_entry(vdev);
>> +	return vfio_pci_runtime_pm_entry(vdev, NULL);
>>  }
>>  
>> -static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
>> +static int
>> +vfio_pci_core_pm_entry_with_wakeup(struct vfio_device *device, u32 flags,
>> +				   void __user *arg, size_t argsz)
>> +{
>> +	struct vfio_pci_core_device *vdev =
>> +		container_of(device, struct vfio_pci_core_device, vdev);
>> +	struct vfio_device_low_power_entry_with_wakeup entry;
>> +	struct eventfd_ctx *efdctx;
>> +	int ret;
>> +
>> +	ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
>> +				 sizeof(entry));
>> +	if (ret != 1)
>> +		return ret;
>> +
>> +	if (copy_from_user(&entry, arg, sizeof(entry)))
>> +		return -EFAULT;
>> +
>> +	if (entry.wakeup_eventfd < 0)
>> +		return -EINVAL;
>> +
>> +	efdctx = eventfd_ctx_fdget(entry.wakeup_eventfd);
>> +	if (IS_ERR(efdctx))
>> +		return PTR_ERR(efdctx);
>> +
>> +	ret = vfio_pci_runtime_pm_entry(vdev, efdctx);
>> +	if (ret)
>> +		eventfd_ctx_put(efdctx);
>> +
>> +	return ret;
>> +}
>> +
>> +static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev,
>> +				     bool resume_callback)
>>  {
>>  	/*
>>  	 * The vdev power related flags are protected with 'memory_lock'
>>  	 * semaphore.
>>  	 */
>>  	down_write(&vdev->memory_lock);
>> +	if (resume_callback && !vdev->pm_wake_eventfd_ctx) {
>> +		up_write(&vdev->memory_lock);
>> +		return;
>> +	}
>> +
>>  	if (vdev->pm_runtime_engaged) {
>>  		vdev->pm_runtime_engaged = false;
>>  		pm_runtime_get_noresume(&vdev->pdev->dev);
>>  	}
>>  
>> +	if (vdev->pm_wake_eventfd_ctx) {
>> +		if (resume_callback)
>> +			eventfd_signal(vdev->pm_wake_eventfd_ctx, 1);
>> +
>> +		eventfd_ctx_put(vdev->pm_wake_eventfd_ctx);
>> +		vdev->pm_wake_eventfd_ctx = NULL;
>> +	}
>> +
>>  	up_write(&vdev->memory_lock);
>>  }
>>  
> 
> I find the pm_exit handling here confusing.  We only have one caller
> that can signal the eventfd, so it seems cleaner to me to have that
> caller do the eventfd signal.  We can then remove the arg to pm_exit
> and pull the core of it out to a pre-locked function for that call
> path.  Sometime like below (applies on top of this patch).  Also moved
> the intx unmasking until after the eventfd signaling.  What do you
> think?  Thanks,
> 
> Alex
> 

 Thanks Alex. The updated code looks cleaner.
 I will make the above changes.

 Regards,
 Abhishek

> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index dbe942bcaa67..93169b7d6da2 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -332,32 +332,27 @@ vfio_pci_core_pm_entry_with_wakeup(struct vfio_device *device, u32 flags,
>  	return ret;
>  }
>  
> -static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev,
> -				     bool resume_callback)
> +static void __vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
>  {
> -	/*
> -	 * The vdev power related flags are protected with 'memory_lock'
> -	 * semaphore.
> -	 */
> -	down_write(&vdev->memory_lock);
> -	if (resume_callback && !vdev->pm_wake_eventfd_ctx) {
> -		up_write(&vdev->memory_lock);
> -		return;
> -	}
> -
>  	if (vdev->pm_runtime_engaged) {
>  		vdev->pm_runtime_engaged = false;
>  		pm_runtime_get_noresume(&vdev->pdev->dev);
> -	}
> -
> -	if (vdev->pm_wake_eventfd_ctx) {
> -		if (resume_callback)
> -			eventfd_signal(vdev->pm_wake_eventfd_ctx, 1);
>  
> -		eventfd_ctx_put(vdev->pm_wake_eventfd_ctx);
> -		vdev->pm_wake_eventfd_ctx = NULL;
> +		if (vdev->pm_wake_eventfd_ctx) {
> +			eventfd_ctx_put(vdev->pm_wake_eventfd_ctx);
> +			vdev->pm_wake_eventfd_ctx = NULL;
> +		}
>  	}
> +}
>  
> +static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
> +{
> +	/*
> +	 * The vdev power related flags are protected with 'memory_lock'
> +	 * semaphore.
> +	 */
> +	down_write(&vdev->memory_lock);
> +	__vfio_pci_runtime_pm_exit(vdev);
>  	up_write(&vdev->memory_lock);
>  }
>  
> @@ -373,22 +368,13 @@ static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
>  		return ret;
>  
>  	/*
> -	 * The device should already be resumed by the vfio core layer.
> -	 * vfio_pci_runtime_pm_exit() will internally increment the usage
> -	 * count corresponding to pm_runtime_put() called during low power
> -	 * feature entry.
> -	 *
> -	 * For the low power entry happened with wakeup eventfd, there will
> -	 * be two cases:
> -	 *
> -	 * 1. The device has gone into runtime suspended state. In this case,
> -	 *    the runtime resume by the vfio core layer should already have
> -	 *    performed all exit related handling and the
> -	 *    vfio_pci_runtime_pm_exit() will return early.
> -	 * 2. The device was in runtime active state. In this case, the
> -	 *    vfio_pci_runtime_pm_exit() will do all the required handling.
> +	 * The device is always in the active state here due to pm wrappers
> +	 * around ioctls.  If the device had entered a low power state and
> +	 * pm_wake_eventfd_ctx is valid, vfio_pci_core_runtime_resume() has 
> +	 * already signaled the eventfd and exited low power mode itself.
> +	 * pm_runtime_engaged protects the redundant call here.
>  	 */
> -	vfio_pci_runtime_pm_exit(vdev, false);
> +	vfio_pci_runtime_pm_exit(vdev);
>  	return 0;
>  }
>  
> @@ -425,15 +411,19 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
>  {
>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>  
> -	if (vdev->pm_intx_masked)
> -		vfio_pci_intx_unmask(vdev);
> -
>  	/*
> -	 * Only for the low power entry happened with wakeup eventfd,
> -	 * the vfio_pci_runtime_pm_exit() will perform exit related handling
> -	 * and will trigger eventfd. For the other cases, it will return early.
> +	 * Resume with a pm_wake_eventfd_ctx signals the eventfd and exits
> +	 * low power mode.
>  	 */
> -	vfio_pci_runtime_pm_exit(vdev, true);
> +	down_write(&vdev->memory_lock);
> +	if (vdev->pm_wake_eventfd_ctx) {
> +		eventfd_signal(vdev->pm_wake_eventfd_ctx, 1);
> +		__vfio_pci_runtime_pm_exit(vdev);
> +	}
> +	up_write(&vdev->memory_lock);
> +
> +	if (vdev->pm_intx_masked)
> +		vfio_pci_intx_unmask(vdev);
>  
>  	return 0;
>  }
> @@ -553,7 +543,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	 * the vfio_pci_set_power_state() will change the device power state
>  	 * to D0.
>  	 */
> -	vfio_pci_runtime_pm_exit(vdev, false);
> +	vfio_pci_runtime_pm_exit(vdev);
>  	pm_runtime_resume(&pdev->dev);
>  
>  	/*
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-07-25 14:40     ` Abhishek Sahu
@ 2022-07-25 22:09       ` Alex Williamson
  2022-07-26 12:47         ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-25 22:09 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Mon, 25 Jul 2022 20:10:44 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/22/2022 4:04 AM, Alex Williamson wrote:
> > On Tue, 19 Jul 2022 17:45:19 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> This patch adds the following new device features for the low
> >> power entry and exit in the header file. The implementation for the
> >> same will be added in the subsequent patches.
> >>
> >> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
> >> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
> >> - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
> >>
> >> With the standard registers, all power states cannot be achieved. The  
> > 
> > We're talking about standard PCI PM registers here, let's make that
> > clear since we're adding a device agnostic interface here.
> >   
> >> platform-based power management needs to be involved to go into the
> >> lowest power state. For doing low power entry and exit with
> >> platform-based power management, these device features can be used.
> >>
> >> The entry device feature has two variants. These two variants are mainly
> >> to support the different behaviour for the low power entry.
> >> If there is any access for the VFIO device on the host side, then the
> >> device will be moved out of the low power state without the user's
> >> guest driver involvement. Some devices (for example NVIDIA VGA or
> >> 3D controller) require the user's guest driver involvement for
> >> each low-power entry. In the first variant, the host can move the
> >> device into low power without any guest driver involvement while  
> > 
> > Perhaps, "In the first variant, the host can return the device to low
> > power automatically.  The device will continue to attempt to reach low
> > power until the low power exit feature is called."
> >   
> >> in the second variant, the host will send a notification to the user
> >> through eventfd and then the users guest driver needs to move
> >> the device into low power.  
> > 
> > "In the second variant, if the device exits low power due to an access,
> > the host kernel will signal the user via the provided eventfd and will
> > not return the device to low power without a subsequent call to one of
> > the low power entry features.  A call to the low power exit feature is
> > optional if the user provided eventfd is signaled."
> >    
> >> These device features only support VFIO_DEVICE_FEATURE_SET operation.  
> > 
> > And PROBE.
> >   
> 
>  Thanks Alex.
>  I will make the above changes in the commit message.
> 
> >> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >> ---
> >>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 55 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 733a1cddde30..08fd3482d22b 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
> >>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> >>  };
> >>  
> >> +/*
> >> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
> >> + * with the platform-based power management.  This low power state will be  
> > 
> > This is really "allow the device to be moved into a low power state"
> > rather than actually "move the device into" such a state though, right?
> >   
>  
>  Yes. It will just allow the device to be moved into a low power state.
>  I have addressed all your suggestions in the uAPI description and
>  added the updated description in the last.
> 
>  Can you please check that once and check if it looks okay.
>  
> >> + * internal to the VFIO driver and the user will not come to know which power
> >> + * state is chosen.  If any device access happens (either from the host or
> >> + * the guest) when the device is in the low power state, then the host will
> >> + * move the device out of the low power state first.  Once the access has been
> >> + * finished, then the host will move the device into the low power state again.
> >> + * If the user wants that the device should not go into the low power state
> >> + * again in this case, then the user should use the
> >> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature for the  
> > 
> > This should probably just read "For single shot low power support with
> > wake-up notification, see
> > VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below."
> >   
> >> + * low power entry.  The mmap'ed region access is not allowed till the low power
> >> + * exit happens through VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will
> >> + * generate the access fault.
> >> + */
> >> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3  
> > 
> > Note that Yishai's DMA logging set is competing for the same feature
> > entries.  We'll need to coordinate.
> >   
> 
>  Since this is last week of rc, so will it possible to consider the updated
>  patches for next kernel (I can try to send them after complete testing the
>  by the end of this week). Otherwise, I can wait for next kernel and then
>  I can rebase my patches.

I think we're close, though I would like to solicit uAPI feedback from
others on the list.  We can certainly sort out feature numbers
conflicts at commit time if Yishai's series becomes ready as well.  I'm
on PTO for the rest of the week starting tomorrow afternoon, but I'd
suggest trying to get this re-posted before the end of the week, asking
for more reviews for the uAPI, and we can evaluate early next week if
this is ready.

> >> +
> >> +/*
> >> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
> >> + * with the platform-based power management and provide support for the wake-up
> >> + * notifications through eventfd.  This low power state will be internal to the
> >> + * VFIO driver and the user will not come to know which power state is chosen.
> >> + * If any device access happens (either from the host or the guest) when the
> >> + * device is in the low power state, then the host will move the device out of
> >> + * the low power state first and a notification will be sent to the guest
> >> + * through eventfd.  Once the access is finished, the host will not move back
> >> + * the device into the low power state.  The guest should move the device into
> >> + * the low power state again upon receiving the wakeup notification.  The
> >> + * notification will be generated only if the device physically went into the
> >> + * low power state.  If the low power entry has been disabled from the host
> >> + * side, then the device will not go into the low power state even after
> >> + * calling this device feature and then the device access does not require
> >> + * wake-up.  The mmap'ed region access is not allowed till the low power exit
> >> + * happens.  The low power exit can happen either through
> >> + * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
> >> + * wake-up notification has been generated).  
> > 
> > Seems this could leverage a lot more from the previous, simply stating
> > that this has the same behavior as VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
> > with the exception that the user provides an eventfd for notification
> > when the device resumes from low power and will not try to re-enter a
> > low power state without a subsequent user call to one of the low power
> > entry feature ioctls.  It might also be worth covering the fact that
> > device accesses by the user, including region and ioctl access, will
> > also trigger the eventfd if that access triggers a resume.
> > 
> > As I'm thinking about this, that latter clause is somewhat subtle.
> > AIUI a user can call the low power entry with wakeup feature and
> > proceed to do various ioctl and region (not mmap) accesses that could
> > perpetually keep the device awake, or there may be dependent devices
> > such that the device may never go to low power.  It needs to be very
> > clear that only if the wakeup eventfd has the device entered into and
> > exited a low power state making the low power exit ioctl optional.
> >   
> 
>  Yes. In my updated description, I have added more details.
>  Can you please check if that helps.
> 
> >> + */
> >> +struct vfio_device_low_power_entry_with_wakeup {
> >> +	__s32 wakeup_eventfd;
> >> +	__u32 reserved;
> >> +};
> >> +
> >> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
> >> +
> >> +/*
> >> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device out of the low power
> >> + * state.  
> > 
> > Any ioctl effectively does that, the key here is that the low power
> > state may not be re-entered after this ioctl.
> >   
> >>  This device feature should be called only if the user has previously
> >> + * put the device into the low power state either with
> >> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
> >> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature.  If the  
> > 
> > Doesn't really seem worth mentioning, we need to protect against misuse
> > regardless.
> >   
> >> + * device is not in the low power state currently, this device feature will
> >> + * return early with the success status.  
> > 
> > This is an implementation detail, it doesn't need to be part of the
> > uAPI.  Thanks,
> > 
> > Alex
> >   
> >> + */
> >> +#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
> >> +
> >>  /* -------- API for Type1 VFIO IOMMU -------- */
> >>  
> >>  /**  
> >   
> 
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
>   * state with the platform-based power management.  This low power state will
>   * be internal to the VFIO driver and the user will not come to know which
>   * power state is chosen.  If any device access happens (either from the host

Couldn't the user look in sysfs to determine the power state?  There
might also be external hardware monitors of the power state, so this
statement is a bit overreaching.  Maybe something along the lines of...

"Device use of lower power states depend on factors managed by the
runtime power management core, including system level support and
coordinating support among dependent devices.  Enabling device low
power entry does not guarantee lower power usage by the device, nor is
a mechanism provided through this feature to know the current power
state of the device."

>   * or the guest) when the device is in the low power state, then the host will

Let's not confine ourselves to a VM use case, "...from the host or
through the vfio uAPI".

>   * move the device out of the low power state first.  Once the access has been

"move the device out of the low power state as necessary prior to the
access."

>   * finished, then the host will move the device into the low power state

"Once the access is completed, the device may re-enter the low power
state."

>   * again.  For single shot low power support with wake-up notification, see
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  The mmap'ed region
>   * access is not allowed till the low power exit happens through
>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will generate the access fault.

"Access to mmap'd device regions is disabled on LOW_POWER_ENTRY and
may only be resumed after calling LOW_POWER_EXIT."

>   */
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
> 
>  /*
>   * This device feature has the same behavior as
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
>   * provides an eventfd for wake-up notification.  When the device moves out of
>   * the low power state for the wake-up, the host will not try to re-enter a

"...will not allow the device to re-enter a low power state..."

>   * low power state without a subsequent user call to one of the low power
>   * entry device feature IOCTLs.  The low power exit can happen either through
>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
>   * wake-up notification has been generated).
>   *
>   * The notification through eventfd will be generated only if the device has
>   * gone into the low power state after calling this device feature IOCTL.

"The notification through the provided eventfd will be generated only
when the device has entered and is resumed from a low power state after
calling this device feature IOCTL."


>   * There are various cases where the device will not go into the low power
>   * state after calling this device feature IOCTL (for example, the low power
>   * entry has been disabled from the host side, the user keeps the device busy
>   * after calling this device feature IOCTL, there are dependent devices which
>   * block the device low power entry, etc.) and in such cases, the device access
>   * does not require wake-up.  Also, the low power exit through

"A device that has not entered low power state, as managed through the
runtime power management core, will not generate a notification through
the provided eventfd on access."

>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT is mandatory for the cases where the
>   * wake-up notification has not been generated.

"Calling the LOW_POWER_EXIT feature is optional in the case where
notification has been signaled on the provided eventfd that a resume
from low power has occurred."

We should also reiterate the statement above about mmap access because
it would be reasonable for a user to assume that if any access wakes
the device, that should include mmap faults and therefore a resume
notification should occur for such an event.  We could implement that
for the one-shot mode, but we're choosing not to.

>   */
>  struct vfio_device_low_power_entry_with_wakeup {
> 	 __s32 wakeup_eventfd;
> 	 __u32 reserved;
>  };
> 
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
> 
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, prevent the VFIO device low power state entry
>   * which has been previously allowed with VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
>   * or VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
>   */

"Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states
as previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.  This
device feature ioctl may itself generate a wakeup eventfd notification
in the latter case if the device has previously entered a low power
state."

Thanks,
Alex


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-07-25 22:09       ` Alex Williamson
@ 2022-07-26 12:47         ` Abhishek Sahu
  2022-07-26 13:13           ` Cornelia Huck
                             ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-26 12:47 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/26/2022 3:39 AM, Alex Williamson wrote:
> On Mon, 25 Jul 2022 20:10:44 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 7/22/2022 4:04 AM, Alex Williamson wrote:
>>> On Tue, 19 Jul 2022 17:45:19 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> This patch adds the following new device features for the low
>>>> power entry and exit in the header file. The implementation for the
>>>> same will be added in the subsequent patches.
>>>>
>>>> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
>>>> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
>>>> - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
>>>>
>>>> With the standard registers, all power states cannot be achieved. The  
>>>
>>> We're talking about standard PCI PM registers here, let's make that
>>> clear since we're adding a device agnostic interface here.
>>>   
>>>> platform-based power management needs to be involved to go into the
>>>> lowest power state. For doing low power entry and exit with
>>>> platform-based power management, these device features can be used.
>>>>
>>>> The entry device feature has two variants. These two variants are mainly
>>>> to support the different behaviour for the low power entry.
>>>> If there is any access for the VFIO device on the host side, then the
>>>> device will be moved out of the low power state without the user's
>>>> guest driver involvement. Some devices (for example NVIDIA VGA or
>>>> 3D controller) require the user's guest driver involvement for
>>>> each low-power entry. In the first variant, the host can move the
>>>> device into low power without any guest driver involvement while  
>>>
>>> Perhaps, "In the first variant, the host can return the device to low
>>> power automatically.  The device will continue to attempt to reach low
>>> power until the low power exit feature is called."
>>>   
>>>> in the second variant, the host will send a notification to the user
>>>> through eventfd and then the users guest driver needs to move
>>>> the device into low power.  
>>>
>>> "In the second variant, if the device exits low power due to an access,
>>> the host kernel will signal the user via the provided eventfd and will
>>> not return the device to low power without a subsequent call to one of
>>> the low power entry features.  A call to the low power exit feature is
>>> optional if the user provided eventfd is signaled."
>>>    
>>>> These device features only support VFIO_DEVICE_FEATURE_SET operation.  
>>>
>>> And PROBE.
>>>   
>>
>>  Thanks Alex.
>>  I will make the above changes in the commit message.
>>
>>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>>>> ---
>>>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 55 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index 733a1cddde30..08fd3482d22b 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>>>  };
>>>>  
>>>> +/*
>>>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
>>>> + * with the platform-based power management.  This low power state will be  
>>>
>>> This is really "allow the device to be moved into a low power state"
>>> rather than actually "move the device into" such a state though, right?
>>>   
>>  
>>  Yes. It will just allow the device to be moved into a low power state.
>>  I have addressed all your suggestions in the uAPI description and
>>  added the updated description in the last.
>>
>>  Can you please check that once and check if it looks okay.
>>  
>>>> + * internal to the VFIO driver and the user will not come to know which power
>>>> + * state is chosen.  If any device access happens (either from the host or
>>>> + * the guest) when the device is in the low power state, then the host will
>>>> + * move the device out of the low power state first.  Once the access has been
>>>> + * finished, then the host will move the device into the low power state again.
>>>> + * If the user wants that the device should not go into the low power state
>>>> + * again in this case, then the user should use the
>>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature for the  
>>>
>>> This should probably just read "For single shot low power support with
>>> wake-up notification, see
>>> VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below."
>>>   
>>>> + * low power entry.  The mmap'ed region access is not allowed till the low power
>>>> + * exit happens through VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will
>>>> + * generate the access fault.
>>>> + */
>>>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3  
>>>
>>> Note that Yishai's DMA logging set is competing for the same feature
>>> entries.  We'll need to coordinate.
>>>   
>>
>>  Since this is last week of rc, so will it possible to consider the updated
>>  patches for next kernel (I can try to send them after complete testing the
>>  by the end of this week). Otherwise, I can wait for next kernel and then
>>  I can rebase my patches.
> 
> I think we're close, though I would like to solicit uAPI feedback from
> others on the list.  We can certainly sort out feature numbers
> conflicts at commit time if Yishai's series becomes ready as well.  I'm
> on PTO for the rest of the week starting tomorrow afternoon, but I'd
> suggest trying to get this re-posted before the end of the week, asking
> for more reviews for the uAPI, and we can evaluate early next week if
> this is ready.
> 

 Sure. I will re-post the updated patch series after the complete testing.
 
>>>> +
>>>> +/*
>>>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
>>>> + * with the platform-based power management and provide support for the wake-up
>>>> + * notifications through eventfd.  This low power state will be internal to the
>>>> + * VFIO driver and the user will not come to know which power state is chosen.
>>>> + * If any device access happens (either from the host or the guest) when the
>>>> + * device is in the low power state, then the host will move the device out of
>>>> + * the low power state first and a notification will be sent to the guest
>>>> + * through eventfd.  Once the access is finished, the host will not move back
>>>> + * the device into the low power state.  The guest should move the device into
>>>> + * the low power state again upon receiving the wakeup notification.  The
>>>> + * notification will be generated only if the device physically went into the
>>>> + * low power state.  If the low power entry has been disabled from the host
>>>> + * side, then the device will not go into the low power state even after
>>>> + * calling this device feature and then the device access does not require
>>>> + * wake-up.  The mmap'ed region access is not allowed till the low power exit
>>>> + * happens.  The low power exit can happen either through
>>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
>>>> + * wake-up notification has been generated).  
>>>
>>> Seems this could leverage a lot more from the previous, simply stating
>>> that this has the same behavior as VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
>>> with the exception that the user provides an eventfd for notification
>>> when the device resumes from low power and will not try to re-enter a
>>> low power state without a subsequent user call to one of the low power
>>> entry feature ioctls.  It might also be worth covering the fact that
>>> device accesses by the user, including region and ioctl access, will
>>> also trigger the eventfd if that access triggers a resume.
>>>
>>> As I'm thinking about this, that latter clause is somewhat subtle.
>>> AIUI a user can call the low power entry with wakeup feature and
>>> proceed to do various ioctl and region (not mmap) accesses that could
>>> perpetually keep the device awake, or there may be dependent devices
>>> such that the device may never go to low power.  It needs to be very
>>> clear that only if the wakeup eventfd has the device entered into and
>>> exited a low power state making the low power exit ioctl optional.
>>>   
>>
>>  Yes. In my updated description, I have added more details.
>>  Can you please check if that helps.
>>
>>>> + */
>>>> +struct vfio_device_low_power_entry_with_wakeup {
>>>> +	__s32 wakeup_eventfd;
>>>> +	__u32 reserved;
>>>> +};
>>>> +
>>>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
>>>> +
>>>> +/*
>>>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device out of the low power
>>>> + * state.  
>>>
>>> Any ioctl effectively does that, the key here is that the low power
>>> state may not be re-entered after this ioctl.
>>>   
>>>>  This device feature should be called only if the user has previously
>>>> + * put the device into the low power state either with
>>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
>>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature.  If the  
>>>
>>> Doesn't really seem worth mentioning, we need to protect against misuse
>>> regardless.
>>>   
>>>> + * device is not in the low power state currently, this device feature will
>>>> + * return early with the success status.  
>>>
>>> This is an implementation detail, it doesn't need to be part of the
>>> uAPI.  Thanks,
>>>
>>> Alex
>>>   
>>>> + */
>>>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
>>>> +
>>>>  /* -------- API for Type1 VFIO IOMMU -------- */
>>>>  
>>>>  /**  
>>>   
>>
>>  /*
>>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
>>   * state with the platform-based power management.  This low power state will
>>   * be internal to the VFIO driver and the user will not come to know which
>>   * power state is chosen.  If any device access happens (either from the host
> 
> Couldn't the user look in sysfs to determine the power state?  There
> might also be external hardware monitors of the power state, so this
> statement is a bit overreaching.  Maybe something along the lines of...
> 
> "Device use of lower power states depend on factors managed by the
> runtime power management core, including system level support and
> coordinating support among dependent devices.  Enabling device low
> power entry does not guarantee lower power usage by the device, nor is
> a mechanism provided through this feature to know the current power
> state of the device."
> 
>>   * or the guest) when the device is in the low power state, then the host will
> 
> Let's not confine ourselves to a VM use case, "...from the host or
> through the vfio uAPI".
> 
>>   * move the device out of the low power state first.  Once the access has been
> 
> "move the device out of the low power state as necessary prior to the
> access."
> 
>>   * finished, then the host will move the device into the low power state
> 
> "Once the access is completed, the device may re-enter the low power
> state."
> 
>>   * again.  For single shot low power support with wake-up notification, see
>>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  The mmap'ed region
>>   * access is not allowed till the low power exit happens through
>>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will generate the access fault.
> 
> "Access to mmap'd device regions is disabled on LOW_POWER_ENTRY and
> may only be resumed after calling LOW_POWER_EXIT."
> 
>>   */
>>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
>>
>>  /*
>>   * This device feature has the same behavior as
>>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
>>   * provides an eventfd for wake-up notification.  When the device moves out of
>>   * the low power state for the wake-up, the host will not try to re-enter a
> 
> "...will not allow the device to re-enter a low power state..."
> 
>>   * low power state without a subsequent user call to one of the low power
>>   * entry device feature IOCTLs.  The low power exit can happen either through
>>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
>>   * wake-up notification has been generated).
>>   *
>>   * The notification through eventfd will be generated only if the device has
>>   * gone into the low power state after calling this device feature IOCTL.
> 
> "The notification through the provided eventfd will be generated only
> when the device has entered and is resumed from a low power state after
> calling this device feature IOCTL."
> 
> 
>>   * There are various cases where the device will not go into the low power
>>   * state after calling this device feature IOCTL (for example, the low power
>>   * entry has been disabled from the host side, the user keeps the device busy
>>   * after calling this device feature IOCTL, there are dependent devices which
>>   * block the device low power entry, etc.) and in such cases, the device access
>>   * does not require wake-up.  Also, the low power exit through
> 
> "A device that has not entered low power state, as managed through the
> runtime power management core, will not generate a notification through
> the provided eventfd on access."
> 
>>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT is mandatory for the cases where the
>>   * wake-up notification has not been generated.
> 
> "Calling the LOW_POWER_EXIT feature is optional in the case where
> notification has been signaled on the provided eventfd that a resume
> from low power has occurred."
> 
> We should also reiterate the statement above about mmap access because
> it would be reasonable for a user to assume that if any access wakes
> the device, that should include mmap faults and therefore a resume
> notification should occur for such an event.  We could implement that
> for the one-shot mode, but we're choosing not to.
> 
>>   */
>>  struct vfio_device_low_power_entry_with_wakeup {
>> 	 __s32 wakeup_eventfd;
>> 	 __u32 reserved;
>>  };
>>
>>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
>>
>>  /*
>>   * Upon VFIO_DEVICE_FEATURE_SET, prevent the VFIO device low power state entry
>>   * which has been previously allowed with VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
>>   * or VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
>>   */
> 
> "Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states
> as previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
> VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.  This
> device feature ioctl may itself generate a wakeup eventfd notification
> in the latter case if the device has previously entered a low power
> state."
> 
> Thanks,
> Alex
> 

 Thanks Alex for your thorough review of uAPI.
 I have incorporated all the suggestions.
 Following is the updated uAPI.
 
 /*
  * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
  * state with the platform-based power management.  Device use of lower power
  * states depends on factors managed by the runtime power management core,
  * including system level support and coordinating support among dependent
  * devices.  Enabling device low power entry does not guarantee lower power
  * usage by the device, nor is a mechanism provided through this feature to
  * know the current power state of the device.  If any device access happens
  * (either from the host or through the vfio uAPI) when the device is in the
  * low power state, then the host will move the device out of the low power
  * state as necessary prior to the access.  Once the access is completed, the
  * device may re-enter the low power state.  For single shot low power support
  * with wake-up notification, see
  * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
  * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
  * calling LOW_POWER_EXIT.
  */
 #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
 
 /*
  * This device feature has the same behavior as
  * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
  * provides an eventfd for wake-up notification.  When the device moves out of
  * the low power state for the wake-up, the host will not allow the device to
  * re-enter a low power state without a subsequent user call to one of the low
  * power entry device feature IOCTLs.  Access to mmap'd device regions is
  * disabled on LOW_POWER_ENTRY_WITH_WAKEUP and may only be resumed after the
  * low power exit.  The low power exit can happen either through LOW_POWER_EXIT
  * or through any other access (where the wake-up notification has been
  * generated).  The access to mmap'd device regions will not trigger low power
  * exit.
  *
  * The notification through the provided eventfd will be generated only when
  * the device has entered and is resumed from a low power state after
  * calling this device feature IOCTL.  A device that has not entered low power
  * state, as managed through the runtime power management core, will not
  * generate a notification through the provided eventfd on access.  Calling the
  * LOW_POWER_EXIT feature is optional in the case where notification has been
  * signaled on the provided eventfd that a resume from low power has occurred.
  */
 struct vfio_device_low_power_entry_with_wakeup {
 	__s32 wakeup_eventfd;
 	__u32 reserved;
 };
 
 #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
 
 /*
  * Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states as
  * previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
  * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
  * This device feature IOCTL may itself generate a wakeup eventfd notification
  * in the latter case if the device has previously entered a low power state.
  */
 #define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5

 Regards,
 Abhishek

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-07-26 12:47         ` Abhishek Sahu
@ 2022-07-26 13:13           ` Cornelia Huck
  2022-07-26 14:17           ` Alex Williamson
  2022-07-26 17:23           ` Jason Gunthorpe
  2 siblings, 0 replies; 26+ messages in thread
From: Cornelia Huck @ 2022-07-26 13:13 UTC (permalink / raw)
  To: Abhishek Sahu, Alex Williamson
  Cc: Yishai Hadas, Jason Gunthorpe, Shameer Kolothum, Kevin Tian,
	Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas, linux-kernel,
	kvm, linux-pm, linux-pci

On Tue, Jul 26 2022, Abhishek Sahu <abhsahu@nvidia.com> wrote:

>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
>  
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states as
>   * previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
>   * This device feature IOCTL may itself generate a wakeup eventfd notification
>   * in the latter case if the device has previously entered a low power state.

Nit: s/has/had/

>   */
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5

I haven't followed this closely, and I'm not that familiar with power
management, but at least I can't spot anything obviously problematic.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-07-26 12:47         ` Abhishek Sahu
  2022-07-26 13:13           ` Cornelia Huck
@ 2022-07-26 14:17           ` Alex Williamson
  2022-07-26 17:23           ` Jason Gunthorpe
  2 siblings, 0 replies; 26+ messages in thread
From: Alex Williamson @ 2022-07-26 14:17 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, 26 Jul 2022 18:17:18 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/26/2022 3:39 AM, Alex Williamson wrote:
> > On Mon, 25 Jul 2022 20:10:44 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> On 7/22/2022 4:04 AM, Alex Williamson wrote:  
> >>> On Tue, 19 Jul 2022 17:45:19 +0530
> >>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >>>     
> >>>> This patch adds the following new device features for the low
> >>>> power entry and exit in the header file. The implementation for the
> >>>> same will be added in the subsequent patches.
> >>>>
> >>>> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
> >>>> - VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP
> >>>> - VFIO_DEVICE_FEATURE_LOW_POWER_EXIT
> >>>>
> >>>> With the standard registers, all power states cannot be achieved. The    
> >>>
> >>> We're talking about standard PCI PM registers here, let's make that
> >>> clear since we're adding a device agnostic interface here.
> >>>     
> >>>> platform-based power management needs to be involved to go into the
> >>>> lowest power state. For doing low power entry and exit with
> >>>> platform-based power management, these device features can be used.
> >>>>
> >>>> The entry device feature has two variants. These two variants are mainly
> >>>> to support the different behaviour for the low power entry.
> >>>> If there is any access for the VFIO device on the host side, then the
> >>>> device will be moved out of the low power state without the user's
> >>>> guest driver involvement. Some devices (for example NVIDIA VGA or
> >>>> 3D controller) require the user's guest driver involvement for
> >>>> each low-power entry. In the first variant, the host can move the
> >>>> device into low power without any guest driver involvement while    
> >>>
> >>> Perhaps, "In the first variant, the host can return the device to low
> >>> power automatically.  The device will continue to attempt to reach low
> >>> power until the low power exit feature is called."
> >>>     
> >>>> in the second variant, the host will send a notification to the user
> >>>> through eventfd and then the users guest driver needs to move
> >>>> the device into low power.    
> >>>
> >>> "In the second variant, if the device exits low power due to an access,
> >>> the host kernel will signal the user via the provided eventfd and will
> >>> not return the device to low power without a subsequent call to one of
> >>> the low power entry features.  A call to the low power exit feature is
> >>> optional if the user provided eventfd is signaled."
> >>>      
> >>>> These device features only support VFIO_DEVICE_FEATURE_SET operation.    
> >>>
> >>> And PROBE.
> >>>     
> >>
> >>  Thanks Alex.
> >>  I will make the above changes in the commit message.
> >>  
> >>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >>>> ---
> >>>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
> >>>>  1 file changed, 55 insertions(+)
> >>>>
> >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>>> index 733a1cddde30..08fd3482d22b 100644
> >>>> --- a/include/uapi/linux/vfio.h
> >>>> +++ b/include/uapi/linux/vfio.h
> >>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
> >>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> >>>>  };
> >>>>  
> >>>> +/*
> >>>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
> >>>> + * with the platform-based power management.  This low power state will be    
> >>>
> >>> This is really "allow the device to be moved into a low power state"
> >>> rather than actually "move the device into" such a state though, right?
> >>>     
> >>  
> >>  Yes. It will just allow the device to be moved into a low power state.
> >>  I have addressed all your suggestions in the uAPI description and
> >>  added the updated description in the last.
> >>
> >>  Can you please check that once and check if it looks okay.
> >>    
> >>>> + * internal to the VFIO driver and the user will not come to know which power
> >>>> + * state is chosen.  If any device access happens (either from the host or
> >>>> + * the guest) when the device is in the low power state, then the host will
> >>>> + * move the device out of the low power state first.  Once the access has been
> >>>> + * finished, then the host will move the device into the low power state again.
> >>>> + * If the user wants that the device should not go into the low power state
> >>>> + * again in this case, then the user should use the
> >>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature for the    
> >>>
> >>> This should probably just read "For single shot low power support with
> >>> wake-up notification, see
> >>> VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below."
> >>>     
> >>>> + * low power entry.  The mmap'ed region access is not allowed till the low power
> >>>> + * exit happens through VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will
> >>>> + * generate the access fault.
> >>>> + */
> >>>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3    
> >>>
> >>> Note that Yishai's DMA logging set is competing for the same feature
> >>> entries.  We'll need to coordinate.
> >>>     
> >>
> >>  Since this is last week of rc, so will it possible to consider the updated
> >>  patches for next kernel (I can try to send them after complete testing the
> >>  by the end of this week). Otherwise, I can wait for next kernel and then
> >>  I can rebase my patches.  
> > 
> > I think we're close, though I would like to solicit uAPI feedback from
> > others on the list.  We can certainly sort out feature numbers
> > conflicts at commit time if Yishai's series becomes ready as well.  I'm
> > on PTO for the rest of the week starting tomorrow afternoon, but I'd
> > suggest trying to get this re-posted before the end of the week, asking
> > for more reviews for the uAPI, and we can evaluate early next week if
> > this is ready.
> >   
> 
>  Sure. I will re-post the updated patch series after the complete testing.
>  
> >>>> +
> >>>> +/*
> >>>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device into the low power state
> >>>> + * with the platform-based power management and provide support for the wake-up
> >>>> + * notifications through eventfd.  This low power state will be internal to the
> >>>> + * VFIO driver and the user will not come to know which power state is chosen.
> >>>> + * If any device access happens (either from the host or the guest) when the
> >>>> + * device is in the low power state, then the host will move the device out of
> >>>> + * the low power state first and a notification will be sent to the guest
> >>>> + * through eventfd.  Once the access is finished, the host will not move back
> >>>> + * the device into the low power state.  The guest should move the device into
> >>>> + * the low power state again upon receiving the wakeup notification.  The
> >>>> + * notification will be generated only if the device physically went into the
> >>>> + * low power state.  If the low power entry has been disabled from the host
> >>>> + * side, then the device will not go into the low power state even after
> >>>> + * calling this device feature and then the device access does not require
> >>>> + * wake-up.  The mmap'ed region access is not allowed till the low power exit
> >>>> + * happens.  The low power exit can happen either through
> >>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
> >>>> + * wake-up notification has been generated).    
> >>>
> >>> Seems this could leverage a lot more from the previous, simply stating
> >>> that this has the same behavior as VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
> >>> with the exception that the user provides an eventfd for notification
> >>> when the device resumes from low power and will not try to re-enter a
> >>> low power state without a subsequent user call to one of the low power
> >>> entry feature ioctls.  It might also be worth covering the fact that
> >>> device accesses by the user, including region and ioctl access, will
> >>> also trigger the eventfd if that access triggers a resume.
> >>>
> >>> As I'm thinking about this, that latter clause is somewhat subtle.
> >>> AIUI a user can call the low power entry with wakeup feature and
> >>> proceed to do various ioctl and region (not mmap) accesses that could
> >>> perpetually keep the device awake, or there may be dependent devices
> >>> such that the device may never go to low power.  It needs to be very
> >>> clear that only if the wakeup eventfd has the device entered into and
> >>> exited a low power state making the low power exit ioctl optional.
> >>>     
> >>
> >>  Yes. In my updated description, I have added more details.
> >>  Can you please check if that helps.
> >>  
> >>>> + */
> >>>> +struct vfio_device_low_power_entry_with_wakeup {
> >>>> +	__s32 wakeup_eventfd;
> >>>> +	__u32 reserved;
> >>>> +};
> >>>> +
> >>>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
> >>>> +
> >>>> +/*
> >>>> + * Upon VFIO_DEVICE_FEATURE_SET, move the VFIO device out of the low power
> >>>> + * state.    
> >>>
> >>> Any ioctl effectively does that, the key here is that the low power
> >>> state may not be re-entered after this ioctl.
> >>>     
> >>>>  This device feature should be called only if the user has previously
> >>>> + * put the device into the low power state either with
> >>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
> >>>> + * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device feature.  If the    
> >>>
> >>> Doesn't really seem worth mentioning, we need to protect against misuse
> >>> regardless.
> >>>     
> >>>> + * device is not in the low power state currently, this device feature will
> >>>> + * return early with the success status.    
> >>>
> >>> This is an implementation detail, it doesn't need to be part of the
> >>> uAPI.  Thanks,
> >>>
> >>> Alex
> >>>     
> >>>> + */
> >>>> +#define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5
> >>>> +
> >>>>  /* -------- API for Type1 VFIO IOMMU -------- */
> >>>>  
> >>>>  /**    
> >>>     
> >>
> >>  /*
> >>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
> >>   * state with the platform-based power management.  This low power state will
> >>   * be internal to the VFIO driver and the user will not come to know which
> >>   * power state is chosen.  If any device access happens (either from the host  
> > 
> > Couldn't the user look in sysfs to determine the power state?  There
> > might also be external hardware monitors of the power state, so this
> > statement is a bit overreaching.  Maybe something along the lines of...
> > 
> > "Device use of lower power states depend on factors managed by the
> > runtime power management core, including system level support and
> > coordinating support among dependent devices.  Enabling device low
> > power entry does not guarantee lower power usage by the device, nor is
> > a mechanism provided through this feature to know the current power
> > state of the device."
> >   
> >>   * or the guest) when the device is in the low power state, then the host will  
> > 
> > Let's not confine ourselves to a VM use case, "...from the host or
> > through the vfio uAPI".
> >   
> >>   * move the device out of the low power state first.  Once the access has been  
> > 
> > "move the device out of the low power state as necessary prior to the
> > access."
> >   
> >>   * finished, then the host will move the device into the low power state  
> > 
> > "Once the access is completed, the device may re-enter the low power
> > state."
> >   
> >>   * again.  For single shot low power support with wake-up notification, see
> >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  The mmap'ed region
> >>   * access is not allowed till the low power exit happens through
> >>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT and will generate the access fault.  
> > 
> > "Access to mmap'd device regions is disabled on LOW_POWER_ENTRY and
> > may only be resumed after calling LOW_POWER_EXIT."
> >   
> >>   */
> >>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
> >>
> >>  /*
> >>   * This device feature has the same behavior as
> >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
> >>   * provides an eventfd for wake-up notification.  When the device moves out of
> >>   * the low power state for the wake-up, the host will not try to re-enter a  
> > 
> > "...will not allow the device to re-enter a low power state..."
> >   
> >>   * low power state without a subsequent user call to one of the low power
> >>   * entry device feature IOCTLs.  The low power exit can happen either through
> >>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT or through any other access (where the
> >>   * wake-up notification has been generated).
> >>   *
> >>   * The notification through eventfd will be generated only if the device has
> >>   * gone into the low power state after calling this device feature IOCTL.  
> > 
> > "The notification through the provided eventfd will be generated only
> > when the device has entered and is resumed from a low power state after
> > calling this device feature IOCTL."
> > 
> >   
> >>   * There are various cases where the device will not go into the low power
> >>   * state after calling this device feature IOCTL (for example, the low power
> >>   * entry has been disabled from the host side, the user keeps the device busy
> >>   * after calling this device feature IOCTL, there are dependent devices which
> >>   * block the device low power entry, etc.) and in such cases, the device access
> >>   * does not require wake-up.  Also, the low power exit through  
> > 
> > "A device that has not entered low power state, as managed through the
> > runtime power management core, will not generate a notification through
> > the provided eventfd on access."
> >   
> >>   * VFIO_DEVICE_FEATURE_LOW_POWER_EXIT is mandatory for the cases where the
> >>   * wake-up notification has not been generated.  
> > 
> > "Calling the LOW_POWER_EXIT feature is optional in the case where
> > notification has been signaled on the provided eventfd that a resume
> > from low power has occurred."
> > 
> > We should also reiterate the statement above about mmap access because
> > it would be reasonable for a user to assume that if any access wakes
> > the device, that should include mmap faults and therefore a resume
> > notification should occur for such an event.  We could implement that
> > for the one-shot mode, but we're choosing not to.
> >   
> >>   */
> >>  struct vfio_device_low_power_entry_with_wakeup {
> >> 	 __s32 wakeup_eventfd;
> >> 	 __u32 reserved;
> >>  };
> >>
> >>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
> >>
> >>  /*
> >>   * Upon VFIO_DEVICE_FEATURE_SET, prevent the VFIO device low power state entry
> >>   * which has been previously allowed with VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY
> >>   * or VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
> >>   */  
> > 
> > "Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states
> > as previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
> > VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.  This
> > device feature ioctl may itself generate a wakeup eventfd notification
> > in the latter case if the device has previously entered a low power
> > state."
> > 
> > Thanks,
> > Alex
> >   
> 
>  Thanks Alex for your thorough review of uAPI.
>  I have incorporated all the suggestions.
>  Following is the updated uAPI.
>  
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
>   * state with the platform-based power management.  Device use of lower power
>   * states depends on factors managed by the runtime power management core,
>   * including system level support and coordinating support among dependent
>   * devices.  Enabling device low power entry does not guarantee lower power
>   * usage by the device, nor is a mechanism provided through this feature to
>   * know the current power state of the device.  If any device access happens
>   * (either from the host or through the vfio uAPI) when the device is in the
>   * low power state, then the host will move the device out of the low power
>   * state as necessary prior to the access.  Once the access is completed, the
>   * device may re-enter the low power state.  For single shot low power support
>   * with wake-up notification, see
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
>   * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
>   * calling LOW_POWER_EXIT.
>   */
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
>  
>  /*
>   * This device feature has the same behavior as
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
>   * provides an eventfd for wake-up notification.  When the device moves out of
>   * the low power state for the wake-up, the host will not allow the device to
>   * re-enter a low power state without a subsequent user call to one of the low
>   * power entry device feature IOCTLs.  Access to mmap'd device regions is
>   * disabled on LOW_POWER_ENTRY_WITH_WAKEUP and may only be resumed after the
>   * low power exit.  The low power exit can happen either through LOW_POWER_EXIT
>   * or through any other access (where the wake-up notification has been
>   * generated).  The access to mmap'd device regions will not trigger low power
>   * exit.
>   *
>   * The notification through the provided eventfd will be generated only when
>   * the device has entered and is resumed from a low power state after
>   * calling this device feature IOCTL.  A device that has not entered low power
>   * state, as managed through the runtime power management core, will not
>   * generate a notification through the provided eventfd on access.  Calling the
>   * LOW_POWER_EXIT feature is optional in the case where notification has been
>   * signaled on the provided eventfd that a resume from low power has occurred.
>   */
>  struct vfio_device_low_power_entry_with_wakeup {
>  	__s32 wakeup_eventfd;
>  	__u32 reserved;
>  };
>  
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP 4
>  
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states as
>   * previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY or
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP device features.
>   * This device feature IOCTL may itself generate a wakeup eventfd notification
>   * in the latter case if the device has previously entered a low power state.
>   */
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 5

Looks ok to me.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-07-26 12:47         ` Abhishek Sahu
  2022-07-26 13:13           ` Cornelia Huck
  2022-07-26 14:17           ` Alex Williamson
@ 2022-07-26 17:23           ` Jason Gunthorpe
  2022-07-27  6:07             ` Abhishek Sahu
  2 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2022-07-26 17:23 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Alex Williamson, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, Jul 26, 2022 at 06:17:18PM +0530, Abhishek Sahu wrote:
>  Thanks Alex for your thorough review of uAPI.
>  I have incorporated all the suggestions.
>  Following is the updated uAPI.
>  
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
>   * state with the platform-based power management.  Device use of lower power
>   * states depends on factors managed by the runtime power management core,
>   * including system level support and coordinating support among dependent
>   * devices.  Enabling device low power entry does not guarantee lower power
>   * usage by the device, nor is a mechanism provided through this feature to
>   * know the current power state of the device.  If any device access happens
>   * (either from the host or through the vfio uAPI) when the device is in the
>   * low power state, then the host will move the device out of the low power
>   * state as necessary prior to the access.  Once the access is completed, the
>   * device may re-enter the low power state.  For single shot low power support
>   * with wake-up notification, see
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
>   * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
>   * calling LOW_POWER_EXIT.
>   */
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
>  
>  /*
>   * This device feature has the same behavior as
>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
>   * provides an eventfd for wake-up notification.

It feels like this should be one entry point instead of two.

A flag "automatic re-sleep" and an optional eventfd (-1 means not
provided) seems to capture both of these behaviors in a bit clearer
and extendable way.

Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-07-26 17:23           ` Jason Gunthorpe
@ 2022-07-27  6:07             ` Abhishek Sahu
  2022-08-01 18:42               ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-27  6:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/26/2022 10:53 PM, Jason Gunthorpe wrote:
> On Tue, Jul 26, 2022 at 06:17:18PM +0530, Abhishek Sahu wrote:
>>  Thanks Alex for your thorough review of uAPI.
>>  I have incorporated all the suggestions.
>>  Following is the updated uAPI.
>>  
>>  /*
>>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
>>   * state with the platform-based power management.  Device use of lower power
>>   * states depends on factors managed by the runtime power management core,
>>   * including system level support and coordinating support among dependent
>>   * devices.  Enabling device low power entry does not guarantee lower power
>>   * usage by the device, nor is a mechanism provided through this feature to
>>   * know the current power state of the device.  If any device access happens
>>   * (either from the host or through the vfio uAPI) when the device is in the
>>   * low power state, then the host will move the device out of the low power
>>   * state as necessary prior to the access.  Once the access is completed, the
>>   * device may re-enter the low power state.  For single shot low power support
>>   * with wake-up notification, see
>>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
>>   * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
>>   * calling LOW_POWER_EXIT.
>>   */
>>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
>>  
>>  /*
>>   * This device feature has the same behavior as
>>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
>>   * provides an eventfd for wake-up notification.
> 
> It feels like this should be one entry point instead of two.
> 
> A flag "automatic re-sleep" and an optional eventfd (-1 means not
> provided) seems to capture both of these behaviors in a bit clearer
> and extendable way.
> 
> Jason

 We discussed about that in the earlier version of the patch series.
 Since we have different exit related handling, so to avoid confusion
 we proceeded with 2 separate variants for the low power entry. Also,
 we don't need any parameter for the first case.

 But, I can do the changes to make a single entry point, if we conclude
 for that. 

 From my side, I have explored how the uAPI looks like if
 we go with this approach.

 /*
  * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
  * state with the platform-based power management.  Device use of lower power
  * states depends on factors managed by the runtime power management core,
  * including system level support and coordinating support among dependent
  * devices.  Enabling device low power entry does not guarantee lower power
  * usage by the device, nor is a mechanism provided through this feature to
  * know the current power state of the device.  If any device access happens
  * (either from the host or through the vfio uAPI) when the device is in the
  * low power state, then the host will move the device out of the low power
  * state as necessary prior to the access.  Once the access is completed, the
  * device re-entry to a low power state will be controlled through
  * VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE flag.
  *
  * If LOW_POWER_REENTERY_DISABLE flag is not set, the device may re-enter the
  * low power state.  Access to mmap'd device regions is disabled on
  * LOW_POWER_ENTRY and may only be resumed after calling LOW_POWER_EXIT.
  *
  * If LOW_POWER_REENTERY_DISABLE flag is set, then user needs to provide an
  * eventfd for wake-up notification.  When the device moves out of the low
  * power state for the wake-up, the host will not allow the device to re-enter
  * a low power state without a subsequent user call to LOW_POWER_ENTRY.
  * Access to mmap'd device regions is disabled on LOW_POWER_ENTRY and may only
  * be resumed after the low power exit.  The low power exit can happen either
  * through LOW_POWER_EXIT or through any other access (where the wake-up
  * notification has been generated).  The access to mmap'd device regions will
  * not trigger low power exit.
  *
  * The notification through the provided eventfd will be generated only when
  * the device has entered and is resumed from a low power state after
  * calling this device feature IOCTL.  A device that has not entered low power
  * state, as managed through the runtime power management core, will not
  * generate a notification through the provided eventfd on access.  Calling the
  * LOW_POWER_EXIT feature is optional in the case where notification has been
  * signaled on the provided eventfd that a resume from low power has occurred.
  *
  * The wakeup_eventfd needs to be valid only if LOW_POWER_REENTERY_DISABLE
  * flag is set, otherwise, it will be ignored.
  */
 #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
 
 struct vfio_device_low_power_entry_with_wakeup {
 	__u32 flags;
 #define VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE	(1 << 0)
 	__s32 wakeup_eventfd;
 };
 
 /*
  * Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states as
  * previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY device feature.
  * This device feature IOCTL may itself generate a wakeup eventfd notification
  * if the device had previously entered a low power state with
  * VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE flag set.
  */
 #define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 4

 Thanks,
 Abhishek

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-07-27  6:07             ` Abhishek Sahu
@ 2022-08-01 18:42               ` Alex Williamson
  2022-08-02 14:04                 ` Jason Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-08-01 18:42 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Jason Gunthorpe, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Wed, 27 Jul 2022 11:37:02 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/26/2022 10:53 PM, Jason Gunthorpe wrote:
> > On Tue, Jul 26, 2022 at 06:17:18PM +0530, Abhishek Sahu wrote:  
> >>  Thanks Alex for your thorough review of uAPI.
> >>  I have incorporated all the suggestions.
> >>  Following is the updated uAPI.
> >>  
> >>  /*
> >>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
> >>   * state with the platform-based power management.  Device use of lower power
> >>   * states depends on factors managed by the runtime power management core,
> >>   * including system level support and coordinating support among dependent
> >>   * devices.  Enabling device low power entry does not guarantee lower power
> >>   * usage by the device, nor is a mechanism provided through this feature to
> >>   * know the current power state of the device.  If any device access happens
> >>   * (either from the host or through the vfio uAPI) when the device is in the
> >>   * low power state, then the host will move the device out of the low power
> >>   * state as necessary prior to the access.  Once the access is completed, the
> >>   * device may re-enter the low power state.  For single shot low power support
> >>   * with wake-up notification, see
> >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
> >>   * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
> >>   * calling LOW_POWER_EXIT.
> >>   */
> >>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
> >>  
> >>  /*
> >>   * This device feature has the same behavior as
> >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
> >>   * provides an eventfd for wake-up notification.  
> > 
> > It feels like this should be one entry point instead of two.
> > 
> > A flag "automatic re-sleep" and an optional eventfd (-1 means not
> > provided) seems to capture both of these behaviors in a bit clearer
> > and extendable way.

I think the mutual exclusion between re-entrant mode and one-shot is
quite a bit more subtle in the version below, so I don't particularly
find this cleaner.  Potentially we could have variant drivers support
one w/o the other in the previously proposed model as well.  It's
interesting to see this suggestion since since we seem to have a theme
of making features single purpose elsewhere.  Thanks,

Alex 

> 
>  We discussed about that in the earlier version of the patch series.
>  Since we have different exit related handling, so to avoid confusion
>  we proceeded with 2 separate variants for the low power entry. Also,
>  we don't need any parameter for the first case.
> 
>  But, I can do the changes to make a single entry point, if we conclude
>  for that. 
> 
>  From my side, I have explored how the uAPI looks like if
>  we go with this approach.
> 
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
>   * state with the platform-based power management.  Device use of lower power
>   * states depends on factors managed by the runtime power management core,
>   * including system level support and coordinating support among dependent
>   * devices.  Enabling device low power entry does not guarantee lower power
>   * usage by the device, nor is a mechanism provided through this feature to
>   * know the current power state of the device.  If any device access happens
>   * (either from the host or through the vfio uAPI) when the device is in the
>   * low power state, then the host will move the device out of the low power
>   * state as necessary prior to the access.  Once the access is completed, the
>   * device re-entry to a low power state will be controlled through
>   * VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE flag.
>   *
>   * If LOW_POWER_REENTERY_DISABLE flag is not set, the device may re-enter the
>   * low power state.  Access to mmap'd device regions is disabled on
>   * LOW_POWER_ENTRY and may only be resumed after calling LOW_POWER_EXIT.
>   *
>   * If LOW_POWER_REENTERY_DISABLE flag is set, then user needs to provide an
>   * eventfd for wake-up notification.  When the device moves out of the low
>   * power state for the wake-up, the host will not allow the device to re-enter
>   * a low power state without a subsequent user call to LOW_POWER_ENTRY.
>   * Access to mmap'd device regions is disabled on LOW_POWER_ENTRY and may only
>   * be resumed after the low power exit.  The low power exit can happen either
>   * through LOW_POWER_EXIT or through any other access (where the wake-up
>   * notification has been generated).  The access to mmap'd device regions will
>   * not trigger low power exit.
>   *
>   * The notification through the provided eventfd will be generated only when
>   * the device has entered and is resumed from a low power state after
>   * calling this device feature IOCTL.  A device that has not entered low power
>   * state, as managed through the runtime power management core, will not
>   * generate a notification through the provided eventfd on access.  Calling the
>   * LOW_POWER_EXIT feature is optional in the case where notification has been
>   * signaled on the provided eventfd that a resume from low power has occurred.
>   *
>   * The wakeup_eventfd needs to be valid only if LOW_POWER_REENTERY_DISABLE
>   * flag is set, otherwise, it will be ignored.
>   */
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
>  
>  struct vfio_device_low_power_entry_with_wakeup {
>  	__u32 flags;
>  #define VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE	(1 << 0)
>  	__s32 wakeup_eventfd;
>  };
>  
>  /*
>   * Upon VFIO_DEVICE_FEATURE_SET, disallow use of device low power states as
>   * previously enabled via VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY device feature.
>   * This device feature IOCTL may itself generate a wakeup eventfd notification
>   * if the device had previously entered a low power state with
>   * VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE flag set.
>   */
>  #define VFIO_DEVICE_FEATURE_LOW_POWER_EXIT 4
> 
>  Thanks,
>  Abhishek
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-08-01 18:42               ` Alex Williamson
@ 2022-08-02 14:04                 ` Jason Gunthorpe
  2022-08-02 15:41                   ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2022-08-02 14:04 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Abhishek Sahu, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Mon, Aug 01, 2022 at 12:42:53PM -0600, Alex Williamson wrote:
> On Wed, 27 Jul 2022 11:37:02 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
> > On 7/26/2022 10:53 PM, Jason Gunthorpe wrote:
> > > On Tue, Jul 26, 2022 at 06:17:18PM +0530, Abhishek Sahu wrote:  
> > >>  Thanks Alex for your thorough review of uAPI.
> > >>  I have incorporated all the suggestions.
> > >>  Following is the updated uAPI.
> > >>  
> > >>  /*
> > >>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
> > >>   * state with the platform-based power management.  Device use of lower power
> > >>   * states depends on factors managed by the runtime power management core,
> > >>   * including system level support and coordinating support among dependent
> > >>   * devices.  Enabling device low power entry does not guarantee lower power
> > >>   * usage by the device, nor is a mechanism provided through this feature to
> > >>   * know the current power state of the device.  If any device access happens
> > >>   * (either from the host or through the vfio uAPI) when the device is in the
> > >>   * low power state, then the host will move the device out of the low power
> > >>   * state as necessary prior to the access.  Once the access is completed, the
> > >>   * device may re-enter the low power state.  For single shot low power support
> > >>   * with wake-up notification, see
> > >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
> > >>   * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
> > >>   * calling LOW_POWER_EXIT.
> > >>   */
> > >>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
> > >>  
> > >>  /*
> > >>   * This device feature has the same behavior as
> > >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
> > >>   * provides an eventfd for wake-up notification.  
> > > 
> > > It feels like this should be one entry point instead of two.
> > > 
> > > A flag "automatic re-sleep" and an optional eventfd (-1 means not
> > > provided) seems to capture both of these behaviors in a bit clearer
> > > and extendable way.
> 
> I think the mutual exclusion between re-entrant mode and one-shot is
> quite a bit more subtle in the version below, so I don't particularly
> find this cleaner.  Potentially we could have variant drivers support
> one w/o the other in the previously proposed model as well.  It's
> interesting to see this suggestion since since we seem to have a theme
> of making features single purpose elsewhere.  Thanks,

It is still quite single purpose, just
VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE is some minor customization of
that single purpose.

Either the flag is set or not, it isn't subtle..

Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-08-02 14:04                 ` Jason Gunthorpe
@ 2022-08-02 15:41                   ` Alex Williamson
  2022-08-02 16:35                     ` Jason Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-08-02 15:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Abhishek Sahu, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, 2 Aug 2022 11:04:52 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Aug 01, 2022 at 12:42:53PM -0600, Alex Williamson wrote:
> > On Wed, 27 Jul 2022 11:37:02 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> > > On 7/26/2022 10:53 PM, Jason Gunthorpe wrote:  
> > > > On Tue, Jul 26, 2022 at 06:17:18PM +0530, Abhishek Sahu wrote:    
> > > >>  Thanks Alex for your thorough review of uAPI.
> > > >>  I have incorporated all the suggestions.
> > > >>  Following is the updated uAPI.
> > > >>  
> > > >>  /*
> > > >>   * Upon VFIO_DEVICE_FEATURE_SET, allow the device to be moved into a low power
> > > >>   * state with the platform-based power management.  Device use of lower power
> > > >>   * states depends on factors managed by the runtime power management core,
> > > >>   * including system level support and coordinating support among dependent
> > > >>   * devices.  Enabling device low power entry does not guarantee lower power
> > > >>   * usage by the device, nor is a mechanism provided through this feature to
> > > >>   * know the current power state of the device.  If any device access happens
> > > >>   * (either from the host or through the vfio uAPI) when the device is in the
> > > >>   * low power state, then the host will move the device out of the low power
> > > >>   * state as necessary prior to the access.  Once the access is completed, the
> > > >>   * device may re-enter the low power state.  For single shot low power support
> > > >>   * with wake-up notification, see
> > > >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP below.  Access to mmap'd
> > > >>   * device regions is disabled on LOW_POWER_ENTRY and may only be resumed after
> > > >>   * calling LOW_POWER_EXIT.
> > > >>   */
> > > >>  #define VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY 3
> > > >>  
> > > >>  /*
> > > >>   * This device feature has the same behavior as
> > > >>   * VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY with the exception that the user
> > > >>   * provides an eventfd for wake-up notification.    
> > > > 
> > > > It feels like this should be one entry point instead of two.
> > > > 
> > > > A flag "automatic re-sleep" and an optional eventfd (-1 means not
> > > > provided) seems to capture both of these behaviors in a bit clearer
> > > > and extendable way.  
> > 
> > I think the mutual exclusion between re-entrant mode and one-shot is
> > quite a bit more subtle in the version below, so I don't particularly
> > find this cleaner.  Potentially we could have variant drivers support
> > one w/o the other in the previously proposed model as well.  It's
> > interesting to see this suggestion since since we seem to have a theme
> > of making features single purpose elsewhere.  Thanks,  
> 
> It is still quite single purpose, just
> VFIO_DEVICE_LOW_POWER_REENTERY_DISABLE is some minor customization of
> that single purpose.
> 
> Either the flag is set or not, it isn't subtle..

The subtlety is that there's a flag and a field and the flag can only
be set if the field is set, the flag can only be clear if the field is
clear, so we return -EINVAL for the other cases?  Why do we have both a
flag and a field?  This isn't like we're adding a feature later and the
flag needs to indicate that the field is present and valid.  It's just
not a very clean interface, imo.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-08-02 15:41                   ` Alex Williamson
@ 2022-08-02 16:35                     ` Jason Gunthorpe
  2022-08-02 16:57                       ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2022-08-02 16:35 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Abhishek Sahu, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, Aug 02, 2022 at 09:41:28AM -0600, Alex Williamson wrote:

> The subtlety is that there's a flag and a field and the flag can only
> be set if the field is set, the flag can only be clear if the field is
> clear, so we return -EINVAL for the other cases?  Why do we have both a
> flag and a field?  This isn't like we're adding a feature later and the
> flag needs to indicate that the field is present and valid.  It's just
> not a very clean interface, imo.  Thanks,

That isn't how I read Abhishek's proposal.. The eventfd should always
work and should always behave as described "The notification through
the provided eventfd will be generated only when the device has
entered and is resumed from a low power state"

If userspace provides it without LOW_POWER_REENTERY_DISABLE then it
still generates the events.

The linkage to LOW_POWER_REENTERY_DISABLE is only that userspace
probably needs to use both elements together to generate the
auto-reentry behavior. Kernel should not enforce it.

Two fields, orthogonal behaviors.

Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-08-02 16:35                     ` Jason Gunthorpe
@ 2022-08-02 16:57                       ` Alex Williamson
  2022-08-02 17:01                         ` Jason Gunthorpe
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-08-02 16:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Abhishek Sahu, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, 2 Aug 2022 13:35:04 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, Aug 02, 2022 at 09:41:28AM -0600, Alex Williamson wrote:
> 
> > The subtlety is that there's a flag and a field and the flag can only
> > be set if the field is set, the flag can only be clear if the field is
> > clear, so we return -EINVAL for the other cases?  Why do we have both a
> > flag and a field?  This isn't like we're adding a feature later and the
> > flag needs to indicate that the field is present and valid.  It's just
> > not a very clean interface, imo.  Thanks,  
> 
> That isn't how I read Abhishek's proposal.. The eventfd should always
> work and should always behave as described "The notification through
> the provided eventfd will be generated only when the device has
> entered and is resumed from a low power state"
> 
> If userspace provides it without LOW_POWER_REENTERY_DISABLE then it
> still generates the events.
> 
> The linkage to LOW_POWER_REENTERY_DISABLE is only that userspace
> probably needs to use both elements together to generate the
> auto-reentry behavior. Kernel should not enforce it.
> 
> Two fields, orthogonal behaviors.

What's the point of notifying userspace that the device was resumed if
it might already be suspended again by the time userspace responds to
the eventfd?  This really muddies the single-shot vs re-entrant
behavior.  If userspace allows the kernel to re-enter low power, then
they don't need to be notified on every wake.  If userspace wants to be
notified on wake, userspace can also re-initiate low power entry.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-08-02 16:57                       ` Alex Williamson
@ 2022-08-02 17:01                         ` Jason Gunthorpe
  2022-08-03  6:32                           ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Jason Gunthorpe @ 2022-08-02 17:01 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Abhishek Sahu, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, Aug 02, 2022 at 10:57:55AM -0600, Alex Williamson wrote:
> On Tue, 2 Aug 2022 13:35:04 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Aug 02, 2022 at 09:41:28AM -0600, Alex Williamson wrote:
> > 
> > > The subtlety is that there's a flag and a field and the flag can only
> > > be set if the field is set, the flag can only be clear if the field is
> > > clear, so we return -EINVAL for the other cases?  Why do we have both a
> > > flag and a field?  This isn't like we're adding a feature later and the
> > > flag needs to indicate that the field is present and valid.  It's just
> > > not a very clean interface, imo.  Thanks,  
> > 
> > That isn't how I read Abhishek's proposal.. The eventfd should always
> > work and should always behave as described "The notification through
> > the provided eventfd will be generated only when the device has
> > entered and is resumed from a low power state"
> > 
> > If userspace provides it without LOW_POWER_REENTERY_DISABLE then it
> > still generates the events.
> > 
> > The linkage to LOW_POWER_REENTERY_DISABLE is only that userspace
> > probably needs to use both elements together to generate the
> > auto-reentry behavior. Kernel should not enforce it.
> > 
> > Two fields, orthogonal behaviors.
> 
> What's the point of notifying userspace that the device was resumed if
> it might already be suspended again by the time userspace responds to
> the eventfd? 

I don't know - the eventfds is counting so it does let userspace
monitor frequency of auto-sleeping.

In any case the point is to make simple kernel APIs, not cover every
combination with a use case. Decoupling is simpler than coupling.

Jason

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit
  2022-08-02 17:01                         ` Jason Gunthorpe
@ 2022-08-03  6:32                           ` Abhishek Sahu
  0 siblings, 0 replies; 26+ messages in thread
From: Abhishek Sahu @ 2022-08-03  6:32 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas, linux-kernel,
	kvm, linux-pm, linux-pci

On 8/2/2022 10:31 PM, Jason Gunthorpe wrote:
> On Tue, Aug 02, 2022 at 10:57:55AM -0600, Alex Williamson wrote:
>> On Tue, 2 Aug 2022 13:35:04 -0300
>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>>> On Tue, Aug 02, 2022 at 09:41:28AM -0600, Alex Williamson wrote:
>>>
>>>> The subtlety is that there's a flag and a field and the flag can only
>>>> be set if the field is set, the flag can only be clear if the field is
>>>> clear, so we return -EINVAL for the other cases?  Why do we have both a
>>>> flag and a field?  This isn't like we're adding a feature later and the
>>>> flag needs to indicate that the field is present and valid.  It's just
>>>> not a very clean interface, imo.  Thanks,  
>>>
>>> That isn't how I read Abhishek's proposal.. The eventfd should always
>>> work and should always behave as described "The notification through
>>> the provided eventfd will be generated only when the device has
>>> entered and is resumed from a low power state"
>>>
>>> If userspace provides it without LOW_POWER_REENTERY_DISABLE then it
>>> still generates the events.
>>>
>>> The linkage to LOW_POWER_REENTERY_DISABLE is only that userspace
>>> probably needs to use both elements together to generate the
>>> auto-reentry behavior. Kernel should not enforce it.
>>>
>>> Two fields, orthogonal behaviors.
>>
>> What's the point of notifying userspace that the device was resumed if
>> it might already be suspended again by the time userspace responds to
>> the eventfd? 
> 
> I don't know - the eventfds is counting so it does let userspace
> monitor frequency of auto-sleeping.
> 
> In any case the point is to make simple kernel APIs, not cover every
> combination with a use case. Decoupling is simpler than coupling.
> 
> Jason


 Thanks Alex and Jason for your inputs.

 It seems, I can use the original uAPI where we will have 2 variants
 of ENTRY.

 Since already kernel merge window is started, so I will wait for
 v5.20-rc1 (or v6.0-rc1) and then I will rebase and test my patches on
 the updated kernel.

 Regards,
 Abhishek

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2022-08-03  6:32 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-19 12:15 [PATCH v5 0/5] vfio/pci: power management changes Abhishek Sahu
2022-07-19 12:15 ` [PATCH v5 1/5] vfio: Add the device features for the low power entry and exit Abhishek Sahu
2022-07-21 22:34   ` Alex Williamson
2022-07-25 14:40     ` Abhishek Sahu
2022-07-25 22:09       ` Alex Williamson
2022-07-26 12:47         ` Abhishek Sahu
2022-07-26 13:13           ` Cornelia Huck
2022-07-26 14:17           ` Alex Williamson
2022-07-26 17:23           ` Jason Gunthorpe
2022-07-27  6:07             ` Abhishek Sahu
2022-08-01 18:42               ` Alex Williamson
2022-08-02 14:04                 ` Jason Gunthorpe
2022-08-02 15:41                   ` Alex Williamson
2022-08-02 16:35                     ` Jason Gunthorpe
2022-08-02 16:57                       ` Alex Williamson
2022-08-02 17:01                         ` Jason Gunthorpe
2022-08-03  6:32                           ` Abhishek Sahu
2022-07-19 12:15 ` [PATCH v5 2/5] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
2022-07-21 22:34   ` Alex Williamson
2022-07-19 12:15 ` [PATCH v5 3/5] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
2022-07-19 12:15 ` [PATCH v5 4/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY/EXIT Abhishek Sahu
2022-07-21 22:34   ` Alex Williamson
2022-07-25 14:48     ` Abhishek Sahu
2022-07-19 12:15 ` [PATCH v5 5/5] vfio/pci: Implement VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP Abhishek Sahu
2022-07-21 22:34   ` Alex Williamson
2022-07-25 15:04     ` Abhishek Sahu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).