linux-pci.vger.kernel.org archive mirror
* [PATCH v3 0/8] vfio/pci: power management changes
@ 2022-04-25  9:26 Abhishek Sahu
  2022-04-25  9:26 ` [PATCH v3 1/8] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
                   ` (7 more replies)
  0 siblings, 8 replies; 41+ messages in thread
From: Abhishek Sahu @ 2022-04-25  9:26 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

Currently, there is very limited power management support available
in the upstream vfio-pci driver. If there is no user of a vfio-pci
device, then it is moved into the D3hot state. Similarly, if runtime
power management is enabled for a vfio-pci device in the guest OS, then
the device is runtime suspended (for a Linux guest OS) and the PCI
device is put into the D3hot state (in vfio_pm_config_write()). If the
D3cold state can be used instead of D3hot, it will help save the
maximum amount of power. The D3cold state cannot be reached with native
PCI PM alone; it requires interaction with platform firmware, which is
system-specific. To enter low power states (including D3cold), the
runtime PM framework can be used, which internally interacts with PCI
and platform firmware and puts the device into the lowest possible
D-state. This patch series registers the vfio-pci driver with the
runtime PM framework and uses it to move the physical PCI device into
a low power state.

The current PM support was added with commit 6eb7018705de ("vfio-pci:
Move idle devices to D3hot power state") where the following point was
mentioned regarding D3cold state.

 "It's tempting to try to use D3cold, but we have no reason to inhibit
  hotplug of idle devices and we might get into a loop of having the
  device disappear before we have a chance to try to use it."

With runtime PM, if the user wants to prevent entering D3cold, then
/sys/bus/pci/devices/.../d3cold_allowed can be set to 0 for the
devices where the above behavior is required, instead of disallowing
the D3cold state in all cases.

Since the D3cold state cannot be reached by writing the standard PCI
PM config registers, a feature has been added to the DEVICE_FEATURE
IOCTL for low power handling, which moves the PCI device from D3hot to
D3cold and then from D3cold back to D0. Hypervisors can implement
virtual ACPI methods. For example, if a PCI device's ACPI node in a
Linux guest OS has _PR3 and _PR0 power resources with _ON/_OFF
methods, then the guest calls _OFF during the D3cold transition and
_ON during the D0 transition. The hypervisor can trap these virtual
ACPI calls and then issue the D3cold-related IOCTL to the vfio driver.

BAR access needs to be disabled if the device is in the D3hot state.
Also, there should not be any config access if the device is in the
D3cold state. For SR-IOV, the PF's power state should be higher than
the VFs' power state.

* Changes in v3

- Rebased patches on v5.18-rc3.
- Marked this series as PATCH instead of RFC.
- Addressed the review comments given in v2.
- Removed the limitation of keeping the device in the D0 state if
  there is any access from the host side. This is specific to the
  NVIDIA use case and will be handled separately.
- Used the existing DEVICE_FEATURE IOCTL instead of adding a new IOCTL
  for power management.
- Removed all custom power management code from the runtime
  suspend/resume callbacks and IOCTL handling. Now the callbacks
  contain code related to INTx handling and a few other things, and
  all the PCI state and platform PM handling is done by the PCI core
  functions themselves.
- Added support for wake-up in the main vfio layer itself, since we
  now have more vfio/pci based drivers.
- Instead of assigning 'struct dev_pm_ops' in each individual parent
  driver, vfio_pci_core itself now assigns 'struct dev_pm_ops'.
- Added handling of power management around SR-IOV handling.
- Moved the setting of drvdata in a separate patch.
- Masked INTx before entering the runtime suspended state.
- Changed the order of patches so that the fixes are at the beginning
  of the series.
- Removed the locally stored power state and used a new boolean to
  track the D3 (D3cold and D3hot) power state.
- Removed check for IO access in D3 power state.
- Used another helper function vfio_lock_and_set_power_state() instead
  of touching vfio_pci_set_power_state().
- Considered the fixes made in
  https://lore.kernel.org/lkml/20220217122107.22434-1-abhsahu@nvidia.com
  and updated the patches accordingly.

* Changes in v2
  (https://lore.kernel.org/lkml/20220124181726.19174-1-abhsahu@nvidia.com)

- Rebased patches on v5.17-rc1.
- Included the patch to handle BAR access in D3cold.
- Included the patch to fix memory leak.
- Made a separate IOCTL that can be used to change the power state from
  D3hot to D3cold and D3cold to D0.
- Addressed the review comments given in v1.

* v1
  https://lore.kernel.org/lkml/20211115133640.2231-1-abhsahu@nvidia.com/

Abhishek Sahu (8):
  vfio/pci: Invalidate mmaps and block the access in D3hot power state
  vfio/pci: Change the PF power state to D0 before enabling VFs
  vfio/pci: Virtualize PME related registers bits and initialize to zero
  vfio/pci: Add support for setting driver data inside core layer
  vfio/pci: Enable runtime PM for vfio_pci_core based drivers
  vfio: Invoke runtime PM API for IOCTL request
  vfio/pci: Mask INTx during runtime suspend
  vfio/pci: Add the support for PCI D3cold state

 .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c    |   4 +-
 drivers/vfio/pci/mlx5/main.c                  |   3 +-
 drivers/vfio/pci/vfio_pci.c                   |   4 +-
 drivers/vfio/pci/vfio_pci_config.c            |  63 ++-
 drivers/vfio/pci/vfio_pci_core.c              | 358 +++++++++++++++---
 drivers/vfio/pci/vfio_pci_intrs.c             |   6 +-
 drivers/vfio/pci/vfio_pci_rdwr.c              |   6 +-
 drivers/vfio/vfio.c                           |  44 ++-
 include/linux/vfio_pci_core.h                 |  12 +-
 include/uapi/linux/vfio.h                     |  18 +
 10 files changed, 445 insertions(+), 73 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH v3 1/8] vfio/pci: Invalidate mmaps and block the access in D3hot power state
  2022-04-25  9:26 [PATCH v3 0/8] vfio/pci: power management changes Abhishek Sahu
@ 2022-04-25  9:26 ` Abhishek Sahu
  2022-04-26  1:42   ` kernel test robot
  2022-04-25  9:26 ` [PATCH v3 2/8] vfio/pci: Change the PF power state to D0 before enabling VFs Abhishek Sahu
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Abhishek Sahu @ 2022-04-25  9:26 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

According to [PCIe v5 5.3.1.4.1] for D3hot state

 "Configuration and Message requests are the only TLPs accepted by a
  Function in the D3Hot state. All other received Requests must be
  handled as Unsupported Requests, and all received Completions may
  optionally be handled as Unexpected Completions."

Currently, if the vfio PCI device has been put into the D3hot state
and the user makes a non-config read/write request in D3hot, the
request will be forwarded to the host, and this access may cause
issues on a few systems.

This patch leverages the memory-disable support added in commit
abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on
disabled memory") to generate a page fault on mmap access and to
return an error for direct read/write. If the device is in the D3hot
state, then an error will be returned for MMIO access. IO access
generally does not make the system unresponsive, so IO access can
still happen in the D3hot state; the default value should be returned
in this case without bringing down the whole system.

Also, the power-related structure fields need to be protected, so the
same 'memory_lock' can be used to protect these fields as well.
This protection is mainly needed when the user changes the PCI
power state by writing into the PCI_PM_CTRL register.
The vfio_lock_and_set_power_state() wrapper function takes the
required locks and then invokes vfio_pci_set_power_state().

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 19 ++++++++++++++++++-
 drivers/vfio/pci/vfio_pci_core.c   |  4 +++-
 drivers/vfio/pci/vfio_pci_rdwr.c   |  6 ++++--
 include/linux/vfio_pci_core.h      |  1 +
 4 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 6e58b4bf7a60..dd557edae6e1 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -692,6 +692,23 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
 	return 0;
 }
 
+/*
+ * It takes all the required locks to protect the access of power related
+ * variables and then invokes vfio_pci_set_power_state().
+ */
+static void
+vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
+			      pci_power_t state)
+{
+	if (state >= PCI_D3hot)
+		vfio_pci_zap_and_down_write_memory_lock(vdev);
+	else
+		down_write(&vdev->memory_lock);
+
+	vfio_pci_set_power_state(vdev, state);
+	up_write(&vdev->memory_lock);
+}
+
 static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
 				int count, struct perm_bits *perm,
 				int offset, __le32 val)
@@ -718,7 +735,7 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
 			break;
 		}
 
-		vfio_pci_set_power_state(vdev, state);
+		vfio_lock_and_set_power_state(vdev, state);
 	}
 
 	return count;
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 06b6f3594a13..f3dfb033e1c4 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -230,6 +230,8 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	ret = pci_set_power_state(pdev, state);
 
 	if (!ret) {
+		vdev->power_state_d3 = (pdev->current_state >= PCI_D3hot);
+
 		/* D3 might be unsupported via quirk, skip unless in D3 */
 		if (needs_save && pdev->current_state >= PCI_D3hot) {
 			/*
@@ -1398,7 +1400,7 @@ static vm_fault_t vfio_pci_mmap_fault(struct vm_fault *vmf)
 	mutex_lock(&vdev->vma_lock);
 	down_read(&vdev->memory_lock);
 
-	if (!__vfio_pci_memory_enabled(vdev)) {
+	if (!__vfio_pci_memory_enabled(vdev) || vdev->power_state_d3) {
 		ret = VM_FAULT_SIGBUS;
 		goto up_out;
 	}
diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 82ac1569deb0..fac6bb40a4ce 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -43,7 +43,8 @@ static int vfio_pci_iowrite##size(struct vfio_pci_core_device *vdev,		\
 {									\
 	if (test_mem) {							\
 		down_read(&vdev->memory_lock);				\
-		if (!__vfio_pci_memory_enabled(vdev)) {			\
+		if (!__vfio_pci_memory_enabled(vdev) ||			\
+		    vdev->power_state_d3) {				\
 			up_read(&vdev->memory_lock);			\
 			return -EIO;					\
 		}							\
@@ -70,7 +71,8 @@ static int vfio_pci_ioread##size(struct vfio_pci_core_device *vdev,		\
 {									\
 	if (test_mem) {							\
 		down_read(&vdev->memory_lock);				\
-		if (!__vfio_pci_memory_enabled(vdev)) {			\
+		if (!__vfio_pci_memory_enabled(vdev) ||			\
+		    vdev->power_state_d3) {				\
 			up_read(&vdev->memory_lock);			\
 			return -EIO;					\
 		}							\
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 48f2dd3c568c..505b2a74a479 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -124,6 +124,7 @@ struct vfio_pci_core_device {
 	bool			needs_reset;
 	bool			nointx;
 	bool			needs_pm_restore;
+	bool			power_state_d3;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
-- 
2.17.1



* [PATCH v3 2/8] vfio/pci: Change the PF power state to D0 before enabling VFs
  2022-04-25  9:26 [PATCH v3 0/8] vfio/pci: power management changes Abhishek Sahu
  2022-04-25  9:26 ` [PATCH v3 1/8] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
@ 2022-04-25  9:26 ` Abhishek Sahu
  2022-04-25  9:26 ` [PATCH v3 3/8] vfio/pci: Virtualize PME related registers bits and initialize to zero Abhishek Sahu
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Abhishek Sahu @ 2022-04-25  9:26 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

According to [PCIe v5 9.6.2] for PF Device Power Management States

 "The PF's power management state (D-state) has global impact on its
  associated VFs. If a VF does not implement the Power Management
  Capability, then it behaves as if it is in an equivalent
  power state of its associated PF.

  If a VF implements the Power Management Capability, the Device behavior
  is undefined if the PF is placed in a lower power state than the VF.
  Software should avoid this situation by placing all VFs in lower power
  state before lowering their associated PF's power state."

From the vfio driver side, the user can enable SR-IOV while the PF is
in the D3hot state. If a VF does not implement the Power Management
Capability, then the VF will actually be in the D3hot state and VF BAR
access will fail. If a VF implements the Power Management Capability,
then the VF will assume that its current power state is D0 while the
PF is in D3hot, and in this case the behavior is undefined.

To support PF power management, we need to create a power management
dependency between the PF and its VFs. Runtime power management
support may help with this, since power management dependencies can be
expressed through device links. But until we have such support in
place, we can disallow the PF from entering a low power state while it
has VFs enabled. There can be a case where the user first enables the
VFs and then disables them. If there is no user of the PF, then the PF
could be put into the D3hot state again. But with this patch, the PF
will still be in the D0 state after disabling the VFs, since detecting
this case inside vfio_pci_core_sriov_configure() requires access to
struct vfio_device::open_count along with its locks. The subsequent
runtime PM patches will handle this case, since runtime PM maintains
its own usage count.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index f3dfb033e1c4..1271728a09db 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -217,6 +217,10 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	bool needs_restore = false, needs_save = false;
 	int ret;
 
+	/* Prevent changing power state for PFs with VFs enabled */
+	if (pci_num_vf(pdev) && state > PCI_D0)
+		return -EBUSY;
+
 	if (vdev->needs_pm_restore) {
 		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
 			pci_save_state(pdev);
@@ -1959,6 +1963,13 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
 		}
 		list_add_tail(&vdev->sriov_pfs_item, &vfio_pci_sriov_pfs);
 		mutex_unlock(&vfio_pci_sriov_pfs_mutex);
+
+		/*
+		 * The PF power state should always be higher than the VF power
+		 * state. If PF is in the low power state, then change the
+		 * power state to D0 first before enabling SR-IOV.
+		 */
+		vfio_pci_set_power_state(vdev, PCI_D0);
 		ret = pci_enable_sriov(pdev, nr_virtfn);
 		if (ret)
 			goto out_del;
-- 
2.17.1



* [PATCH v3 3/8] vfio/pci: Virtualize PME related registers bits and initialize to zero
  2022-04-25  9:26 [PATCH v3 0/8] vfio/pci: power management changes Abhishek Sahu
  2022-04-25  9:26 ` [PATCH v3 1/8] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
  2022-04-25  9:26 ` [PATCH v3 2/8] vfio/pci: Change the PF power state to D0 before enabling VFs Abhishek Sahu
@ 2022-04-25  9:26 ` Abhishek Sahu
  2022-04-25  9:26 ` [PATCH v3 4/8] vfio/pci: Add support for setting driver data inside core layer Abhishek Sahu
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Abhishek Sahu @ 2022-04-25  9:26 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

If a PME event is generated by a PCI device, it will mostly be handled
on the host by the root port PME code. For example, in the PCIe case,
the PME event is sent to the root port and a PME interrupt is then
generated. This is handled in drivers/pci/pcie/pme.c on the host side,
where pci_check_pme_status() is called and the PME_Status and PME_En
bits are cleared. So the guest OS using the vfio-pci device will never
learn about the PME event.

To handle these PME events inside guests, we need some framework so
that if any PME event happens, it is forwarded to the virtual machine
monitor. Until then, we can virtualize the PME-related register bits
and initialize them to zero, so the vfio-pci device user will assume
that the device is not capable of asserting the PME# signal from any
power state.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 33 +++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index dd557edae6e1..af0ae80ef324 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -755,12 +755,29 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
 	 */
 	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
 
+	/*
+	 * The guests can't process PME events. If any PME event will be
+	 * generated, then it will be mostly handled in the host and the
+	 * host will clear the PME_STATUS. So virtualize PME_Support bits.
+	 * The vconfig bits will be cleared during device capability
+	 * initialization.
+	 */
+	p_setw(perm, PCI_PM_PMC, PCI_PM_CAP_PME_MASK, NO_WRITE);
+
 	/*
 	 * Power management is defined *per function*, so we can let
 	 * the user change power state, but we trap and initiate the
 	 * change ourselves, so the state bits are read-only.
+	 *
+	 * The guest can't process PME from D3cold so virtualize PME_Status
+	 * and PME_En bits. The vconfig bits will be cleared during device
+	 * capability initialization.
 	 */
-	p_setd(perm, PCI_PM_CTRL, NO_VIRT, ~PCI_PM_CTRL_STATE_MASK);
+	p_setd(perm, PCI_PM_CTRL,
+	       PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
+	       ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
+		 PCI_PM_CTRL_STATE_MASK));
+
 	return 0;
 }
 
@@ -1429,6 +1446,17 @@ static int vfio_ext_cap_len(struct vfio_pci_core_device *vdev, u16 ecap, u16 epo
 	return 0;
 }
 
+static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
+					 int offset)
+{
+	__le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
+	__le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];
+
+	/* Clear vconfig PME_Support, PME_Status, and PME_En bits */
+	*pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
+	*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
+}
+
 static int vfio_fill_vconfig_bytes(struct vfio_pci_core_device *vdev,
 				   int offset, int size)
 {
@@ -1552,6 +1580,9 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
 		if (ret)
 			return ret;
 
+		if (cap == PCI_CAP_ID_PM)
+			vfio_update_pm_vconfig_bytes(vdev, pos);
+
 		prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
 		pos = next;
 		caps++;
-- 
2.17.1



* [PATCH v3 4/8] vfio/pci: Add support for setting driver data inside core layer
  2022-04-25  9:26 [PATCH v3 0/8] vfio/pci: power management changes Abhishek Sahu
                   ` (2 preceding siblings ...)
  2022-04-25  9:26 ` [PATCH v3 3/8] vfio/pci: Virtualize PME related registers bits and initialize to zero Abhishek Sahu
@ 2022-04-25  9:26 ` Abhishek Sahu
  2022-05-03 17:11   ` Alex Williamson
  2022-04-25  9:26 ` [PATCH v3 5/8] vfio/pci: Enable runtime PM for vfio_pci_core based drivers Abhishek Sahu
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Abhishek Sahu @ 2022-04-25  9:26 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

The vfio driver is divided into two layers: the core layer
(implemented in vfio_pci_core.c) and the parent drivers (for example
vfio_pci, mlx5_vfio_pci, hisi_acc_vfio_pci, etc.). All the parent
drivers call dev_set_drvdata() and assign their own structure as
driver data. Some of the callback functions are implemented in the
core layer, and these callbacks provide a reference to
'struct pci_dev' or 'struct device'. Currently, we use
vfio_device_get_from_dev(), which provides a reference to the
vfio_device for a device, but this function follows a long path to
extract it. There are a few cases where we don't need to go through
this long path if we can get it through drvdata.

This patch moves the setting of drvdata into the core layer. In the
current parent driver structures, 'struct vfio_pci_core_device' is the
first member, so a pointer to the parent structure and a pointer to
its 'struct vfio_pci_core_device' are the same.

struct hisi_acc_vf_core_device {
    struct vfio_pci_core_device core_device;
    ...
};

struct mlx5vf_pci_core_device {
    struct vfio_pci_core_device core_device;
    ...
};

The vfio_pci.c uses 'struct vfio_pci_core_device' itself.

To support getting the drvdata in both layers, we can add the
restriction that 'struct vfio_pci_core_device' must be the first
member. vfio_pci_core_register_device() validates this, which makes
sure that the prerequisite is always satisfied.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c    |  4 ++--
 drivers/vfio/pci/mlx5/main.c                  |  3 +--
 drivers/vfio/pci/vfio_pci.c                   |  4 ++--
 drivers/vfio/pci/vfio_pci_core.c              | 24 ++++++++++++++++---
 include/linux/vfio_pci_core.h                 |  7 +++++-
 5 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
index 767b5d47631a..c76c09302a8f 100644
--- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
+++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
@@ -1274,11 +1274,11 @@ static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device
 					  &hisi_acc_vfio_pci_ops);
 	}
 
-	ret = vfio_pci_core_register_device(&hisi_acc_vdev->core_device);
+	ret = vfio_pci_core_register_device(&hisi_acc_vdev->core_device,
+					    hisi_acc_vdev);
 	if (ret)
 		goto out_free;
 
-	dev_set_drvdata(&pdev->dev, hisi_acc_vdev);
 	return 0;
 
 out_free:
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index bbec5d288fee..8689248f66f3 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -614,11 +614,10 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
 		}
 	}
 
-	ret = vfio_pci_core_register_device(&mvdev->core_device);
+	ret = vfio_pci_core_register_device(&mvdev->core_device, mvdev);
 	if (ret)
 		goto out_free;
 
-	dev_set_drvdata(&pdev->dev, mvdev);
 	return 0;
 
 out_free:
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 2b047469e02f..e0f8027c5cd8 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -151,10 +151,10 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		return -ENOMEM;
 	vfio_pci_core_init_device(vdev, pdev, &vfio_pci_ops);
 
-	ret = vfio_pci_core_register_device(vdev);
+	ret = vfio_pci_core_register_device(vdev, vdev);
 	if (ret)
 		goto out_free;
-	dev_set_drvdata(&pdev->dev, vdev);
+
 	return 0;
 
 out_free:
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 1271728a09db..953ac33b2f5f 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1822,9 +1822,11 @@ void vfio_pci_core_uninit_device(struct vfio_pci_core_device *vdev)
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_uninit_device);
 
-int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
+int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
+				  void *driver_data)
 {
 	struct pci_dev *pdev = vdev->pdev;
+	struct device *dev = &pdev->dev;
 	int ret;
 
 	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
@@ -1843,6 +1845,17 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 		return -EBUSY;
 	}
 
+	/*
+	 * The 'struct vfio_pci_core_device' should be the first member of the
+	 * of the structure referenced by 'driver_data' so that it can be
+	 * retrieved with dev_get_drvdata() inside vfio-pci core layer.
+	 */
+	if ((struct vfio_pci_core_device *)driver_data != vdev) {
+		pci_warn(pdev, "Invalid driver data\n");
+		return -EINVAL;
+	}
+	dev_set_drvdata(dev, driver_data);
+
 	if (pci_is_root_bus(pdev->bus)) {
 		ret = vfio_assign_device_set(&vdev->vdev, vdev);
 	} else if (!pci_probe_reset_slot(pdev->slot)) {
@@ -1856,10 +1869,10 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 	}
 
 	if (ret)
-		return ret;
+		goto out_drvdata;
 	ret = vfio_pci_vf_init(vdev);
 	if (ret)
-		return ret;
+		goto out_drvdata;
 	ret = vfio_pci_vga_init(vdev);
 	if (ret)
 		goto out_vf;
@@ -1890,6 +1903,8 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 		vfio_pci_set_power_state(vdev, PCI_D0);
 out_vf:
 	vfio_pci_vf_uninit(vdev);
+out_drvdata:
+	dev_set_drvdata(dev, NULL);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_register_device);
@@ -1897,6 +1912,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_register_device);
 void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
+	struct device *dev = &pdev->dev;
 
 	vfio_pci_core_sriov_configure(pdev, 0);
 
@@ -1907,6 +1923,8 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
 
 	if (!disable_idle_d3)
 		vfio_pci_set_power_state(vdev, PCI_D0);
+
+	dev_set_drvdata(dev, NULL);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
 
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 505b2a74a479..3c7d65e68340 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -225,7 +225,12 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev);
 void vfio_pci_core_init_device(struct vfio_pci_core_device *vdev,
 			       struct pci_dev *pdev,
 			       const struct vfio_device_ops *vfio_pci_ops);
-int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev);
+/*
+ * The 'struct vfio_pci_core_device' should be the first member
+ * of the structure referenced by 'driver_data'.
+ */
+int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
+				  void *driver_data);
 void vfio_pci_core_uninit_device(struct vfio_pci_core_device *vdev);
 void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev);
 int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn);
-- 
2.17.1



* [PATCH v3 5/8] vfio/pci: Enable runtime PM for vfio_pci_core based drivers
  2022-04-25  9:26 [PATCH v3 0/8] vfio/pci: power management changes Abhishek Sahu
                   ` (3 preceding siblings ...)
  2022-04-25  9:26 ` [PATCH v3 4/8] vfio/pci: Add support for setting driver data inside core layer Abhishek Sahu
@ 2022-04-25  9:26 ` Abhishek Sahu
  2022-05-04 19:42   ` Alex Williamson
  2022-04-25  9:26 ` [PATCH v3 6/8] vfio: Invoke runtime PM API for IOCTL request Abhishek Sahu
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Abhishek Sahu @ 2022-04-25  9:26 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

Currently, there is very limited power management support
available in the upstream vfio_pci_core based drivers. If there
are no users of the device, then the PCI device is moved into the
D3hot state by writing directly into the PCI PM registers. This D3hot
state helps save power, but we can achieve zero power consumption
if we go into the D3cold state. The D3cold state cannot be reached
with native PCI PM alone; it requires interaction with platform
firmware, which is system-specific. To enter low power states
(including D3cold), the runtime PM framework can be used, which
internally interacts with PCI and platform firmware and puts the
device into the lowest possible D-state.

This patch registers vfio_pci_core based drivers with the
runtime PM framework.

1. The PCI core framework takes care of most of the runtime PM
   related things. To enable runtime PM, the PCI driver needs to
   decrement the usage count and, at a minimum, provide a
   'struct dev_pm_ops'. The runtime suspend/resume callbacks are
   optional and are only needed if we must do extra handling. Now that
   there are multiple vfio_pci_core based drivers, instead of
   assigning 'struct dev_pm_ops' in each individual parent driver,
   vfio_pci_core itself assigns it. Other drivers also assign
   'struct dev_pm_ops' inside a core layer (for example,
   wlcore_probe() and some sound-based drivers).

2. This patch provides the stub implementation of 'struct dev_pm_ops'.
   The subsequent patch will provide the runtime suspend/resume
   callbacks. All the config state saving, and PCI power management
   related things will be done by PCI core framework itself inside its
   runtime suspend/resume callbacks (pci_pm_runtime_suspend() and
   pci_pm_runtime_resume()).

3. Inside pci_reset_bus(), all the devices in the dev_set need to be
   runtime resumed. vfio_pci_dev_set_pm_runtime_get() takes care of
   the runtime resume and its error handling.

4. Inside vfio_pci_core_disable(), the device usage count always needs
   to be decremented, since it was incremented in vfio_pci_core_enable().

5. Since the runtime PM framework provides the same functionality,
   writing directly into the PCI PM config register can be replaced
   with runtime PM routines. The use of runtime PM can also save
   more power.

   In the systems which do not support D3cold,

   With the existing implementation:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3hot
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D0

   With runtime PM:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3hot
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D3hot

   So, with runtime PM, the upstream bridge or root port will also go
   into a lower power state, which is not possible with the existing
   implementation.

   In the systems which support D3cold,

   With the existing implementation:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3hot
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D0

   With runtime PM:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3cold
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D3cold

   So, with runtime PM, both the PCI device and upstream bridge will
   go into D3cold state.

6. If the 'disable_idle_d3' module parameter is set, runtime PM will
   still be enabled, but in this case the usage count should not be
   decremented.

7. The vfio_pci_dev_set_try_reset() return value is now unused, so its
   return type can be changed to void.

8. Use the runtime PM APIs in vfio_pci_core_sriov_configure().
   To prevent any runtime PM usage-count mismatch, pci_num_vf() is
   checked explicitly during disable.
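The SR-IOV balancing in point 8 can be sketched with a toy model
(illustrative only; the toy_* names are hypothetical, not kernel APIs):
enabling VFs takes a PM reference, and the disable path drops it only when
VFs were actually enabled, so a redundant disable cannot underflow the count.

```c
#include <assert.h>

/* Toy model of the usage-count balancing in SR-IOV configure. */
struct toy_pf {
	int pm_usage;    /* runtime PM usage count */
	int num_vf;      /* models pci_num_vf() */
};

static int toy_sriov_configure(struct toy_pf *pf, int nr_virtfn)
{
	if (nr_virtfn) {
		pf->pm_usage++;          /* pm_runtime_resume_and_get() */
		pf->num_vf = nr_virtfn;  /* pci_enable_sriov() */
		return nr_virtfn;
	}
	if (pf->num_vf) {                /* only if VFs are enabled */
		pf->num_vf = 0;          /* pci_disable_sriov() */
		pf->pm_usage--;          /* pm_runtime_put() */
	}
	return 0;
}
```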

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 169 +++++++++++++++++++++----------
 1 file changed, 114 insertions(+), 55 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 953ac33b2f5f..aee5e0cd6137 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -156,7 +156,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 }
 
 struct vfio_pci_group_info;
-static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
+static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 				      struct vfio_pci_group_info *groups);
 
@@ -261,6 +261,19 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	return ret;
 }
 
+#ifdef CONFIG_PM
+/*
+ * The dev_pm_ops needs to be provided to make pci-driver runtime PM work,
+ * so use a structure without any callbacks.
+ *
+ * The pci-driver core runtime PM routines always save the device state
+ * before going into the suspended state. If the device enters a low power
+ * state via runtime PM ops only, then no explicit handling is needed
+ * for devices which have NoSoftRst-.
+ */
+static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
+#endif
+
 int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -268,21 +281,23 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 	u16 cmd;
 	u8 msix_pos;
 
-	vfio_pci_set_power_state(vdev, PCI_D0);
+	if (!disable_idle_d3) {
+		ret = pm_runtime_resume_and_get(&pdev->dev);
+		if (ret < 0)
+			return ret;
+	}
 
 	/* Don't allow our initial saved state to include busmaster */
 	pci_clear_master(pdev);
 
 	ret = pci_enable_device(pdev);
 	if (ret)
-		return ret;
+		goto out_power;
 
 	/* If reset fails because of the device lock, fail this path entirely */
 	ret = pci_try_reset_function(pdev);
-	if (ret == -EAGAIN) {
-		pci_disable_device(pdev);
-		return ret;
-	}
+	if (ret == -EAGAIN)
+		goto out_disable_device;
 
 	vdev->reset_works = !ret;
 	pci_save_state(pdev);
@@ -306,12 +321,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 	}
 
 	ret = vfio_config_init(vdev);
-	if (ret) {
-		kfree(vdev->pci_saved_state);
-		vdev->pci_saved_state = NULL;
-		pci_disable_device(pdev);
-		return ret;
-	}
+	if (ret)
+		goto out_free_state;
 
 	msix_pos = pdev->msix_cap;
 	if (msix_pos) {
@@ -332,6 +343,16 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 
 
 	return 0;
+
+out_free_state:
+	kfree(vdev->pci_saved_state);
+	vdev->pci_saved_state = NULL;
+out_disable_device:
+	pci_disable_device(pdev);
+out_power:
+	if (!disable_idle_d3)
+		pm_runtime_put(&pdev->dev);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_enable);
 
@@ -439,8 +460,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 out:
 	pci_disable_device(pdev);
 
-	if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D3hot);
+	vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
+
+	/* Put the pm-runtime usage counter acquired during enable */
+	if (!disable_idle_d3)
+		pm_runtime_put(&pdev->dev);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
 
@@ -1879,19 +1903,24 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
 
 	vfio_pci_probe_power_state(vdev);
 
-	if (!disable_idle_d3) {
-		/*
-		 * pci-core sets the device power state to an unknown value at
-		 * bootup and after being removed from a driver.  The only
-		 * transition it allows from this unknown state is to D0, which
-		 * typically happens when a driver calls pci_enable_device().
-		 * We're not ready to enable the device yet, but we do want to
-		 * be able to get to D3.  Therefore first do a D0 transition
-		 * before going to D3.
-		 */
-		vfio_pci_set_power_state(vdev, PCI_D0);
-		vfio_pci_set_power_state(vdev, PCI_D3hot);
-	}
+	/*
+	 * pci-core sets the device power state to an unknown value at
+	 * bootup and after being removed from a driver.  The only
+	 * transition it allows from this unknown state is to D0, which
+	 * typically happens when a driver calls pci_enable_device().
+	 * We're not ready to enable the device yet, but we do want to
+	 * be able to get to D3.  Therefore first do a D0 transition
+	 * before enabling runtime PM.
+	 */
+	vfio_pci_set_power_state(vdev, PCI_D0);
+
+#if defined(CONFIG_PM)
+	dev->driver->pm = &vfio_pci_core_pm_ops;
+#endif
+
+	pm_runtime_allow(dev);
+	if (!disable_idle_d3)
+		pm_runtime_put(dev);
 
 	ret = vfio_register_group_dev(&vdev->vdev);
 	if (ret)
@@ -1900,7 +1929,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
 
 out_power:
 	if (!disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D0);
+		pm_runtime_get_noresume(dev);
+
+	pm_runtime_forbid(dev);
 out_vf:
 	vfio_pci_vf_uninit(vdev);
 out_drvdata:
@@ -1922,8 +1953,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
 	vfio_pci_vga_uninit(vdev);
 
 	if (!disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D0);
+		pm_runtime_get_noresume(dev);
 
+	pm_runtime_forbid(dev);
 	dev_set_drvdata(dev, NULL);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
@@ -1984,18 +2016,26 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
 
 		/*
 		 * The PF power state should always be higher than the VF power
-		 * state. If PF is in the low power state, then change the
-		 * power state to D0 first before enabling SR-IOV.
+		 * state. If PF is in the runtime suspended state, then resume
+		 * it first before enabling SR-IOV.
 		 */
-		vfio_pci_set_power_state(vdev, PCI_D0);
-		ret = pci_enable_sriov(pdev, nr_virtfn);
+		ret = pm_runtime_resume_and_get(&pdev->dev);
 		if (ret)
 			goto out_del;
+
+		ret = pci_enable_sriov(pdev, nr_virtfn);
+		if (ret) {
+			pm_runtime_put(&pdev->dev);
+			goto out_del;
+		}
 		ret = nr_virtfn;
 		goto out_put;
 	}
 
-	pci_disable_sriov(pdev);
+	if (pci_num_vf(pdev)) {
+		pci_disable_sriov(pdev);
+		pm_runtime_put(&pdev->dev);
+	}
 
 out_del:
 	mutex_lock(&vfio_pci_sriov_pfs_mutex);
@@ -2072,6 +2112,30 @@ vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set)
 	return pdev;
 }
 
+static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
+{
+	struct vfio_pci_core_device *cur_pm;
+	struct vfio_pci_core_device *cur;
+	int ret = 0;
+
+	list_for_each_entry(cur_pm, &dev_set->device_list, vdev.dev_set_list) {
+		ret = pm_runtime_resume_and_get(&cur_pm->pdev->dev);
+		if (ret < 0)
+			break;
+	}
+
+	if (!ret)
+		return 0;
+
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
+		if (cur == cur_pm)
+			break;
+		pm_runtime_put(&cur->pdev->dev);
+	}
+
+	return ret;
+}
+
 /*
  * We need to get memory_lock for each device, but devices can share mmap_lock,
  * therefore we need to zap and hold the vma_lock for each device, and only then
@@ -2178,43 +2242,38 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
  *  - At least one of the affected devices is marked dirty via
  *    needs_reset (such as by lack of FLR support)
  * Then attempt to perform that bus or slot reset.
- * Returns true if the dev_set was reset.
  */
-static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
+static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
 {
 	struct vfio_pci_core_device *cur;
 	struct pci_dev *pdev;
-	int ret;
+	bool reset_done = false;
 
 	if (!vfio_pci_dev_set_needs_reset(dev_set))
-		return false;
+		return;
 
 	pdev = vfio_pci_dev_set_resettable(dev_set);
 	if (!pdev)
-		return false;
+		return;
 
 	/*
-	 * The pci_reset_bus() will reset all the devices in the bus.
-	 * The power state can be non-D0 for some of the devices in the bus.
-	 * For these devices, the pci_reset_bus() will internally set
-	 * the power state to D0 without vfio driver involvement.
-	 * For the devices which have NoSoftRst-, the reset function can
-	 * cause the PCI config space reset without restoring the original
-	 * state (saved locally in 'vdev->pm_save').
+	 * Some of the devices in the bus can be in the runtime suspended
+	 * state. Increment the usage count for all the devices in the dev_set
+	 * before reset and decrement the same after reset.
 	 */
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
-		vfio_pci_set_power_state(cur, PCI_D0);
+	if (!disable_idle_d3 && vfio_pci_dev_set_pm_runtime_get(dev_set))
+		return;
 
-	ret = pci_reset_bus(pdev);
-	if (ret)
-		return false;
+	if (!pci_reset_bus(pdev))
+		reset_done = true;
 
 	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
-		cur->needs_reset = false;
+		if (reset_done)
+			cur->needs_reset = false;
+
 		if (!disable_idle_d3)
-			vfio_pci_set_power_state(cur, PCI_D3hot);
+			pm_runtime_put(&cur->pdev->dev);
 	}
-	return true;
 }
 
 void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 6/8] vfio: Invoke runtime PM API for IOCTL request
  2022-04-25  9:26 [PATCH v3 0/8] vfio/pci: power management changes Abhishek Sahu
                   ` (4 preceding siblings ...)
  2022-04-25  9:26 ` [PATCH v3 5/8] vfio/pci: Enable runtime PM for vfio_pci_core based drivers Abhishek Sahu
@ 2022-04-25  9:26 ` Abhishek Sahu
  2022-05-04 19:42   ` Alex Williamson
  2022-04-25  9:26 ` [PATCH v3 7/8] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
  2022-04-25  9:26 ` [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
  7 siblings, 1 reply; 41+ messages in thread
From: Abhishek Sahu @ 2022-04-25  9:26 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

The vfio/pci driver will have runtime power management support where the
user can put the device into low power state and then the PCI device can
go into the D3cold state. If the device is in low power state and the user
issues any IOCTL, then the device should be moved out of low power state
first. Once the IOCTL is serviced, it can go into low power state again.
The runtime PM framework manages this with the help of a usage count. One
option was to add the runtime PM related APIs inside the vfio/pci driver,
but some IOCTLs (like VFIO_DEVICE_FEATURE) can follow a different path and
more IOCTLs can be added in the future. Also, runtime PM will currently be
added only for the vfio/pci based variant drivers, but other vfio based
drivers can use the same support in the future. So, this patch adds the
runtime PM related API calls in the top-level IOCTL function itself.

For the vfio drivers which currently do not have runtime power management
support, the runtime PM APIs won't be invoked. Only for the vfio/pci
based drivers will the runtime PM APIs be invoked to increment and
decrement the usage count. Holding this usage count while servicing an
IOCTL makes sure that the user can't put the device into low power
state while any other IOCTL is being serviced in parallel.
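The wrapping pattern described above can be sketched with a toy model
(illustrative only; the toy_* names are hypothetical, not the kernel API):
take a usage-count reference before dispatching the IOCTL, drop it
afterwards, and skip both steps for drivers without runtime PM support.

```c
#include <assert.h>

/* Toy model of the top-level IOCTL runtime PM wrapping. */
struct toy_vdev {
	int has_pm_ops;  /* models dev->driver->pm != NULL */
	int usage_count;
};

static int toy_pm_runtime_get(struct toy_vdev *d)
{
	if (!d->has_pm_ops)
		return 0;        /* driver does not use runtime PM */
	d->usage_count++;        /* pm_runtime_resume_and_get() */
	return 1;
}

static int toy_unl_ioctl(struct toy_vdev *d)
{
	int pm_ret = toy_pm_runtime_get(d);

	if (pm_ret < 0)
		return pm_ret;
	/* ... dispatch the IOCTL while the reference is held ... */
	if (pm_ret)
		d->usage_count--;  /* pm_runtime_put() */
	return 0;
}
```

While the IOCTL runs, the count is elevated, so a concurrent low power
request cannot suspend the device mid-service.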

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/vfio.c | 44 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index a4555014bd1e..4e65a127744e 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -32,6 +32,7 @@
 #include <linux/vfio.h>
 #include <linux/wait.h>
 #include <linux/sched/signal.h>
+#include <linux/pm_runtime.h>
 #include "vfio.h"
 
 #define DRIVER_VERSION	"0.3"
@@ -1536,6 +1537,30 @@ static const struct file_operations vfio_group_fops = {
 	.release	= vfio_group_fops_release,
 };
 
+/*
+ * Wrapper around pm_runtime_resume_and_get().
+ * Returns 0 if the driver power management callbacks are not present,
+ * i.e. the driver is not using runtime power management.
+ * Returns 1 upon success, otherwise -errno.
+ */
+static inline int vfio_device_pm_runtime_get(struct device *dev)
+{
+#ifdef CONFIG_PM
+	int ret;
+
+	if (!dev->driver || !dev->driver->pm)
+		return 0;
+
+	ret = pm_runtime_resume_and_get(dev);
+	if (ret < 0)
+		return ret;
+
+	return 1;
+#else
+	return 0;
+#endif
+}
+
 /*
  * VFIO Device fd
  */
@@ -1845,15 +1870,28 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
 				       unsigned int cmd, unsigned long arg)
 {
 	struct vfio_device *device = filep->private_data;
+	int pm_ret, ret = 0;
+
+	pm_ret = vfio_device_pm_runtime_get(device->dev);
+	if (pm_ret < 0)
+		return pm_ret;
 
 	switch (cmd) {
 	case VFIO_DEVICE_FEATURE:
-		return vfio_ioctl_device_feature(device, (void __user *)arg);
+		ret = vfio_ioctl_device_feature(device, (void __user *)arg);
+		break;
 	default:
 		if (unlikely(!device->ops->ioctl))
-			return -EINVAL;
-		return device->ops->ioctl(device, cmd, arg);
+			ret = -EINVAL;
+		else
+			ret = device->ops->ioctl(device, cmd, arg);
+		break;
 	}
+
+	if (pm_ret)
+		pm_runtime_put(device->dev);
+
+	return ret;
 }
 
 static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 7/8] vfio/pci: Mask INTx during runtime suspend
  2022-04-25  9:26 [PATCH v3 0/8] vfio/pci: power management changes Abhishek Sahu
                   ` (5 preceding siblings ...)
  2022-04-25  9:26 ` [PATCH v3 6/8] vfio: Invoke runtime PM API for IOCTL request Abhishek Sahu
@ 2022-04-25  9:26 ` Abhishek Sahu
  2022-04-25  9:26 ` [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
  7 siblings, 0 replies; 41+ messages in thread
From: Abhishek Sahu @ 2022-04-25  9:26 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

This patch just adds INTx handling during runtime suspend/resume; all
the suspend/resume related code for the user to put the device into
low power state will be added in subsequent patches.

INTx can be shared among devices. Whenever any INTx interrupt arrives
for the vfio devices, vfio_intx_handler() will be called for each
device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx()
to check whether the interrupt was generated by the current device.
Now, if the device is already in D3cold state, then the config space
can not be read, and an attempt to read it in D3cold state can make
the system unresponsive on a few systems. To prevent this, mask INTx
in the runtime suspend callback and unmask the same in the runtime
resume callback. If INTx has already been masked, then no handling is
needed in the runtime suspend/resume callbacks. 'pm_intx_masked' tracks
this, and vfio_pci_intx_mask() has been updated to return true if INTx
was masked inside this function.

For the runtime suspend which is triggered when there is no user of the
vfio device, is_intx() will return false and these callbacks won't do
anything.

MSI/MSI-X interrupts are not shared, so no handling should be needed
for them. vfio_msihandler() triggers eventfd_signal() without doing any
device specific config access, and when the user receives this signal
and then performs any config access or IOCTL, the device will be moved
to D0 state first before the request is serviced.
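The mask-and-remember scheme above can be sketched with a toy model
(illustrative only; the toy_* names are hypothetical stand-ins for the
vfio-pci helpers): suspend masks INTx only if it is enabled and not already
masked, and resume unmasks only what suspend itself masked.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the pm_intx_masked tracking. */
struct toy_intx {
	bool enabled;        /* models is_intx() */
	bool masked;         /* models ctx[0].masked */
	bool pm_intx_masked; /* did runtime suspend mask it? */
};

static bool toy_intx_mask(struct toy_intx *i)
{
	if (i->masked)
		return false;  /* already masked (e.g. by the user) */
	i->masked = true;
	return true;
}

static void toy_runtime_suspend(struct toy_intx *i)
{
	i->pm_intx_masked = i->enabled && toy_intx_mask(i);
}

static void toy_runtime_resume(struct toy_intx *i)
{
	if (i->pm_intx_masked)
		i->masked = false;  /* unmask only what suspend masked */
}
```

A user-masked INTx therefore stays masked across a suspend/resume cycle,
while an unmasked one is transparently masked and restored.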

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c  | 35 +++++++++++++++++++++++++++----
 drivers/vfio/pci/vfio_pci_intrs.c |  6 +++++-
 include/linux/vfio_pci_core.h     |  3 ++-
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index aee5e0cd6137..05a68ca9d9e7 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -262,16 +262,43 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 }
 
 #ifdef CONFIG_PM
+static int vfio_pci_core_runtime_suspend(struct device *dev)
+{
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
+
+	/*
+	 * If INTx is enabled, then mask INTx before going into runtime
+	 * suspended state and unmask the same in the runtime resume.
+	 * If INTx has already been masked by the user, then
+	 * vfio_pci_intx_mask() will return false and in that case, INTx
+	 * should not be unmasked in the runtime resume.
+	 */
+	vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev));
+
+	return 0;
+}
+
+static int vfio_pci_core_runtime_resume(struct device *dev)
+{
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
+
+	if (vdev->pm_intx_masked)
+		vfio_pci_intx_unmask(vdev);
+
+	return 0;
+}
+
 /*
- * The dev_pm_ops needs to be provided to make pci-driver runtime PM work,
- * so use a structure without any callbacks.
- *
  * The pci-driver core runtime PM routines always save the device state
  * before going into the suspended state. If the device enters a low power
  * state via runtime PM ops only, then no explicit handling is needed
  * for devices which have NoSoftRst-.
  */
-static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
+static const struct dev_pm_ops vfio_pci_core_pm_ops = {
+	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
+			   vfio_pci_core_runtime_resume,
+			   NULL)
+};
 #endif
 
 int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 6069a11fb51a..1a37db99df48 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
 		eventfd_signal(vdev->ctx[0].trigger, 1);
 }
 
-void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
+/* Returns true if INTx has been masked by this function. */
+bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	unsigned long flags;
+	bool intx_masked = false;
 
 	spin_lock_irqsave(&vdev->irqlock, flags);
 
@@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 			disable_irq_nosync(pdev->irq);
 
 		vdev->ctx[0].masked = true;
+		intx_masked = true;
 	}
 
 	spin_unlock_irqrestore(&vdev->irqlock, flags);
+	return intx_masked;
 }
 
 /*
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 3c7d65e68340..e84f31e44238 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -125,6 +125,7 @@ struct vfio_pci_core_device {
 	bool			nointx;
 	bool			needs_pm_restore;
 	bool			power_state_d3;
+	bool			pm_intx_masked;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
@@ -148,7 +149,7 @@ struct vfio_pci_core_device {
 #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
 #define irq_is(vdev, type) (vdev->irq_type == type)
 
-extern void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
+extern bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
 extern void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
 
 extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-04-25  9:26 [PATCH v3 0/8] vfio/pci: power management changes Abhishek Sahu
                   ` (6 preceding siblings ...)
  2022-04-25  9:26 ` [PATCH v3 7/8] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
@ 2022-04-25  9:26 ` Abhishek Sahu
  2022-05-04 19:45   ` Alex Williamson
  7 siblings, 1 reply; 41+ messages in thread
From: Abhishek Sahu @ 2022-04-25  9:26 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

Currently, if runtime power management is enabled for a vfio-pci based
device in the guest OS, then the guest OS will write the PCI_PM_CTRL
register. This write request is handled in vfio_pm_config_write(),
which performs the actual PCI_PM_CTRL register write. With this, at
most the D3hot state can be achieved for low power. If we can use the
runtime PM framework instead, then we can achieve the D3cold state,
which helps in saving maximum power.

1. Since the D3cold state can't be achieved by writing the PCI
   standard PM config registers, this patch adds a new feature to the
   existing VFIO_DEVICE_FEATURE IOCTL. This IOCTL can be used
   to change the PCI device from D3hot to D3cold state and
   then from D3cold back to D0 state. The device feature uses the term
   "low power" instead of D3cold so that if another vfio driver wants
   to implement low power support, the same IOCTL can be used.

2. The hypervisors can implement virtual ACPI methods. For
   example, in a Linux guest OS, if the PCI device ACPI node has _PR3
   and _PR0 power resources with _ON/_OFF methods, then the guest OS
   makes the _OFF call during the D3cold transition and then _ON during
   the D0 transition. The hypervisor can trap these virtual ACPI calls
   and then issue the D3cold related IOCTL to the vfio driver.

3. The vfio driver uses runtime PM framework to achieve the
   D3cold state. For the D3cold transition, decrement the usage count and
   for the D0 transition, increment the usage count.

4. For D3cold, the device's current power state should be D3hot.
   Then during runtime suspend, pci_platform_power_transition() is
   required for the D3cold state. If the D3cold state is not supported,
   then the device will remain in D3hot state, but with runtime PM the
   root port can now also go into the suspended state.

5. For most systems, D3cold is supported at the root port level. So,
   when the root port transitions to D3cold state, the vfio PCI device
   will go from D3hot to D3cold state during its runtime suspend. If
   the root port does not support D3cold, then the root port will go
   into D3hot state.

6. The runtime suspend callback can now happen in 2 cases: when there
   are no users of the vfio device, and when the user has initiated
   D3cold. The 'platform_pm_engaged' flag helps to distinguish
   between these 2 cases.

7. In D3cold, all kinds of BAR related access need to be disabled,
   as in D3hot. Additionally, the config space will also be
   inaccessible in D3cold state. To prevent config space access in
   D3cold state, increment the runtime PM usage count before doing any
   config space access.

8. If the user has engaged low power entry through the IOCTL, then the
   user should do low power exit first before issuing config access or
   another IOCTL. We could add an explicit error check, but since we
   are already waking up the device, the IOCTL and config access can be
   fulfilled anyway. However, 'power_state_d3' won't be cleared without
   issuing low power exit, so all BAR related access will still return
   an error until the user does the low power exit.

9. Since multiple layers are involved, the following is the high level
   code flow for D3cold entry and exit.

D3cold entry:

a. The user puts the PCI device into D3hot by writing into the standard
   config register (vfio_pm_config_write() -> vfio_lock_and_set_power_state() ->
   vfio_pci_set_power_state()). The device power state will be D3hot and
   power_state_d3 will be true.
b. Set vfio_device_feature_power_management::low_power_state =
   VFIO_DEVICE_LOW_POWER_STATE_ENTER and call the VFIO_DEVICE_FEATURE IOCTL.
c. Inside vfio_device_fops_unl_ioctl(), pm_runtime_resume_and_get()
   will be called first, which makes the usage count 2, and then
   vfio_pci_core_ioctl_feature() will be invoked.
d. vfio_pci_core_feature_pm() will be called and will go into the
   VFIO_DEVICE_LOW_POWER_STATE_ENTER switch case. platform_pm_engaged
   will be set to true and pm_runtime_put_noidle() will decrement the
   usage count to 1.
e. On return from vfio_device_fops_unl_ioctl(), pm_runtime_put() will
   bring the usage count to 0 and the runtime PM framework will engage
   runtime suspend.
f. pci_pm_runtime_suspend() will be called and invokes the driver
   runtime suspend callback.
g. vfio_pci_core_runtime_suspend() will change the power state to D0
   and do the INTx mask related handling.
h. pci_pm_runtime_suspend() will take care of saving the PCI state and
   all power management handling for D3cold.

D3cold exit:

a. Set vfio_device_feature_power_management::low_power_state =
   VFIO_DEVICE_LOW_POWER_STATE_EXIT and call the VFIO_DEVICE_FEATURE IOCTL.
b. Inside vfio_device_fops_unl_ioctl(), pm_runtime_resume_and_get()
   will be called first, which makes the usage count 1.
c. pci_pm_runtime_resume() will take care of moving the device into D0
   state again and then vfio_pci_core_runtime_resume() will be called.
d. vfio_pci_core_runtime_resume() will do the INTx unmask related
   handling.
e. vfio_pci_core_ioctl_feature() will be invoked.
f. vfio_pci_core_feature_pm() will be called and will go into the
   VFIO_DEVICE_LOW_POWER_STATE_EXIT switch case. platform_pm_engaged and
   power_state_d3 will be cleared and pm_runtime_get_noresume() will
   make the usage count 2.
g. On return from vfio_device_fops_unl_ioctl(), pm_runtime_put() will
   bring the usage count to 1 and the device will remain in D0 state.
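The usage-count arithmetic of the entry/exit flows above can be sketched
with a toy model (illustrative only; the toy_* names are hypothetical
stand-ins for the runtime PM calls named in the steps):

```c
#include <assert.h>

/* Toy walkthrough of the usage count during D3cold entry/exit. */
struct toy_pm { int usage; };

static int toy_low_power_enter_ioctl(struct toy_pm *pm)
{
	pm->usage++;       /* c: pm_runtime_resume_and_get() on IOCTL entry */
	pm->usage--;       /* d: pm_runtime_put_noidle() in the handler */
	pm->usage--;       /* e: pm_runtime_put() on IOCTL return */
	return pm->usage;  /* 0 -> runtime suspend may now run */
}

static int toy_low_power_exit_ioctl(struct toy_pm *pm)
{
	pm->usage++;       /* b: pm_runtime_resume_and_get() resumes to D0 */
	pm->usage++;       /* f: pm_runtime_get_noresume() in the handler */
	pm->usage--;       /* g: pm_runtime_put() on IOCTL return */
	return pm->usage;  /* back to 1 -> device stays in D0 */
}
```

Starting from the usual count of 1 while the device is open, entry drops
the count to 0 (allowing suspend) and exit restores it to 1.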

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c |  11 ++-
 drivers/vfio/pci/vfio_pci_core.c   | 131 ++++++++++++++++++++++++++++-
 include/linux/vfio_pci_core.h      |   1 +
 include/uapi/linux/vfio.h          |  18 ++++
 4 files changed, 159 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index af0ae80ef324..65b1bc9586ab 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -25,6 +25,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/slab.h>
+#include <linux/pm_runtime.h>
 
 #include <linux/vfio_pci_core.h>
 
@@ -1936,16 +1937,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
 ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite)
 {
+	struct device *dev = &vdev->pdev->dev;
 	size_t done = 0;
 	int ret = 0;
 	loff_t pos = *ppos;
 
 	pos &= VFIO_PCI_OFFSET_MASK;
 
+	ret = pm_runtime_resume_and_get(dev);
+	if (ret < 0)
+		return ret;
+
 	while (count) {
 		ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
-		if (ret < 0)
+		if (ret < 0) {
+			pm_runtime_put(dev);
 			return ret;
+		}
 
 		count -= ret;
 		done += ret;
@@ -1953,6 +1961,7 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 		pos += ret;
 	}
 
+	pm_runtime_put(dev);
 	*ppos += done;
 
 	return done;
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 05a68ca9d9e7..beac6e05f97f 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -234,7 +234,14 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	ret = pci_set_power_state(pdev, state);
 
 	if (!ret) {
-		vdev->power_state_d3 = (pdev->current_state >= PCI_D3hot);
+		/*
+		 * If 'platform_pm_engaged' is true then 'power_state_d3' can
+		 * be cleared only when user makes the explicit request to
+		 * move out of low power state by using power management ioctl.
+		 */
+		if (!vdev->platform_pm_engaged)
+			vdev->power_state_d3 =
+				(pdev->current_state >= PCI_D3hot);
 
 		/* D3 might be unsupported via quirk, skip unless in D3 */
 		if (needs_save && pdev->current_state >= PCI_D3hot) {
@@ -266,6 +273,25 @@ static int vfio_pci_core_runtime_suspend(struct device *dev)
 {
 	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
 
+	down_read(&vdev->memory_lock);
+
+	/* 'platform_pm_engaged' will be false if there are no users. */
+	if (!vdev->platform_pm_engaged) {
+		up_read(&vdev->memory_lock);
+		return 0;
+	}
+
+	/*
+	 * The user will move the device into D3hot state first before invoking
+	 * power management ioctl. Move the device into D0 state here and then
+	 * the pci-driver core runtime PM suspend will move the device into
+	 * low power state. Also, for the devices which have NoSoftRst-,
+	 * it will help in restoring the original state (saved locally in
+	 * 'vdev->pm_save').
+	 */
+	vfio_pci_set_power_state(vdev, PCI_D0);
+	up_read(&vdev->memory_lock);
+
 	/*
 	 * If INTx is enabled, then mask INTx before going into runtime
 	 * suspended state and unmask the same in the runtime resume.
@@ -395,6 +421,19 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 
 	/*
 	 * This function can be invoked while the power state is non-D0.
+	 * This non-D0 power state can be with or without runtime PM.
+	 * Increment the usage count corresponding to pm_runtime_put()
+	 * called during setting of 'platform_pm_engaged'. The device will
+	 * wake up if it has already gone into the suspended state. Otherwise,
+	 * the next vfio_pci_set_power_state() will change the
+	 * device power state to D0.
+	 */
+	if (vdev->platform_pm_engaged) {
+		pm_runtime_resume_and_get(&pdev->dev);
+		vdev->platform_pm_engaged = false;
+	}
+
+	/*
 	 * This function calls __pci_reset_function_locked() which internally
 	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
 	 * fail if the power state is non-D0. Also, for the devices which
@@ -1192,6 +1231,80 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
 
+#ifdef CONFIG_PM
+static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
+				    void __user *arg, size_t argsz)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_device_feature_power_management vfio_pm = { 0 };
+	int ret = 0;
+
+	ret = vfio_check_feature(flags, argsz,
+				 VFIO_DEVICE_FEATURE_SET |
+				 VFIO_DEVICE_FEATURE_GET,
+				 sizeof(vfio_pm));
+	if (ret != 1)
+		return ret;
+
+	if (flags & VFIO_DEVICE_FEATURE_GET) {
+		down_read(&vdev->memory_lock);
+		vfio_pm.low_power_state = vdev->platform_pm_engaged ?
+				VFIO_DEVICE_LOW_POWER_STATE_ENTER :
+				VFIO_DEVICE_LOW_POWER_STATE_EXIT;
+		up_read(&vdev->memory_lock);
+		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
+			return -EFAULT;
+		return 0;
+	}
+
+	if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm)))
+		return -EFAULT;
+
+	/*
+	 * The vdev power related fields are protected with memory_lock
+	 * semaphore.
+	 */
+	down_write(&vdev->memory_lock);
+	switch (vfio_pm.low_power_state) {
+	case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
+		if (!vdev->power_state_d3 || vdev->platform_pm_engaged) {
+			ret = -EINVAL;
+			break;
+		}
+
+		vdev->platform_pm_engaged = true;
+
+		/*
+		 * The pm_runtime_put() will be called again while returning
+		 * from the ioctl, after which the device can go into the
+		 * runtime suspended state.
+		 */
+		pm_runtime_put_noidle(&pdev->dev);
+		break;
+
+	case VFIO_DEVICE_LOW_POWER_STATE_EXIT:
+		if (!vdev->platform_pm_engaged) {
+			ret = -EINVAL;
+			break;
+		}
+
+		vdev->platform_pm_engaged = false;
+		vdev->power_state_d3 = false;
+		pm_runtime_get_noresume(&pdev->dev);
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	up_write(&vdev->memory_lock);
+	return ret;
+}
+#endif
+
 static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
 				       void __user *arg, size_t argsz)
 {
@@ -1226,6 +1339,10 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
 		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+#ifdef CONFIG_PM
+	case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT:
+		return vfio_pci_core_feature_pm(device, flags, arg, argsz);
+#endif
 	default:
 		return -ENOTTY;
 	}
@@ -2189,6 +2306,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		goto err_unlock;
 	}
 
+	/*
+	 * Some of the devices in the dev_set can be in the runtime suspended
+	 * state. Increment the usage count for all the devices in the dev_set
+	 * before reset and decrement the same after reset.
+	 */
+	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
+	if (ret)
+		goto err_unlock;
+
 	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
 		/*
 		 * Test whether all the affected devices are contained by the
@@ -2244,6 +2370,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		else
 			mutex_unlock(&cur->vma_lock);
 	}
+
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
+		pm_runtime_put(&cur->pdev->dev);
 err_unlock:
 	mutex_unlock(&dev_set->lock);
 	return ret;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index e84f31e44238..337983a877d6 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -126,6 +126,7 @@ struct vfio_pci_core_device {
 	bool			needs_pm_restore;
 	bool			power_state_d3;
 	bool			pm_intx_masked;
+	bool			platform_pm_engaged;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index fea86061b44e..53ff890dbd27 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -986,6 +986,24 @@ enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
+/*
+ * Use platform-based power management for moving the device into low power
+ * state.  This low power state is device specific.
+ *
+ * For PCI, this low power state is D3cold.  The native PCI power management
+ * does not support the D3cold power state.  For moving the device into D3cold
+ * state, change the PCI state to D3hot with standard configuration registers
+ * and then call this IOCTL to enter the D3cold state.  Similarly, if the
+ * device is in the D3cold state, then call this IOCTL to exit from it.
+ */
+struct vfio_device_feature_power_management {
+#define VFIO_DEVICE_LOW_POWER_STATE_EXIT	0x0
+#define VFIO_DEVICE_LOW_POWER_STATE_ENTER	0x1
+	__u64	low_power_state;
+};
+
+#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 1/8] vfio/pci: Invalidate mmaps and block the access in D3hot power state
  2022-04-25  9:26 ` [PATCH v3 1/8] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
@ 2022-04-26  1:42   ` kernel test robot
  2022-04-26 14:14     ` Bjorn Helgaas
  0 siblings, 1 reply; 41+ messages in thread
From: kernel test robot @ 2022-04-26  1:42 UTC (permalink / raw)
  To: Abhishek Sahu, Alex Williamson, Cornelia Huck, Yishai Hadas,
	Jason Gunthorpe, Shameer Kolothum, Kevin Tian,
	Rafael J . Wysocki
  Cc: kbuild-all, Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm,
	linux-pm, linux-pci, Abhishek Sahu

Hi Abhishek,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on v5.18-rc4]
[also build test WARNING on next-20220422]
[cannot apply to awilliam-vfio/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Abhishek-Sahu/vfio-pci-power-management-changes/20220425-173224
base:    af2d861d4cd2a4da5137f795ee3509e6f944a25b
config: x86_64-rhel-8.3-kselftests (https://download.01.org/0day-ci/archive/20220426/202204260928.TsUAxMD3-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.2.0-20) 11.2.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/intel-lab-lkp/linux/commit/1d48b86a17606c483f200c1734085ab415dbfc3c
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Abhishek-Sahu/vfio-pci-power-management-changes/20220425-173224
        git checkout 1d48b86a17606c483f200c1734085ab415dbfc3c
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/vfio/pci/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> drivers/vfio/pci/vfio_pci_config.c:703:13: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_config.c:703:22: sparse: sparse: restricted pci_power_t degrades to integer

vim +703 drivers/vfio/pci/vfio_pci_config.c

   694	
   695	/*
   696	 * It takes all the required locks to protect the access of power related
   697	 * variables and then invokes vfio_pci_set_power_state().
   698	 */
   699	static void
   700	vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
   701				      pci_power_t state)
   702	{
 > 703		if (state >= PCI_D3hot)
   704			vfio_pci_zap_and_down_write_memory_lock(vdev);
   705		else
   706			down_write(&vdev->memory_lock);
   707	
   708		vfio_pci_set_power_state(vdev, state);
   709		up_write(&vdev->memory_lock);
   710	}
   711	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp


* Re: [PATCH v3 1/8] vfio/pci: Invalidate mmaps and block the access in D3hot power state
  2022-04-26  1:42   ` kernel test robot
@ 2022-04-26 14:14     ` Bjorn Helgaas
  0 siblings, 0 replies; 41+ messages in thread
From: Bjorn Helgaas @ 2022-04-26 14:14 UTC (permalink / raw)
  To: kernel test robot
  Cc: Abhishek Sahu, Alex Williamson, Cornelia Huck, Yishai Hadas,
	Jason Gunthorpe, Shameer Kolothum, Kevin Tian,
	Rafael J . Wysocki, kbuild-all, Max Gurtovoy, linux-kernel, kvm,
	linux-pm, linux-pci

On Tue, Apr 26, 2022 at 09:42:45AM +0800, kernel test robot wrote:
> ...

> sparse warnings: (new ones prefixed by >>)
> >> drivers/vfio/pci/vfio_pci_config.c:703:13: sparse: sparse: restricted pci_power_t degrades to integer
>    drivers/vfio/pci/vfio_pci_config.c:703:22: sparse: sparse: restricted pci_power_t degrades to integer

I dunno what Alex thinks, but we have several of these warnings in
drivers/pci/.  I'd like to get rid of them, but we haven't figured out
a good way yet.  So this might be something we just live with for now.

> vim +703 drivers/vfio/pci/vfio_pci_config.c
> 
>    694	
>    695	/*
>    696	 * It takes all the required locks to protect the access of power related
>    697	 * variables and then invokes vfio_pci_set_power_state().
>    698	 */
>    699	static void
>    700	vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
>    701				      pci_power_t state)
>    702	{
>  > 703		if (state >= PCI_D3hot)
>    704			vfio_pci_zap_and_down_write_memory_lock(vdev);
>    705		else
>    706			down_write(&vdev->memory_lock);
>    707	
>    708		vfio_pci_set_power_state(vdev, state);
>    709		up_write(&vdev->memory_lock);
>    710	}
>    711	
> 
> -- 
> 0-DAY CI Kernel Test Service
> https://01.org/lkp


* Re: [PATCH v3 4/8] vfio/pci: Add support for setting driver data inside core layer
  2022-04-25  9:26 ` [PATCH v3 4/8] vfio/pci: Add support for setting driver data inside core layer Abhishek Sahu
@ 2022-05-03 17:11   ` Alex Williamson
  2022-05-04  0:20     ` Jason Gunthorpe
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2022-05-03 17:11 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Mon, 25 Apr 2022 14:56:11 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> The vfio driver is divided into two layers: core layer (implemented in
> vfio_pci_core.c) and parent driver (For example, vfio_pci, mlx5_vfio_pci,
> hisi_acc_vfio_pci, etc.). All the parent drivers call dev_set_drvdata()
> and assigns its own structure as driver data. Some of the callback
> functions are implemented in the core layer and these callback functions
> provide the reference of 'struct pci_dev' or 'struct device'. Currently,
> we use vfio_device_get_from_dev() which provides reference to the
> vfio_device for a device. But this function follows a long path to extract
> the same. There are a few cases where we don't need to go through this
> long path if we can get it through drvdata.
> 
> This patch moves the setting of drvdata inside the core layer. In the
> current parent driver implementations, 'struct vfio_pci_core_device' is
> the first member, so the pointer to the parent structure and the pointer
> to 'struct vfio_pci_core_device' should be the same.
> 
> struct hisi_acc_vf_core_device {
>     struct vfio_pci_core_device core_device;
>     ...
> };
> 
> struct mlx5vf_pci_core_device {
>     struct vfio_pci_core_device core_device;
>     ...
> };
> 
> The vfio_pci.c uses 'struct vfio_pci_core_device' itself.
> 
> To support getting the drvdata in both the layers, we can put the
> restriction to make 'struct vfio_pci_core_device' as a first member.
> Also, vfio_pci_core_register_device() has this validation which makes sure
> that this prerequisite is always satisfied.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  .../vfio/pci/hisilicon/hisi_acc_vfio_pci.c    |  4 ++--
>  drivers/vfio/pci/mlx5/main.c                  |  3 +--
>  drivers/vfio/pci/vfio_pci.c                   |  4 ++--
>  drivers/vfio/pci/vfio_pci_core.c              | 24 ++++++++++++++++---
>  include/linux/vfio_pci_core.h                 |  7 +++++-
>  5 files changed, 32 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> index 767b5d47631a..c76c09302a8f 100644
> --- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> @@ -1274,11 +1274,11 @@ static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device
>  					  &hisi_acc_vfio_pci_ops);
>  	}
>  
> -	ret = vfio_pci_core_register_device(&hisi_acc_vdev->core_device);
> +	ret = vfio_pci_core_register_device(&hisi_acc_vdev->core_device,
> +					    hisi_acc_vdev);
>  	if (ret)
>  		goto out_free;
>  
> -	dev_set_drvdata(&pdev->dev, hisi_acc_vdev);
>  	return 0;
>  
>  out_free:
> diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
> index bbec5d288fee..8689248f66f3 100644
> --- a/drivers/vfio/pci/mlx5/main.c
> +++ b/drivers/vfio/pci/mlx5/main.c
> @@ -614,11 +614,10 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
>  		}
>  	}
>  
> -	ret = vfio_pci_core_register_device(&mvdev->core_device);
> +	ret = vfio_pci_core_register_device(&mvdev->core_device, mvdev);
>  	if (ret)
>  		goto out_free;
>  
> -	dev_set_drvdata(&pdev->dev, mvdev);
>  	return 0;
>  
>  out_free:
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 2b047469e02f..e0f8027c5cd8 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -151,10 +151,10 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  		return -ENOMEM;
>  	vfio_pci_core_init_device(vdev, pdev, &vfio_pci_ops);
>  
> -	ret = vfio_pci_core_register_device(vdev);
> +	ret = vfio_pci_core_register_device(vdev, vdev);
>  	if (ret)
>  		goto out_free;
> -	dev_set_drvdata(&pdev->dev, vdev);
> +
>  	return 0;
>  
>  out_free:
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 1271728a09db..953ac33b2f5f 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1822,9 +1822,11 @@ void vfio_pci_core_uninit_device(struct vfio_pci_core_device *vdev)
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_uninit_device);
>  
> -int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
> +int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
> +				  void *driver_data)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
> +	struct device *dev = &pdev->dev;
>  	int ret;
>  
>  	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
> @@ -1843,6 +1845,17 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>  		return -EBUSY;
>  	}
>  
> +	/*
> +	 * The 'struct vfio_pci_core_device' should be the first member
> +	 * of the structure referenced by 'driver_data' so that it can be
> +	 * retrieved with dev_get_drvdata() inside vfio-pci core layer.
> +	 */
> +	if ((struct vfio_pci_core_device *)driver_data != vdev) {
> +		pci_warn(pdev, "Invalid driver data\n");
> +		return -EINVAL;
> +	}

It seems a bit odd to me to add a driver_data arg to the function,
which is actually required to point to the same thing as the existing
function arg.  Is this just to codify the requirement?  Maybe others
can suggest alternatives.

We also need to collaborate with Jason's patch:

https://lore.kernel.org/all/0-v2-0f36bcf6ec1e+64d-vfio_get_from_dev_jgg@nvidia.com/

(and maybe others)

If we implement a change like proposed here that vfio-pci-core sets
drvdata then we don't need for each variant driver to implement their
own wrapper around err_handler or err_detected as Jason proposes in the
linked patch.  Thanks,

Alex

> +	dev_set_drvdata(dev, driver_data);
> +
>  	if (pci_is_root_bus(pdev->bus)) {
>  		ret = vfio_assign_device_set(&vdev->vdev, vdev);
>  	} else if (!pci_probe_reset_slot(pdev->slot)) {
> @@ -1856,10 +1869,10 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>  	}
>  
>  	if (ret)
> -		return ret;
> +		goto out_drvdata;
>  	ret = vfio_pci_vf_init(vdev);
>  	if (ret)
> -		return ret;
> +		goto out_drvdata;
>  	ret = vfio_pci_vga_init(vdev);
>  	if (ret)
>  		goto out_vf;
> @@ -1890,6 +1903,8 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>  		vfio_pci_set_power_state(vdev, PCI_D0);
>  out_vf:
>  	vfio_pci_vf_uninit(vdev);
> +out_drvdata:
> +	dev_set_drvdata(dev, NULL);
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_register_device);
> @@ -1897,6 +1912,7 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_register_device);
>  void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
> +	struct device *dev = &pdev->dev;
>  
>  	vfio_pci_core_sriov_configure(pdev, 0);
>  
> @@ -1907,6 +1923,8 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>  
>  	if (!disable_idle_d3)
>  		vfio_pci_set_power_state(vdev, PCI_D0);
> +
> +	dev_set_drvdata(dev, NULL);
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
>  
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 505b2a74a479..3c7d65e68340 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -225,7 +225,12 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev);
>  void vfio_pci_core_init_device(struct vfio_pci_core_device *vdev,
>  			       struct pci_dev *pdev,
>  			       const struct vfio_device_ops *vfio_pci_ops);
> -int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev);
> +/*
> + * The 'struct vfio_pci_core_device' should be the first member
> + * of the structure referenced by 'driver_data'.
> + */
> +int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
> +				  void *driver_data);
>  void vfio_pci_core_uninit_device(struct vfio_pci_core_device *vdev);
>  void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev);
>  int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn);



* Re: [PATCH v3 4/8] vfio/pci: Add support for setting driver data inside core layer
  2022-05-03 17:11   ` Alex Williamson
@ 2022-05-04  0:20     ` Jason Gunthorpe
  2022-05-04 10:32       ` Abhishek Sahu
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Gunthorpe @ 2022-05-04  0:20 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Abhishek Sahu, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, May 03, 2022 at 11:11:24AM -0600, Alex Williamson wrote:
> > @@ -1843,6 +1845,17 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
> >  		return -EBUSY;
> >  	}
> >  
> > +	/*
> > +	 * The 'struct vfio_pci_core_device' should be the first member
> > +	 * of the structure referenced by 'driver_data' so that it can be
> > +	 * retrieved with dev_get_drvdata() inside vfio-pci core layer.
> > +	 */
> > +	if ((struct vfio_pci_core_device *)driver_data != vdev) {
> > +		pci_warn(pdev, "Invalid driver data\n");
> > +		return -EINVAL;
> > +	}
> 
> It seems a bit odd to me to add a driver_data arg to the function,
> which is actually required to point to the same thing as the existing
> function arg.  Is this just to codify the requirement?  Maybe others
> can suggest alternatives.
> 
> We also need to collaborate with Jason's patch:
> 
> https://lore.kernel.org/all/0-v2-0f36bcf6ec1e+64d-vfio_get_from_dev_jgg@nvidia.com/
> 
> (and maybe others)
> 
> If we implement a change like proposed here that vfio-pci-core sets
> drvdata then we don't need for each variant driver to implement their
> own wrapper around err_handler or err_detected as Jason proposes in the
> linked patch.  Thanks,

Oh, I forgot about this series completely.

Yes, we need to pick a method, either drvdata always points at the
core struct, or we wrapper the core functions.

I have an independent version of the above patch that uses the
drvdata, but I chucked it because it was unnecessary for just a couple
of AER functions. 

We should probably go back to it though if we are adding more
functions, as the wrapping is a bit repetitive. I'll go and respin
that series then. Abhishek can base on top of it.

My approach was more type-sane though:

commit 12ba94a72d7aa134af8752d6ff78193acdac93ae
Author: Jason Gunthorpe <jgg@ziepe.ca>
Date:   Tue Mar 29 16:32:32 2022 -0300

    vfio/pci: Have all VFIO PCI drivers store the vfio_pci_core_device in drvdata
    
    Having a consistent pointer in the drvdata will allow the next patch to
    make use of the drvdata from some of the core code helpers.
    
    Use a WARN_ON inside vfio_pci_core_unregister_device() to detect drivers
    that miss this.
    
    Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>

diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
index 767b5d47631a49..665691967a030c 100644
--- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
+++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
@@ -337,6 +337,14 @@ static int vf_qm_cache_wb(struct hisi_qm *qm)
 	return 0;
 }
 
+static struct hisi_acc_vf_core_device *hssi_acc_drvdata(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+
+	return container_of(core_device, struct hisi_acc_vf_core_device,
+			    core_device);
+}
+
 static void vf_qm_fun_reset(struct hisi_acc_vf_core_device *hisi_acc_vdev,
 			    struct hisi_qm *qm)
 {
@@ -962,7 +970,7 @@ hisi_acc_vfio_pci_get_device_state(struct vfio_device *vdev,
 
 static void hisi_acc_vf_pci_aer_reset_done(struct pci_dev *pdev)
 {
-	struct hisi_acc_vf_core_device *hisi_acc_vdev = dev_get_drvdata(&pdev->dev);
+	struct hisi_acc_vf_core_device *hisi_acc_vdev = hssi_acc_drvdata(pdev);
 
 	if (hisi_acc_vdev->core_device.vdev.migration_flags !=
 				VFIO_MIGRATION_STOP_COPY)
@@ -1278,7 +1286,7 @@ static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device
 	if (ret)
 		goto out_free;
 
-	dev_set_drvdata(&pdev->dev, hisi_acc_vdev);
+	dev_set_drvdata(&pdev->dev, &hisi_acc_vdev->core_device);
 	return 0;
 
 out_free:
@@ -1289,7 +1297,7 @@ static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device
 
 static void hisi_acc_vfio_pci_remove(struct pci_dev *pdev)
 {
-	struct hisi_acc_vf_core_device *hisi_acc_vdev = dev_get_drvdata(&pdev->dev);
+	struct hisi_acc_vf_core_device *hisi_acc_vdev = hssi_acc_drvdata(pdev);
 
 	vfio_pci_core_unregister_device(&hisi_acc_vdev->core_device);
 	vfio_pci_core_uninit_device(&hisi_acc_vdev->core_device);
diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
index bbec5d288fee97..3391f965abd9f0 100644
--- a/drivers/vfio/pci/mlx5/main.c
+++ b/drivers/vfio/pci/mlx5/main.c
@@ -39,6 +39,14 @@ struct mlx5vf_pci_core_device {
 	struct mlx5_vf_migration_file *saving_migf;
 };
 
+static struct mlx5vf_pci_core_device *mlx5vf_drvdata(struct pci_dev *pdev)
+{
+	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
+
+	return container_of(core_device, struct mlx5vf_pci_core_device,
+			    core_device);
+}
+
 static struct page *
 mlx5vf_get_migration_page(struct mlx5_vf_migration_file *migf,
 			  unsigned long offset)
@@ -505,7 +513,7 @@ static int mlx5vf_pci_get_device_state(struct vfio_device *vdev,
 
 static void mlx5vf_pci_aer_reset_done(struct pci_dev *pdev)
 {
-	struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);
+	struct mlx5vf_pci_core_device *mvdev = mlx5vf_drvdata(pdev);
 
 	if (!mvdev->migrate_cap)
 		return;
@@ -618,7 +626,7 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
 	if (ret)
 		goto out_free;
 
-	dev_set_drvdata(&pdev->dev, mvdev);
+	dev_set_drvdata(&pdev->dev, &mvdev->core_device);
 	return 0;
 
 out_free:
@@ -629,7 +637,7 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
 
 static void mlx5vf_pci_remove(struct pci_dev *pdev)
 {
-	struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);
+	struct mlx5vf_pci_core_device *mvdev = mlx5vf_drvdata(pdev);
 
 	vfio_pci_core_unregister_device(&mvdev->core_device);
 	vfio_pci_core_uninit_device(&mvdev->core_device);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 06b6f3594a1316..53ad39d617653d 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -262,6 +262,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 	u16 cmd;
 	u8 msix_pos;
 
+	/* Drivers must set the vfio_pci_core_device to their drvdata */
+	if (WARN_ON(vdev != dev_get_drvdata(&vdev->pdev->dev)))
+		return -EINVAL;
+
 	vfio_pci_set_power_state(vdev, PCI_D0);
 
 	/* Don't allow our initial saved state to include busmaster */


* Re: [PATCH v3 4/8] vfio/pci: Add support for setting driver data inside core layer
  2022-05-04  0:20     ` Jason Gunthorpe
@ 2022-05-04 10:32       ` Abhishek Sahu
  0 siblings, 0 replies; 41+ messages in thread
From: Abhishek Sahu @ 2022-05-04 10:32 UTC (permalink / raw)
  To: Jason Gunthorpe, Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas, linux-kernel,
	kvm, linux-pm, linux-pci

On 5/4/2022 5:50 AM, Jason Gunthorpe wrote:
> On Tue, May 03, 2022 at 11:11:24AM -0600, Alex Williamson wrote:
>>> @@ -1843,6 +1845,17 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>>  		return -EBUSY;
>>>  	}
>>>  
>>> +	/*
>>> +	 * The 'struct vfio_pci_core_device' should be the first member
>>> +	 * of the structure referenced by 'driver_data' so that it can be
>>> +	 * retrieved with dev_get_drvdata() inside vfio-pci core layer.
>>> +	 */
>>> +	if ((struct vfio_pci_core_device *)driver_data != vdev) {
>>> +		pci_warn(pdev, "Invalid driver data\n");
>>> +		return -EINVAL;
>>> +	}
>>
>> It seems a bit odd to me to add a driver_data arg to the function,
>> which is actually required to point to the same thing as the existing
>> function arg.  Is this just to codify the requirement?  Maybe others
>> can suggest alternatives.

 Yes. It was mainly for enforcing this requirement; otherwise, if someone
 tries to add a new driver in the future (or makes changes in the existing
 structure) and does not follow this convention, then the pointer will
 be wrong.

>>
>> We also need to collaborate with Jason's patch:
>>
>> https://lore.kernel.org/all/0-v2-0f36bcf6ec1e+64d-vfio_get_from_dev_jgg@nvidia.com/
>>
>> (and maybe others)
>>
>> If we implement a change like proposed here that vfio-pci-core sets
>> drvdata then we don't need for each variant driver to implement their
>> own wrapper around err_handler or err_detected as Jason proposes in the
>> linked patch.  Thanks,
> 
> Oh, I forgot about this series completely.
> 
> Yes, we need to pick a method, either drvdata always points at the
> core struct, or we wrapper the core functions.
> 
> I have an independent version of the above patch that uses the
> drvdata, but I chucked it because it was unnecessary for just a couple
> of AER functions. 
> 
> We should probably go back to it though if we are adding more
> functions, as the wrapping is a bit repetitive. I'll go and respin
> that series then. Abhishek can base on top of it.
> 

 Sure. I will rebase on top of Jason patch series.

> My approach was more type-sane though:
> 
 This is also fine.

 Initially I wanted to do the same, but it requires a new wrapper
 function for each driver, so I implemented it in the core layer.
 
 Thanks,
 Abhishek

> commit 12ba94a72d7aa134af8752d6ff78193acdac93ae
> Author: Jason Gunthorpe <jgg@ziepe.ca>
> Date:   Tue Mar 29 16:32:32 2022 -0300
> 
>     vfio/pci: Have all VFIO PCI drivers store the vfio_pci_core_device in drvdata
>     
>     Having a consistent pointer in the drvdata will allow the next patch to
>     make use of the drvdata from some of the core code helpers.
>     
>     Use a WARN_ON inside vfio_pci_core_unregister_device() to detect drivers
>     that miss this.
>     
>     Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> 
> diff --git a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> index 767b5d47631a49..665691967a030c 100644
> --- a/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> +++ b/drivers/vfio/pci/hisilicon/hisi_acc_vfio_pci.c
> @@ -337,6 +337,14 @@ static int vf_qm_cache_wb(struct hisi_qm *qm)
>  	return 0;
>  }
>  
> +static struct hisi_acc_vf_core_device *hssi_acc_drvdata(struct pci_dev *pdev)
> +{
> +	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
> +
> +	return container_of(core_device, struct hisi_acc_vf_core_device,
> +			    core_device);
> +}
> +
>  static void vf_qm_fun_reset(struct hisi_acc_vf_core_device *hisi_acc_vdev,
>  			    struct hisi_qm *qm)
>  {
> @@ -962,7 +970,7 @@ hisi_acc_vfio_pci_get_device_state(struct vfio_device *vdev,
>  
>  static void hisi_acc_vf_pci_aer_reset_done(struct pci_dev *pdev)
>  {
> -	struct hisi_acc_vf_core_device *hisi_acc_vdev = dev_get_drvdata(&pdev->dev);
> +	struct hisi_acc_vf_core_device *hisi_acc_vdev = hssi_acc_drvdata(pdev);
>  
>  	if (hisi_acc_vdev->core_device.vdev.migration_flags !=
>  				VFIO_MIGRATION_STOP_COPY)
> @@ -1278,7 +1286,7 @@ static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device
>  	if (ret)
>  		goto out_free;
>  
> -	dev_set_drvdata(&pdev->dev, hisi_acc_vdev);
> +	dev_set_drvdata(&pdev->dev, &hisi_acc_vdev->core_device);
>  	return 0;
>  
>  out_free:
> @@ -1289,7 +1297,7 @@ static int hisi_acc_vfio_pci_probe(struct pci_dev *pdev, const struct pci_device
>  
>  static void hisi_acc_vfio_pci_remove(struct pci_dev *pdev)
>  {
> -	struct hisi_acc_vf_core_device *hisi_acc_vdev = dev_get_drvdata(&pdev->dev);
> +	struct hisi_acc_vf_core_device *hisi_acc_vdev = hssi_acc_drvdata(pdev);
>  
>  	vfio_pci_core_unregister_device(&hisi_acc_vdev->core_device);
>  	vfio_pci_core_uninit_device(&hisi_acc_vdev->core_device);
> diff --git a/drivers/vfio/pci/mlx5/main.c b/drivers/vfio/pci/mlx5/main.c
> index bbec5d288fee97..3391f965abd9f0 100644
> --- a/drivers/vfio/pci/mlx5/main.c
> +++ b/drivers/vfio/pci/mlx5/main.c
> @@ -39,6 +39,14 @@ struct mlx5vf_pci_core_device {
>  	struct mlx5_vf_migration_file *saving_migf;
>  };
>  
> +static struct mlx5vf_pci_core_device *mlx5vf_drvdata(struct pci_dev *pdev)
> +{
> +	struct vfio_pci_core_device *core_device = dev_get_drvdata(&pdev->dev);
> +
> +	return container_of(core_device, struct mlx5vf_pci_core_device,
> +			    core_device);
> +}
> +
>  static struct page *
>  mlx5vf_get_migration_page(struct mlx5_vf_migration_file *migf,
>  			  unsigned long offset)
> @@ -505,7 +513,7 @@ static int mlx5vf_pci_get_device_state(struct vfio_device *vdev,
>  
>  static void mlx5vf_pci_aer_reset_done(struct pci_dev *pdev)
>  {
> -	struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);
> +	struct mlx5vf_pci_core_device *mvdev = mlx5vf_drvdata(pdev);
>  
>  	if (!mvdev->migrate_cap)
>  		return;
> @@ -618,7 +626,7 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
>  	if (ret)
>  		goto out_free;
>  
> -	dev_set_drvdata(&pdev->dev, mvdev);
> +	dev_set_drvdata(&pdev->dev, &mvdev->core_device);
>  	return 0;
>  
>  out_free:
> @@ -629,7 +637,7 @@ static int mlx5vf_pci_probe(struct pci_dev *pdev,
>  
>  static void mlx5vf_pci_remove(struct pci_dev *pdev)
>  {
> -	struct mlx5vf_pci_core_device *mvdev = dev_get_drvdata(&pdev->dev);
> +	struct mlx5vf_pci_core_device *mvdev = mlx5vf_drvdata(pdev);
>  
>  	vfio_pci_core_unregister_device(&mvdev->core_device);
>  	vfio_pci_core_uninit_device(&mvdev->core_device);
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 06b6f3594a1316..53ad39d617653d 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -262,6 +262,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  	u16 cmd;
>  	u8 msix_pos;
>  
> +	/* Drivers must set the vfio_pci_core_device to their drvdata */
> +	if (WARN_ON(vdev != dev_get_drvdata(&vdev->pdev->dev)))
> +		return -EINVAL;
> +
>  	vfio_pci_set_power_state(vdev, PCI_D0);
>  
>  	/* Don't allow our initial saved state to include busmaster */



* Re: [PATCH v3 5/8] vfio/pci: Enable runtime PM for vfio_pci_core based drivers
  2022-04-25  9:26 ` [PATCH v3 5/8] vfio/pci: Enable runtime PM for vfio_pci_core based drivers Abhishek Sahu
@ 2022-05-04 19:42   ` Alex Williamson
  2022-05-05  9:07     ` Abhishek Sahu
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2022-05-04 19:42 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Mon, 25 Apr 2022 14:56:12 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> Currently, there is very limited power management support
> available in the upstream vfio_pci_core based drivers. If there
> are no users of the device, then the PCI device is moved into the
> D3hot state by writing directly into the PCI PM registers. The D3hot
> state helps in saving power, but zero power consumption can only be
> achieved in the D3cold state. The D3cold state cannot be reached
> with native PCI PM alone; it requires interaction with platform
> firmware, which is system-specific. To go into low power states
> (including D3cold), the runtime PM framework can be used, which
> internally interacts with the PCI core and platform firmware and
> puts the device into the lowest possible D-state.
> 
> This patch registers vfio_pci_core based drivers with the
> runtime PM framework.
> 
> 1. The PCI core framework takes care of most of the runtime PM
>    related things. To enable runtime PM, the PCI driver needs to
>    decrement the usage count and needs to provide at least a
>    'struct dev_pm_ops'. The runtime suspend/resume callbacks are
>    optional and needed only if extra handling is required. Since there
>    are now multiple vfio_pci_core based drivers, instead of assigning
>    the 'struct dev_pm_ops' in each individual parent driver,
>    vfio_pci_core itself assigns the 'struct dev_pm_ops'. There are
>    other drivers where the 'struct dev_pm_ops' is assigned inside a
>    core layer (for example, wlcore_probe() and some sound drivers).
> 
> 2. This patch provides the stub implementation of 'struct dev_pm_ops'.
>    The subsequent patch will provide the runtime suspend/resume
>    callbacks. All the config state saving, and PCI power management
>    related things will be done by PCI core framework itself inside its
>    runtime suspend/resume callbacks (pci_pm_runtime_suspend() and
>    pci_pm_runtime_resume()).
> 
> 3. Inside pci_reset_bus(), all the devices in the dev_set need to be
>    runtime resumed. vfio_pci_dev_set_pm_runtime_get() will take
>    care of the runtime resume and its error handling.
> 
> 4. Inside vfio_pci_core_disable(), the device usage count always needs
>    to be decremented which was incremented in vfio_pci_core_enable().
> 
> 5. Since the runtime PM framework provides the same functionality,
>    directly writing into the PCI PM config register can be replaced
>    with the use of the runtime PM routines. The use of runtime PM can
>    also achieve additional power savings.
> 
>    In the systems which do not support D3cold,
> 
>    With the existing implementation:
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3hot
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D0
> 
>    With runtime PM:
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3hot
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D3hot
> 
>    So, with runtime PM, the upstream bridge or root port will also go
>    into a lower power state, which is not possible with the existing
>    implementation.
> 
>    In the systems which support D3cold,
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3hot
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D0
> 
>    With runtime PM:
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3cold
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D3cold
> 
>    So, with runtime PM, both the PCI device and upstream bridge will
>    go into D3cold state.
> 
> 6. If the 'disable_idle_d3' module parameter is set, runtime PM is
>    still enabled, but in this case, the usage count should not be
>    decremented.
> 
> 7. The vfio_pci_dev_set_try_reset() return value is now unused, so
>    its return type can be changed to void.
> 
> 8. Use the runtime PM APIs in vfio_pci_core_sriov_configure().
>    To prevent any runtime usage count mismatch, pci_num_vf() is
>    checked explicitly during disable.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 169 +++++++++++++++++++++----------
>  1 file changed, 114 insertions(+), 55 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 953ac33b2f5f..aee5e0cd6137 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -156,7 +156,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>  }
>  
>  struct vfio_pci_group_info;
> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  				      struct vfio_pci_group_info *groups);
>  
> @@ -261,6 +261,19 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	return ret;
>  }
>  
> +#ifdef CONFIG_PM
> +/*
> + * The dev_pm_ops needs to be provided to make pci-driver runtime PM work,
> + * so use a structure without any callbacks.
> + *
> + * The pci-driver core runtime PM routines always save the device state
> + * before going into suspended state. If the device is going into low power
> + * state with only the runtime PM ops, then no explicit handling is needed
> + * for the devices which have NoSoftRst-.
> + */
> +static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
> +#endif
> +
>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
> @@ -268,21 +281,23 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  	u16 cmd;
>  	u8 msix_pos;
>  
> -	vfio_pci_set_power_state(vdev, PCI_D0);
> +	if (!disable_idle_d3) {
> +		ret = pm_runtime_resume_and_get(&pdev->dev);
> +		if (ret < 0)
> +			return ret;
> +	}
>  
>  	/* Don't allow our initial saved state to include busmaster */
>  	pci_clear_master(pdev);
>  
>  	ret = pci_enable_device(pdev);
>  	if (ret)
> -		return ret;
> +		goto out_power;
>  
>  	/* If reset fails because of the device lock, fail this path entirely */
>  	ret = pci_try_reset_function(pdev);
> -	if (ret == -EAGAIN) {
> -		pci_disable_device(pdev);
> -		return ret;
> -	}
> +	if (ret == -EAGAIN)
> +		goto out_disable_device;
>  
>  	vdev->reset_works = !ret;
>  	pci_save_state(pdev);
> @@ -306,12 +321,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  	}
>  
>  	ret = vfio_config_init(vdev);
> -	if (ret) {
> -		kfree(vdev->pci_saved_state);
> -		vdev->pci_saved_state = NULL;
> -		pci_disable_device(pdev);
> -		return ret;
> -	}
> +	if (ret)
> +		goto out_free_state;
>  
>  	msix_pos = pdev->msix_cap;
>  	if (msix_pos) {
> @@ -332,6 +343,16 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  
>  
>  	return 0;
> +
> +out_free_state:
> +	kfree(vdev->pci_saved_state);
> +	vdev->pci_saved_state = NULL;
> +out_disable_device:
> +	pci_disable_device(pdev);
> +out_power:
> +	if (!disable_idle_d3)
> +		pm_runtime_put(&pdev->dev);
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_enable);
>  
> @@ -439,8 +460,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  out:
>  	pci_disable_device(pdev);
>  
> -	if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
> -		vfio_pci_set_power_state(vdev, PCI_D3hot);
> +	vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
> +
> +	/* Put the pm-runtime usage counter acquired during enable */
> +	if (!disable_idle_d3)
> +		pm_runtime_put(&pdev->dev);
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
>  
> @@ -1879,19 +1903,24 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
>  
>  	vfio_pci_probe_power_state(vdev);
>  
> -	if (!disable_idle_d3) {
> -		/*
> -		 * pci-core sets the device power state to an unknown value at
> -		 * bootup and after being removed from a driver.  The only
> -		 * transition it allows from this unknown state is to D0, which
> -		 * typically happens when a driver calls pci_enable_device().
> -		 * We're not ready to enable the device yet, but we do want to
> -		 * be able to get to D3.  Therefore first do a D0 transition
> -		 * before going to D3.
> -		 */
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> -		vfio_pci_set_power_state(vdev, PCI_D3hot);
> -	}
> +	/*
> +	 * pci-core sets the device power state to an unknown value at
> +	 * bootup and after being removed from a driver.  The only
> +	 * transition it allows from this unknown state is to D0, which
> +	 * typically happens when a driver calls pci_enable_device().
> +	 * We're not ready to enable the device yet, but we do want to
> +	 * be able to get to D3.  Therefore first do a D0 transition
> +	 * before enabling runtime PM.
> +	 */
> +	vfio_pci_set_power_state(vdev, PCI_D0);
> +
> +#if defined(CONFIG_PM)
> +	dev->driver->pm = &vfio_pci_core_pm_ops;
> +#endif
> +
> +	pm_runtime_allow(dev);
> +	if (!disable_idle_d3)
> +		pm_runtime_put(dev);
>  
>  	ret = vfio_register_group_dev(&vdev->vdev);
>  	if (ret)
> @@ -1900,7 +1929,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
>  
>  out_power:
>  	if (!disable_idle_d3)
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> +		pm_runtime_get_noresume(dev);
> +
> +	pm_runtime_forbid(dev);
>  out_vf:
>  	vfio_pci_vf_uninit(vdev);
>  out_drvdata:
> @@ -1922,8 +1953,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>  	vfio_pci_vga_uninit(vdev);
>  
>  	if (!disable_idle_d3)
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> +		pm_runtime_get_noresume(dev);
>  
> +	pm_runtime_forbid(dev);
>  	dev_set_drvdata(dev, NULL);
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
> @@ -1984,18 +2016,26 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
>  
>  		/*
>  		 * The PF power state should always be higher than the VF power
> -		 * state. If PF is in the low power state, then change the
> -		 * power state to D0 first before enabling SR-IOV.
> +		 * state. If PF is in the runtime suspended state, then resume
> +		 * it first before enabling SR-IOV.
>  		 */
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> -		ret = pci_enable_sriov(pdev, nr_virtfn);
> +		ret = pm_runtime_resume_and_get(&pdev->dev);
>  		if (ret)
>  			goto out_del;
> +
> +		ret = pci_enable_sriov(pdev, nr_virtfn);
> +		if (ret) {
> +			pm_runtime_put(&pdev->dev);
> +			goto out_del;
> +		}
>  		ret = nr_virtfn;
>  		goto out_put;
>  	}
>  
> -	pci_disable_sriov(pdev);
> +	if (pci_num_vf(pdev)) {
> +		pci_disable_sriov(pdev);
> +		pm_runtime_put(&pdev->dev);
> +	}
>  
>  out_del:
>  	mutex_lock(&vfio_pci_sriov_pfs_mutex);
> @@ -2072,6 +2112,30 @@ vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set)
>  	return pdev;
>  }
>  
> +static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
> +{
> +	struct vfio_pci_core_device *cur_pm;
> +	struct vfio_pci_core_device *cur;
> +	int ret = 0;
> +
> +	list_for_each_entry(cur_pm, &dev_set->device_list, vdev.dev_set_list) {
> +		ret = pm_runtime_resume_and_get(&cur_pm->pdev->dev);
> +		if (ret < 0)
> +			break;
> +	}
> +
> +	if (!ret)
> +		return 0;
> +
> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
> +		if (cur == cur_pm)
> +			break;
> +		pm_runtime_put(&cur->pdev->dev);
> +	}
> +
> +	return ret;
> +}

The above works, but maybe could be a little cleaner taking advantage
of list_for_each_entry_continue_reverse as:

{
	struct vfio_pci_core_device *cur;
	int ret;

	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
		ret = pm_runtime_resume_and_get(&cur->pdev->dev);
		if (ret)
			goto unwind;
	}

	return 0;

unwind:
	list_for_each_entry_continue_reverse(cur, &dev_set->device_list, vdev.dev_set_list)
		pm_runtime_put(&cur->pdev->dev);

	return ret;
}

Thanks,
Alex

> +
>  /*
>   * We need to get memory_lock for each device, but devices can share mmap_lock,
>   * therefore we need to zap and hold the vma_lock for each device, and only then
> @@ -2178,43 +2242,38 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
>   *  - At least one of the affected devices is marked dirty via
>   *    needs_reset (such as by lack of FLR support)
>   * Then attempt to perform that bus or slot reset.
> - * Returns true if the dev_set was reset.
>   */
> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>  {
>  	struct vfio_pci_core_device *cur;
>  	struct pci_dev *pdev;
> -	int ret;
> +	bool reset_done = false;
>  
>  	if (!vfio_pci_dev_set_needs_reset(dev_set))
> -		return false;
> +		return;
>  
>  	pdev = vfio_pci_dev_set_resettable(dev_set);
>  	if (!pdev)
> -		return false;
> +		return;
>  
>  	/*
> -	 * The pci_reset_bus() will reset all the devices in the bus.
> -	 * The power state can be non-D0 for some of the devices in the bus.
> -	 * For these devices, the pci_reset_bus() will internally set
> -	 * the power state to D0 without vfio driver involvement.
> -	 * For the devices which have NoSoftRst-, the reset function can
> -	 * cause the PCI config space reset without restoring the original
> -	 * state (saved locally in 'vdev->pm_save').
> +	 * Some of the devices in the bus can be in the runtime suspended
> +	 * state. Increment the usage count for all the devices in the dev_set
> +	 * before reset and decrement the same after reset.
>  	 */
> -	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> -		vfio_pci_set_power_state(cur, PCI_D0);
> +	if (!disable_idle_d3 && vfio_pci_dev_set_pm_runtime_get(dev_set))
> +		return;
>  
> -	ret = pci_reset_bus(pdev);
> -	if (ret)
> -		return false;
> +	if (!pci_reset_bus(pdev))
> +		reset_done = true;
>  
>  	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
> -		cur->needs_reset = false;
> +		if (reset_done)
> +			cur->needs_reset = false;
> +
>  		if (!disable_idle_d3)
> -			vfio_pci_set_power_state(cur, PCI_D3hot);
> +			pm_runtime_put(&cur->pdev->dev);
>  	}
> -	return true;
>  }
>  
>  void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,



* Re: [PATCH v3 6/8] vfio: Invoke runtime PM API for IOCTL request
  2022-04-25  9:26 ` [PATCH v3 6/8] vfio: Invoke runtime PM API for IOCTL request Abhishek Sahu
@ 2022-05-04 19:42   ` Alex Williamson
  2022-05-05  9:40     ` Abhishek Sahu
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2022-05-04 19:42 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Mon, 25 Apr 2022 14:56:13 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> The vfio/pci driver will have runtime power management support where the
> user can put the device into a low power state and then the PCI device
> can go into the D3cold state. If the device is in a low power state and
> the user issues any IOCTL, then the device should be moved out of the
> low power state first. Once the IOCTL is serviced, it can go into the
> low power state again. The runtime PM framework manages this with the
> help of a usage count. One option was to add the runtime PM related
> APIs inside the vfio/pci driver, but some IOCTLs (like
> VFIO_DEVICE_FEATURE) can follow a different path, and more IOCTLs can
> be added in the future. Also, runtime PM is currently being added only
> for the vfio/pci based driver variants, but other vfio based drivers
> can use the same mechanism in the future. So, this patch adds the
> runtime PM related API calls in the top-level IOCTL function itself.
> 
> For the vfio drivers which currently do not have runtime power
> management support, the runtime PM APIs won't be invoked. Currently,
> the runtime PM APIs will be invoked to increment and decrement the
> usage count only for the vfio/pci based drivers. Keeping the usage
> count incremented while servicing an IOCTL ensures that the user
> cannot put the device into a low power state while any other IOCTL is
> being serviced in parallel.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/vfio.c | 44 +++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 41 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index a4555014bd1e..4e65a127744e 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
>  #include <linux/vfio.h>
>  #include <linux/wait.h>
>  #include <linux/sched/signal.h>
> +#include <linux/pm_runtime.h>
>  #include "vfio.h"
>  
>  #define DRIVER_VERSION	"0.3"
> @@ -1536,6 +1537,30 @@ static const struct file_operations vfio_group_fops = {
>  	.release	= vfio_group_fops_release,
>  };
>  
> +/*
> + * Wrapper around pm_runtime_resume_and_get().
> + * Return 0, if driver power management callbacks are not present i.e. the driver is not

Mind the gratuitous long comment line here.

> + * using runtime power management.
> + * Return 1 upon success, otherwise -errno

Changing semantics vs the thing we're wrapping, why not provide a
wrapper for the `put` as well to avoid?  The only cases where we return
zero are just as easy to detect on the other side.

> + */
> +static inline int vfio_device_pm_runtime_get(struct device *dev)

Given some of Jason's recent series, this should probably just accept a
vfio_device.

> +{
> +#ifdef CONFIG_PM
> +	int ret;
> +
> +	if (!dev->driver || !dev->driver->pm)
> +		return 0;
> +
> +	ret = pm_runtime_resume_and_get(dev);
> +	if (ret < 0)
> +		return ret;
> +
> +	return 1;
> +#else
> +	return 0;
> +#endif
> +}
> +
>  /*
>   * VFIO Device fd
>   */
> @@ -1845,15 +1870,28 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
>  				       unsigned int cmd, unsigned long arg)
>  {
>  	struct vfio_device *device = filep->private_data;
> +	int pm_ret, ret = 0;
> +
> +	pm_ret = vfio_device_pm_runtime_get(device->dev);
> +	if (pm_ret < 0)
> +		return pm_ret;

I wonder if we might simply want to mask pm errors behind -EIO, maybe
with a rate limited dev_info().  My concern would be that we might mask
errnos that userspace has come to expect for certain ioctls.  Thanks,

Alex

>  
>  	switch (cmd) {
>  	case VFIO_DEVICE_FEATURE:
> -		return vfio_ioctl_device_feature(device, (void __user *)arg);
> +		ret = vfio_ioctl_device_feature(device, (void __user *)arg);
> +		break;
>  	default:
>  		if (unlikely(!device->ops->ioctl))
> -			return -EINVAL;
> -		return device->ops->ioctl(device, cmd, arg);
> +			ret = -EINVAL;
> +		else
> +			ret = device->ops->ioctl(device, cmd, arg);
> +		break;
>  	}
> +
> +	if (pm_ret)
> +		pm_runtime_put(device->dev);
> +
> +	return ret;
>  }
>  
>  static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,



* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-04-25  9:26 ` [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
@ 2022-05-04 19:45   ` Alex Williamson
  2022-05-05 12:16     ` Abhishek Sahu
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2022-05-04 19:45 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Mon, 25 Apr 2022 14:56:15 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> Currently, if runtime power management is enabled for a vfio-pci
> based device in the guest OS, then the guest OS will do a register
> write to the PCI_PM_CTRL register. This write request will be handled
> in vfio_pm_config_write(), where it will do the actual register write
> of PCI_PM_CTRL. With this, at most the D3hot state can be achieved
> for low power. If we can use the runtime PM framework, then we can
> achieve the D3cold state, which will help in saving maximum power.
> 
> 1. Since the D3cold state can't be achieved by writing the PCI
>    standard PM config registers, this patch adds a new feature in the
>    existing VFIO_DEVICE_FEATURE IOCTL. This IOCTL can be used
>    to change the PCI device from the D3hot to D3cold state and
>    then from D3cold to D0 state. The device feature uses the low power
>    term instead of D3cold so that if other vfio drivers want to
>    implement low power support, the same IOCTL can be used.

How does this enable you to handle the full-off vs memory-refresh modes
for NVIDIA GPUs?

The feature ioctl supports a probe, but here the probe only indicates
that the ioctl is available, not what degree of low power support is
available.  Even if the host doesn't support d3cold for the device, we
can still achieve root port d3hot, but can we provide further
capability info to the user?
 
> 2. The hypervisors can implement virtual ACPI methods. For
>    example, in a Linux guest OS, if the PCI device's ACPI node has _PR3
>    and _PR0 power resources with _ON/_OFF methods, then the guest OS
>    makes the _OFF call during the D3cold transition and then _ON during
>    the D0 transition. The hypervisor can trap these virtual ACPI calls
>    and then issue the D3cold related IOCTL to the vfio driver.
> 
> 3. The vfio driver uses runtime PM framework to achieve the
>    D3cold state. For the D3cold transition, decrement the usage count and
>    for the D0 transition, increment the usage count.
> 
> 4. For D3cold, the device's current power state should be D3hot.
>    Then during runtime suspend, pci_platform_power_transition() is
>    required for the D3cold state. If the D3cold state is not supported,
>    then the device will remain in the D3hot state. But with runtime PM,
>    the root port can now also go into the suspended state.

Why do we create this requirement for the device to be in d3hot prior
to entering low power when our pm ops suspend function wakes the device
do d0?

> 5. On most systems, D3cold is supported at the root port level.
>    So, when the root port transitions to the D3cold state, the vfio
>    PCI device will go from D3hot to D3cold during its runtime suspend.
>    If the root port does not support D3cold, then the root port will
>    go into the D3hot state.
> 
> 6. The runtime suspend callback can now happen in 2 cases: when there
>    are no users of the vfio device, and when the user has initiated
>    D3cold. The 'platform_pm_engaged' flag can help to distinguish
>    between these 2 cases.

If this were the only use case we could rely on vfio_device.open_count
instead.  I don't think it is though.
 
> 7. In D3cold, all kinds of BAR access need to be disabled,
>    as in D3hot. Additionally, the config space will also be disabled in
>    the D3cold state. To prevent config space access in the D3cold
>    state, increment the runtime PM usage count before doing any config
>    space access.

Or we could actually prevent access to config space rather than waking
the device for the access.  Addressed in further comment below.
 
> 8. If the user has engaged low power entry through the IOCTL, then the
>    user should do a low power exit first. The user can issue config
>    accesses or IOCTLs after low power entry. We could add an explicit
>    error check, but since we are already waking up the device, the
>    IOCTL and config accesses can be fulfilled. But 'power_state_d3'
>    won't be cleared without issuing a low power exit, so all BAR access
>    will still return errors until the user does the low power exit.

The fact that power_state_d3 no longer tracks the device power state
when platform_pm_engaged is set is a confusing discontinuity.

> 9. Since multiple layers are involved, so following is the high level
>    code flow for D3cold entry and exit.
> 
> D3cold entry:
> 
> a. The user puts the PCI device into D3hot by writing into the standard
>    config register (vfio_pm_config_write() -> vfio_lock_and_set_power_state() ->
>    vfio_pci_set_power_state()). The device power state will be D3hot and
>    power_state_d3 will be true.
> b. Set vfio_device_feature_power_management::low_power_state =
>    VFIO_DEVICE_LOW_POWER_STATE_ENTER and call the VFIO_DEVICE_FEATURE IOCTL.
> c. Inside vfio_device_fops_unl_ioctl(), pm_runtime_resume_and_get()
>    will be called first, which will make the usage count 2, and then
>    vfio_pci_core_ioctl_feature() will be invoked.
> d. vfio_pci_core_feature_pm() will be called and it will go into the
>    VFIO_DEVICE_LOW_POWER_STATE_ENTER switch case. platform_pm_engaged
>    will be set to true and pm_runtime_put_noidle() will decrement the
>    usage count to 1.
> e. Inside vfio_device_fops_unl_ioctl(), while returning, the
>    pm_runtime_put() will make the usage count 0 and the runtime PM
>    framework will engage the runtime suspend entry.
> f. pci_pm_runtime_suspend() will be called and invokes the driver's
>    runtime suspend callback.
> g. vfio_pci_core_runtime_suspend() will change the power state to D0
>    and do the INTx mask related handling.
> h. pci_pm_runtime_suspend() will take care of saving the PCI state and
>    all the power management handling for D3cold.
> 
> D3cold exit:
> 
> a. Set vfio_device_feature_power_management::low_power_state =
>    VFIO_DEVICE_LOW_POWER_STATE_EXIT and call the VFIO_DEVICE_FEATURE IOCTL.
> b. Inside vfio_device_fops_unl_ioctl(), pm_runtime_resume_and_get()
>    will be called first, which will make the usage count 1.
> c. pci_pm_runtime_resume() will take care of moving the device into the
>    D0 state again, and then vfio_pci_core_runtime_resume() will be called.
> d. vfio_pci_core_runtime_resume() will do the INTx unmask related
>    handling.
> e. vfio_pci_core_ioctl_feature() will be invoked.
> f. vfio_pci_core_feature_pm() will be called and it will go into the
>    VFIO_DEVICE_LOW_POWER_STATE_EXIT switch case. platform_pm_engaged and
>    power_state_d3 will be cleared and pm_runtime_get_noresume() will make
>    the usage count 2.
> g. Inside vfio_device_fops_unl_ioctl(), while returning, the
>    pm_runtime_put() will make the usage count 1 and the device will
>    remain in the D0 state.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_config.c |  11 ++-
>  drivers/vfio/pci/vfio_pci_core.c   | 131 ++++++++++++++++++++++++++++-
>  include/linux/vfio_pci_core.h      |   1 +
>  include/uapi/linux/vfio.h          |  18 ++++
>  4 files changed, 159 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index af0ae80ef324..65b1bc9586ab 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -25,6 +25,7 @@
>  #include <linux/uaccess.h>
>  #include <linux/vfio.h>
>  #include <linux/slab.h>
> +#include <linux/pm_runtime.h>
>  
>  #include <linux/vfio_pci_core.h>
>  
> @@ -1936,16 +1937,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>  ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>  			   size_t count, loff_t *ppos, bool iswrite)
>  {
> +	struct device *dev = &vdev->pdev->dev;
>  	size_t done = 0;
>  	int ret = 0;
>  	loff_t pos = *ppos;
>  
>  	pos &= VFIO_PCI_OFFSET_MASK;
>  
> +	ret = pm_runtime_resume_and_get(dev);
> +	if (ret < 0)
> +		return ret;

Alternatively we could just check platform_pm_engaged here and return
-EINVAL, right?  Why is waking the device the better option?

> +
>  	while (count) {
>  		ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
> -		if (ret < 0)
> +		if (ret < 0) {
> +			pm_runtime_put(dev);
>  			return ret;
> +		}
>  
>  		count -= ret;
>  		done += ret;
> @@ -1953,6 +1961,7 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>  		pos += ret;
>  	}
>  
> +	pm_runtime_put(dev);
>  	*ppos += done;
>  
>  	return done;
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 05a68ca9d9e7..beac6e05f97f 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -234,7 +234,14 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	ret = pci_set_power_state(pdev, state);
>  
>  	if (!ret) {
> -		vdev->power_state_d3 = (pdev->current_state >= PCI_D3hot);
> +		/*
> +		 * If 'platform_pm_engaged' is true then 'power_state_d3' can
> +		 * be cleared only when the user makes an explicit request to
> +		 * move out of low power state via the power management ioctl.
> +		 */
> +		if (!vdev->platform_pm_engaged)
> +			vdev->power_state_d3 =
> +				(pdev->current_state >= PCI_D3hot);

power_state_d3 is essentially only used as a secondary test to
__vfio_pci_memory_enabled() to block r/w access to device regions and
generate a fault on mmap access.  Its existence already seems a little
questionable when we could just look at vdev->pdev->current_state, and
we could incorporate that into __vfio_pci_memory_enabled().  So rather
than creating this inconsistency, couldn't we just make that function
return:

!vdev->platform_pm_enagaged && pdev->current_state < PCI_D3hot &&
(pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY))


>  
>  		/* D3 might be unsupported via quirk, skip unless in D3 */
>  		if (needs_save && pdev->current_state >= PCI_D3hot) {
> @@ -266,6 +273,25 @@ static int vfio_pci_core_runtime_suspend(struct device *dev)
>  {
>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>  
> +	down_read(&vdev->memory_lock);
> +
> +	/* 'platform_pm_engaged' will be false if there are no users. */
> +	if (!vdev->platform_pm_engaged) {
> +		up_read(&vdev->memory_lock);
> +		return 0;
> +	}
> +
> +	/*
> +	 * The user will move the device into D3hot state first before invoking
> +	 * power management ioctl. Move the device into D0 state here and then
> +	 * the pci-driver core runtime PM suspend will move the device into
> +	 * low power state. Also, for the devices which have NoSoftRst-,
> +	 * it will help in restoring the original state (saved locally in
> +	 * 'vdev->pm_save').
> +	 */
> +	vfio_pci_set_power_state(vdev, PCI_D0);
> +	up_read(&vdev->memory_lock);
> +
>  	/*
>  	 * If INTx is enabled, then mask INTx before going into runtime
>  	 * suspended state and unmask the same in the runtime resume.
> @@ -395,6 +421,19 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  
>  	/*
>  	 * This function can be invoked while the power state is non-D0.
> +	 * This non-D0 power state can be with or without runtime PM.
> +	 * Increment the usage count corresponding to pm_runtime_put()
> +	 * called during setting of 'platform_pm_engaged'. The device will
> +	 * wake up if it has already gone into the suspended state. Otherwise,
> +	 * the next vfio_pci_set_power_state() will change the
> +	 * device power state to D0.
> +	 */
> +	if (vdev->platform_pm_engaged) {
> +		pm_runtime_resume_and_get(&pdev->dev);
> +		vdev->platform_pm_engaged = false;
> +	}
> +
> +	/*
>  	 * This function calls __pci_reset_function_locked() which internally
>  	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
>  	 * fail if the power state is non-D0. Also, for the devices which
> @@ -1192,6 +1231,80 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>  
> +#ifdef CONFIG_PM
> +static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
> +				    void __user *arg, size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct vfio_device_feature_power_management vfio_pm = { 0 };
> +	int ret = 0;
> +
> +	ret = vfio_check_feature(flags, argsz,
> +				 VFIO_DEVICE_FEATURE_SET |
> +				 VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(vfio_pm));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (flags & VFIO_DEVICE_FEATURE_GET) {
> +		down_read(&vdev->memory_lock);
> +		vfio_pm.low_power_state = vdev->platform_pm_engaged ?
> +				VFIO_DEVICE_LOW_POWER_STATE_ENTER :
> +				VFIO_DEVICE_LOW_POWER_STATE_EXIT;
> +		up_read(&vdev->memory_lock);
> +		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
> +			return -EFAULT;
> +		return 0;
> +	}
> +
> +	if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm)))
> +		return -EFAULT;
> +
> +	/*
> +	 * The vdev power related fields are protected with memory_lock
> +	 * semaphore.
> +	 */
> +	down_write(&vdev->memory_lock);
> +	switch (vfio_pm.low_power_state) {
> +	case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
> +		if (!vdev->power_state_d3 || vdev->platform_pm_engaged) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +
> +		vdev->platform_pm_engaged = true;
> +
> +		/*
> +		 * The pm_runtime_put() will be called again while returning
> +		 * from ioctl after which the device can go into runtime
> +		 * suspended.
> +		 */
> +		pm_runtime_put_noidle(&pdev->dev);
> +		break;
> +
> +	case VFIO_DEVICE_LOW_POWER_STATE_EXIT:
> +		if (!vdev->platform_pm_engaged) {
> +			ret = -EINVAL;
> +			break;
> +		}
> +
> +		vdev->platform_pm_engaged = false;
> +		vdev->power_state_d3 = false;
> +		pm_runtime_get_noresume(&pdev->dev);
> +		break;
> +
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	up_write(&vdev->memory_lock);
> +	return ret;
> +}
> +#endif
> +
>  static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>  				       void __user *arg, size_t argsz)
>  {
> @@ -1226,6 +1339,10 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +#ifdef CONFIG_PM
> +	case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT:
> +		return vfio_pci_core_feature_pm(device, flags, arg, argsz);
> +#endif
>  	default:
>  		return -ENOTTY;
>  	}
> @@ -2189,6 +2306,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		goto err_unlock;
>  	}
>  
> +	/*
> +	 * Some of the devices in the dev_set can be in the runtime suspended
> +	 * state. Increment the usage count for all the devices in the dev_set
> +	 * before reset and decrement the same after reset.
> +	 */
> +	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
> +	if (ret)
> +		goto err_unlock;
> +
>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>  		/*
>  		 * Test whether all the affected devices are contained by the
> @@ -2244,6 +2370,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		else
>  			mutex_unlock(&cur->vma_lock);
>  	}
> +
> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> +		pm_runtime_put(&cur->pdev->dev);
>  err_unlock:
>  	mutex_unlock(&dev_set->lock);
>  	return ret;
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index e84f31e44238..337983a877d6 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -126,6 +126,7 @@ struct vfio_pci_core_device {
>  	bool			needs_pm_restore;
>  	bool			power_state_d3;
>  	bool			pm_intx_masked;
> +	bool			platform_pm_engaged;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index fea86061b44e..53ff890dbd27 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -986,6 +986,24 @@ enum vfio_device_mig_state {
>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>  };
>  
> +/*
> + * Use platform-based power management for moving the device into low power
> + * state.  This low power state is device specific.
> + *
> + * For PCI, this low power state is D3cold.  The native PCI power management
> + * does not support the D3cold power state.  For moving the device into D3cold
> + * state, change the PCI state to D3hot with the standard configuration
> + * registers and then call this IOCTL to set the D3cold state.  Similarly, if
> + * the device is in D3cold state, call this IOCTL to exit from D3cold state.
> + */
> +struct vfio_device_feature_power_management {
> +#define VFIO_DEVICE_LOW_POWER_STATE_EXIT	0x0
> +#define VFIO_DEVICE_LOW_POWER_STATE_ENTER	0x1
> +	__u64	low_power_state;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3

__u8 seems more than sufficient here.  Thanks,

Alex

> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 5/8] vfio/pci: Enable runtime PM for vfio_pci_core based drivers
  2022-05-04 19:42   ` Alex Williamson
@ 2022-05-05  9:07     ` Abhishek Sahu
  0 siblings, 0 replies; 41+ messages in thread
From: Abhishek Sahu @ 2022-05-05  9:07 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 5/5/2022 1:12 AM, Alex Williamson wrote:
> On Mon, 25 Apr 2022 14:56:12 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> Currently, there is very limited power management support
>> available in the upstream vfio_pci_core based drivers. If there
>> are no users of the device, then the PCI device will be moved into
>> D3hot state by writing directly into PCI PM registers. This D3hot
>> state help in saving power but we can achieve zero power consumption
>> if we go into the D3cold state. The D3cold state cannot be possible
>> with native PCI PM. It requires interaction with platform firmware
>> which is system-specific. To go into low power states (including D3cold),
>> the runtime PM framework can be used which internally interacts with PCI
>> and platform firmware and puts the device into the lowest possible
>> D-States.
>>
>> This patch registers vfio_pci_core based drivers with the
>> runtime PM framework.
>>
>> 1. The PCI core framework takes care of most of the runtime PM
>>    related things. For enabling the runtime PM, the PCI driver needs to
>>    decrement the usage count and needs to provide 'struct dev_pm_ops'
>>    at least. The runtime suspend/resume callbacks are optional and needed
>>    only if we need to do any extra handling. Now there are multiple
>>    vfio_pci_core based drivers. Instead of assigning the
>>    'struct dev_pm_ops' in individual parent driver, the vfio_pci_core
>>    itself assigns the 'struct dev_pm_ops'. There are other drivers where
>>    the 'struct dev_pm_ops' is being assigned inside core layer
>>    (For example, wlcore_probe() and some sound based driver, etc.).
>>
>> 2. This patch provides the stub implementation of 'struct dev_pm_ops'.
>>    The subsequent patch will provide the runtime suspend/resume
>>    callbacks. All the config state saving, and PCI power management
>>    related things will be done by PCI core framework itself inside its
>>    runtime suspend/resume callbacks (pci_pm_runtime_suspend() and
>>    pci_pm_runtime_resume()).
>>
>> 3. Inside pci_reset_bus(), all the devices in dev_set need to be
>>    runtime resumed. vfio_pci_dev_set_pm_runtime_get() will take
>>    care of the runtime resume and its error handling.
>>
>> 4. Inside vfio_pci_core_disable(), the device usage count always needs
>>    to be decremented which was incremented in vfio_pci_core_enable().
>>
>> 5. Since the runtime PM framework will provide the same functionality,
>>    so directly writing into PCI PM config register can be replaced with
>>    the use of runtime PM routines. Also, the use of runtime PM can help
>>    us in more power saving.
>>
>>    In the systems which do not support D3cold,
>>
>>    With the existing implementation:
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3hot
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D0
>>
>>    With runtime PM:
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3hot
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D3hot
>>
>>    So, with runtime PM, the upstream bridge or root port will also go
>>    into a lower power state, which is not possible with the existing
>>    implementation.
>>
>>    In the systems which support D3cold,
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3hot
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D0
>>
>>    With runtime PM:
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3cold
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D3cold
>>
>>    So, with runtime PM, both the PCI device and upstream bridge will
>>    go into D3cold state.
>>
>> 6. If the 'disable_idle_d3' module parameter is set, then the runtime
>>    PM will still be enabled, but in this case, the usage count should not be
>>    decremented.
>>
>> 7. vfio_pci_dev_set_try_reset() return value is unused now, so this
>>    function return type can be changed to void.
>>
>> 8. Use the runtime PM API's in vfio_pci_core_sriov_configure().
>>    For preventing any runtime usage mismatch, pci_num_vf() has been
>>    called explicitly during disable.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_core.c | 169 +++++++++++++++++++++----------
>>  1 file changed, 114 insertions(+), 55 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 953ac33b2f5f..aee5e0cd6137 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -156,7 +156,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>>  }
>>  
>>  struct vfio_pci_group_info;
>> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>  				      struct vfio_pci_group_info *groups);
>>  
>> @@ -261,6 +261,19 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>  	return ret;
>>  }
>>  
>> +#ifdef CONFIG_PM
>> +/*
>> + * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
>> + * so use structure without any callbacks.
>> + *
>> + * The pci-driver core runtime PM routines always save the device state
>> + * before going into suspended state. If the device is going into low power
>> + * state with only with runtime PM ops, then no explicit handling is needed
>> + * for the devices which have NoSoftRst-.
>> + */
>> +static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
>> +#endif
>> +
>>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>>  {
>>  	struct pci_dev *pdev = vdev->pdev;
>> @@ -268,21 +281,23 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>>  	u16 cmd;
>>  	u8 msix_pos;
>>  
>> -	vfio_pci_set_power_state(vdev, PCI_D0);
>> +	if (!disable_idle_d3) {
>> +		ret = pm_runtime_resume_and_get(&pdev->dev);
>> +		if (ret < 0)
>> +			return ret;
>> +	}
>>  
>>  	/* Don't allow our initial saved state to include busmaster */
>>  	pci_clear_master(pdev);
>>  
>>  	ret = pci_enable_device(pdev);
>>  	if (ret)
>> -		return ret;
>> +		goto out_power;
>>  
>>  	/* If reset fails because of the device lock, fail this path entirely */
>>  	ret = pci_try_reset_function(pdev);
>> -	if (ret == -EAGAIN) {
>> -		pci_disable_device(pdev);
>> -		return ret;
>> -	}
>> +	if (ret == -EAGAIN)
>> +		goto out_disable_device;
>>  
>>  	vdev->reset_works = !ret;
>>  	pci_save_state(pdev);
>> @@ -306,12 +321,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>>  	}
>>  
>>  	ret = vfio_config_init(vdev);
>> -	if (ret) {
>> -		kfree(vdev->pci_saved_state);
>> -		vdev->pci_saved_state = NULL;
>> -		pci_disable_device(pdev);
>> -		return ret;
>> -	}
>> +	if (ret)
>> +		goto out_free_state;
>>  
>>  	msix_pos = pdev->msix_cap;
>>  	if (msix_pos) {
>> @@ -332,6 +343,16 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>>  
>>  
>>  	return 0;
>> +
>> +out_free_state:
>> +	kfree(vdev->pci_saved_state);
>> +	vdev->pci_saved_state = NULL;
>> +out_disable_device:
>> +	pci_disable_device(pdev);
>> +out_power:
>> +	if (!disable_idle_d3)
>> +		pm_runtime_put(&pdev->dev);
>> +	return ret;
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_pci_core_enable);
>>  
>> @@ -439,8 +460,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>  out:
>>  	pci_disable_device(pdev);
>>  
>> -	if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
>> -		vfio_pci_set_power_state(vdev, PCI_D3hot);
>> +	vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
>> +
>> +	/* Put the pm-runtime usage counter acquired during enable */
>> +	if (!disable_idle_d3)
>> +		pm_runtime_put(&pdev->dev);
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
>>  
>> @@ -1879,19 +1903,24 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
>>  
>>  	vfio_pci_probe_power_state(vdev);
>>  
>> -	if (!disable_idle_d3) {
>> -		/*
>> -		 * pci-core sets the device power state to an unknown value at
>> -		 * bootup and after being removed from a driver.  The only
>> -		 * transition it allows from this unknown state is to D0, which
>> -		 * typically happens when a driver calls pci_enable_device().
>> -		 * We're not ready to enable the device yet, but we do want to
>> -		 * be able to get to D3.  Therefore first do a D0 transition
>> -		 * before going to D3.
>> -		 */
>> -		vfio_pci_set_power_state(vdev, PCI_D0);
>> -		vfio_pci_set_power_state(vdev, PCI_D3hot);
>> -	}
>> +	/*
>> +	 * pci-core sets the device power state to an unknown value at
>> +	 * bootup and after being removed from a driver.  The only
>> +	 * transition it allows from this unknown state is to D0, which
>> +	 * typically happens when a driver calls pci_enable_device().
>> +	 * We're not ready to enable the device yet, but we do want to
>> +	 * be able to get to D3.  Therefore first do a D0 transition
>> +	 * before enabling runtime PM.
>> +	 */
>> +	vfio_pci_set_power_state(vdev, PCI_D0);
>> +
>> +#if defined(CONFIG_PM)
>> +	dev->driver->pm = &vfio_pci_core_pm_ops;
>> +#endif
>> +
>> +	pm_runtime_allow(dev);
>> +	if (!disable_idle_d3)
>> +		pm_runtime_put(dev);
>>  
>>  	ret = vfio_register_group_dev(&vdev->vdev);
>>  	if (ret)
>> @@ -1900,7 +1929,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev,
>>  
>>  out_power:
>>  	if (!disable_idle_d3)
>> -		vfio_pci_set_power_state(vdev, PCI_D0);
>> +		pm_runtime_get_noresume(dev);
>> +
>> +	pm_runtime_forbid(dev);
>>  out_vf:
>>  	vfio_pci_vf_uninit(vdev);
>>  out_drvdata:
>> @@ -1922,8 +1953,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>>  	vfio_pci_vga_uninit(vdev);
>>  
>>  	if (!disable_idle_d3)
>> -		vfio_pci_set_power_state(vdev, PCI_D0);
>> +		pm_runtime_get_noresume(dev);
>>  
>> +	pm_runtime_forbid(dev);
>>  	dev_set_drvdata(dev, NULL);
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
>> @@ -1984,18 +2016,26 @@ int vfio_pci_core_sriov_configure(struct pci_dev *pdev, int nr_virtfn)
>>  
>>  		/*
>>  		 * The PF power state should always be higher than the VF power
>> -		 * state. If PF is in the low power state, then change the
>> -		 * power state to D0 first before enabling SR-IOV.
>> +		 * state. If PF is in the runtime suspended state, then resume
>> +		 * it first before enabling SR-IOV.
>>  		 */
>> -		vfio_pci_set_power_state(vdev, PCI_D0);
>> -		ret = pci_enable_sriov(pdev, nr_virtfn);
>> +		ret = pm_runtime_resume_and_get(&pdev->dev);
>>  		if (ret)
>>  			goto out_del;
>> +
>> +		ret = pci_enable_sriov(pdev, nr_virtfn);
>> +		if (ret) {
>> +			pm_runtime_put(&pdev->dev);
>> +			goto out_del;
>> +		}
>>  		ret = nr_virtfn;
>>  		goto out_put;
>>  	}
>>  
>> -	pci_disable_sriov(pdev);
>> +	if (pci_num_vf(pdev)) {
>> +		pci_disable_sriov(pdev);
>> +		pm_runtime_put(&pdev->dev);
>> +	}
>>  
>>  out_del:
>>  	mutex_lock(&vfio_pci_sriov_pfs_mutex);
>> @@ -2072,6 +2112,30 @@ vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set)
>>  	return pdev;
>>  }
>>  
>> +static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
>> +{
>> +	struct vfio_pci_core_device *cur_pm;
>> +	struct vfio_pci_core_device *cur;
>> +	int ret = 0;
>> +
>> +	list_for_each_entry(cur_pm, &dev_set->device_list, vdev.dev_set_list) {
>> +		ret = pm_runtime_resume_and_get(&cur_pm->pdev->dev);
>> +		if (ret < 0)
>> +			break;
>> +	}
>> +
>> +	if (!ret)
>> +		return 0;
>> +
>> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
>> +		if (cur == cur_pm)
>> +			break;
>> +		pm_runtime_put(&cur->pdev->dev);
>> +	}
>> +
>> +	return ret;
>> +}
> 
> The above works, but maybe could be a little cleaner taking advantage
> of list_for_each_entry_continue_reverse as:
> 
> {
> 	struct vfio_pci_core_device *cur;
> 	int ret;
> 
> 	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
> 		ret = pm_runtime_resume_and_get(&cur->pdev->dev);
> 		if (ret)
> 			goto unwind;
> 	}
> 
> 	return 0;
> 
> unwind:
> 	list_for_each_entry_continue_reverse(cur, &dev_set->device_list, vdev.dev_set_list)
> 		pm_runtime_put(&cur->pdev->dev);
> 
> 	return ret;
> }
> 
> Thanks,
> Alex
> 

 Thanks Alex.
 I will make this change.

 Regards,
 Abhishek

>> +
>>  /*
>>   * We need to get memory_lock for each device, but devices can share mmap_lock,
>>   * therefore we need to zap and hold the vma_lock for each device, and only then
>> @@ -2178,43 +2242,38 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
>>   *  - At least one of the affected devices is marked dirty via
>>   *    needs_reset (such as by lack of FLR support)
>>   * Then attempt to perform that bus or slot reset.
>> - * Returns true if the dev_set was reset.
>>   */
>> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>>  {
>>  	struct vfio_pci_core_device *cur;
>>  	struct pci_dev *pdev;
>> -	int ret;
>> +	bool reset_done = false;
>>  
>>  	if (!vfio_pci_dev_set_needs_reset(dev_set))
>> -		return false;
>> +		return;
>>  
>>  	pdev = vfio_pci_dev_set_resettable(dev_set);
>>  	if (!pdev)
>> -		return false;
>> +		return;
>>  
>>  	/*
>> -	 * The pci_reset_bus() will reset all the devices in the bus.
>> -	 * The power state can be non-D0 for some of the devices in the bus.
>> -	 * For these devices, the pci_reset_bus() will internally set
>> -	 * the power state to D0 without vfio driver involvement.
>> -	 * For the devices which have NoSoftRst-, the reset function can
>> -	 * cause the PCI config space reset without restoring the original
>> -	 * state (saved locally in 'vdev->pm_save').
>> +	 * Some of the devices in the bus can be in the runtime suspended
>> +	 * state. Increment the usage count for all the devices in the dev_set
>> +	 * before reset and decrement the same after reset.
>>  	 */
>> -	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
>> -		vfio_pci_set_power_state(cur, PCI_D0);
>> +	if (!disable_idle_d3 && vfio_pci_dev_set_pm_runtime_get(dev_set))
>> +		return;
>>  
>> -	ret = pci_reset_bus(pdev);
>> -	if (ret)
>> -		return false;
>> +	if (!pci_reset_bus(pdev))
>> +		reset_done = true;
>>  
>>  	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
>> -		cur->needs_reset = false;
>> +		if (reset_done)
>> +			cur->needs_reset = false;
>> +
>>  		if (!disable_idle_d3)
>> -			vfio_pci_set_power_state(cur, PCI_D3hot);
>> +			pm_runtime_put(&cur->pdev->dev);
>>  	}
>> -	return true;
>>  }
>>  
>>  void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 6/8] vfio: Invoke runtime PM API for IOCTL request
  2022-05-04 19:42   ` Alex Williamson
@ 2022-05-05  9:40     ` Abhishek Sahu
  2022-05-09 22:30       ` Alex Williamson
  0 siblings, 1 reply; 41+ messages in thread
From: Abhishek Sahu @ 2022-05-05  9:40 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 5/5/2022 1:12 AM, Alex Williamson wrote:
> On Mon, 25 Apr 2022 14:56:13 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> The vfio/pci driver will have runtime power management support where the
>> user can put the device into low power state and then PCI devices can go
>> into the D3cold state. If the device is in low power state and the user
>> issues any IOCTL, then the device should be moved out of low power state
>> first. Once the IOCTL is serviced, it can go into low power state again.
>> The runtime PM framework manages this with the help of a usage count. One
>> option was to add the runtime PM related API's inside the vfio/pci driver,
>> but some IOCTLs (like VFIO_DEVICE_FEATURE) can follow a different path and
>> more IOCTLs can be added in the future. Also, the runtime PM will be added
>> for the vfio/pci based driver variants currently, but the other vfio based
>> drivers can use the same in the future. So, this patch adds the runtime PM
>> related API calls in the top-level IOCTL function itself.
>>
>> For the vfio drivers which do not have runtime power management support
>> currently, the runtime PM API's won't be invoked. Only for vfio/pci
>> based drivers currently, the runtime PM API's will be invoked to increment
>> and decrement the usage count. Keeping this usage count incremented while
>> servicing an IOCTL will make sure that the user won't put the device into
>> low power state while any other IOCTL is being serviced in parallel.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/vfio.c | 44 +++++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 41 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index a4555014bd1e..4e65a127744e 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -32,6 +32,7 @@
>>  #include <linux/vfio.h>
>>  #include <linux/wait.h>
>>  #include <linux/sched/signal.h>
>> +#include <linux/pm_runtime.h>
>>  #include "vfio.h"
>>  
>>  #define DRIVER_VERSION	"0.3"
>> @@ -1536,6 +1537,30 @@ static const struct file_operations vfio_group_fops = {
>>  	.release	= vfio_group_fops_release,
>>  };
>>  
>> +/*
>> + * Wrapper around pm_runtime_resume_and_get().
>> + * Return 0, if driver power management callbacks are not present i.e. the driver is not
> 
> Mind the gratuitous long comment line here.
> 
 
 Thanks Alex.
 
 That was a miss. I will fix this.
 
>> + * using runtime power management.
>> + * Return 1 upon success, otherwise -errno
> 
> Changing semantics vs the thing we're wrapping, why not provide a
> wrapper for the `put` as well to avoid?  The only cases where we return
> zero are just as easy to detect on the other side.
> 

 Yes. Using wrapper function for put is better option.
 I will make the changes.

>> + */
>> +static inline int vfio_device_pm_runtime_get(struct device *dev)
> 
> Given some of Jason's recent series, this should probably just accept a
> vfio_device.
> 

 Sorry. I didn't get this part.

 Do I need to change it to

 static inline int vfio_device_pm_runtime_get(struct vfio_device *device)
 {
    struct device *dev = device->dev;
    ...
 }

>> +{
>> +#ifdef CONFIG_PM
>> +	int ret;
>> +
>> +	if (!dev->driver || !dev->driver->pm)
>> +		return 0;
>> +
>> +	ret = pm_runtime_resume_and_get(dev);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	return 1;
>> +#else
>> +	return 0;
>> +#endif
>> +}
>> +
>>  /*
>>   * VFIO Device fd
>>   */
>> @@ -1845,15 +1870,28 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
>>  				       unsigned int cmd, unsigned long arg)
>>  {
>>  	struct vfio_device *device = filep->private_data;
>> +	int pm_ret, ret = 0;
>> +
>> +	pm_ret = vfio_device_pm_runtime_get(device->dev);
>> +	if (pm_ret < 0)
>> +		return pm_ret;
> 
> I wonder if we might simply want to mask pm errors behind -EIO, maybe
> with a rate limited dev_info().  My concern would be that we might mask
> errnos that userspace has come to expect for certain ioctls.  Thanks,
> 
> Alex
> 

  I need to do something like the following. Correct?

  ret = vfio_device_pm_runtime_get(device);
  if (ret < 0) {
     dev_info_ratelimited(device->dev, "vfio: runtime resume failed %d\n", ret);
     return -EIO;
  }
  
  Regards,
  Abhishek
 
>>  
>>  	switch (cmd) {
>>  	case VFIO_DEVICE_FEATURE:
>> -		return vfio_ioctl_device_feature(device, (void __user *)arg);
>> +		ret = vfio_ioctl_device_feature(device, (void __user *)arg);
>> +		break;
>>  	default:
>>  		if (unlikely(!device->ops->ioctl))
>> -			return -EINVAL;
>> -		return device->ops->ioctl(device, cmd, arg);
>> +			ret = -EINVAL;
>> +		else
>> +			ret = device->ops->ioctl(device, cmd, arg);
>> +		break;
>>  	}
>> +
>> +	if (pm_ret)
>> +		pm_runtime_put(device->dev);
>> +
>> +	return ret;
>>  }
>>  
>>  static ssize_t vfio_device_fops_read(struct file *filep, char __user *buf,
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-05-04 19:45   ` Alex Williamson
@ 2022-05-05 12:16     ` Abhishek Sahu
  2022-05-09 21:48       ` Alex Williamson
  0 siblings, 1 reply; 41+ messages in thread
From: Abhishek Sahu @ 2022-05-05 12:16 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 5/5/2022 1:15 AM, Alex Williamson wrote:
> On Mon, 25 Apr 2022 14:56:15 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> Currently, if the runtime power management is enabled for a vfio-pci
>> based device in the guest OS, then the guest OS will do the register
>> write for the PCI_PM_CTRL register. This write request will be handled in
>> vfio_pm_config_write() where it will do the actual register write
>> of PCI_PM_CTRL register. With this, the maximum D3hot state can be
>> achieved for low power. If we can use the runtime PM framework,
>> then we can achieve the D3cold state which will help in saving
>> maximum power.
>>
>> 1. Since D3cold state can't be achieved by writing PCI standard
>>    PM config registers, so this patch adds a new feature in the
>>    existing VFIO_DEVICE_FEATURE IOCTL. This IOCTL can be used
>>    to change the PCI device from D3hot to D3cold state and
>>    then D3cold to D0 state. The device feature uses the term low power
>>    instead of D3cold so that if another vfio driver wants to implement
>>    low power support, the same IOCTL can be used.
> 
> How does this enable you to handle the full-off vs memory-refresh modes
> for NVIDIA GPUs?
> 
 
 Thanks Alex.

 This patch series will just enable full-off for the NVIDIA GPU.
 The self-refresh mode won't work.

 The self-refresh case is NVIDIA specific and needs driver
 involvement each time before going into D3cold. We are evaluating
 internally whether we have enough use cases for the self-refresh mode,
 and then I will plan a separate patch series to support the self-refresh
 use case, if required. But that will be independent of this patch series.

 At a high level, if we want to support the NVIDIA self-refresh use case
 inside a VM, we need some way to disable PCI device access from the host
 side, or to forward an event to the VM for every access on the host side.
 Otherwise, from the driver side, we can disable self-refresh mode if the
 driver is running inside a VM. In that case, if memory usage is higher
 than the threshold, we don't engage RTD3 at all.

> The feature ioctl supports a probe, but here the probe only indicates
> that the ioctl is available, not what degree of low power support
> available.  Even if the host doesn't support d3cold for the device, we
> can still achieve root port d3hot, but can we provide further
> capability info to the user?
>

 I wanted to add more information here but was not sure which
 information would be helpful to the user. There is no certain way to
 predict that runtime suspend will use the D3cold state, even on
 supported systems. The user can disable runtime power management from

 /sys/bus/pci/devices/…/power/control

 Or disable D3cold itself

 /sys/bus/pci/devices/…/d3cold_allowed


 Even if all these are allowed, platform_pci_choose_state()
 is the main function where the target low power state is selected
 at runtime.

 Probably we can expose the pci_pr3_present() status to the user, which
 gives a hint that the ACPI methods required for D3cold are present on
 the platform.
  
>> 2. The hypervisors can implement virtual ACPI methods. For
>>    example, in guest linux OS if PCI device ACPI node has _PR3 and _PR0
>>    power resources with _ON/_OFF method, then guest linux OS makes the
>>    _OFF call during D3cold transition and then _ON during D0 transition.
>>    The hypervisor can tap these virtual ACPI calls and then do the D3cold
>>    related IOCTL in the vfio driver.
>>
>> 3. The vfio driver uses runtime PM framework to achieve the
>>    D3cold state. For the D3cold transition, decrement the usage count and
>>    for the D0 transition, increment the usage count.
>>
>> 4. For D3cold, the device current power state should be D3hot.
>>    Then during runtime suspend, the pci_platform_power_transition() is
>>    required for D3cold state. If the D3cold state is not supported, then
>>    the device will still be in D3hot state. But with the runtime PM, the
>>    root port can now also go into suspended state.
> 
> Why do we create this requirement for the device to be in d3hot prior
> to entering low power 

 This is mainly to make the hypervisor integration follow
 the PCI power management code flow.

 If we look at the power management steps, the following 2 steps
 are involved:

 1. First move the device from D0 to D3hot state by writing
    into the config register.
 2. Then invoke ACPI routines (mainly the _PR3 OFF method) to
    move from D3hot to D3cold.

 So, on the guest side, we can follow the same steps. The guest can
 do the config register write and then, for step 2, the hypervisor
 can implement virtual ACPI with _PR3/_PR0 power resources.
 Inside this virtual ACPI implementation, the hypervisor can invoke
 the power management IOCTL.

 Also, if runtime PM has been disabled from the host side,
 the device will still be in the D3hot state.

> when our pm ops suspend function wakes the device to d0?

 The change to D0 here happens for 2 reasons:

 1. First, to preserve the device state for NoSoftRst- devices.
 2. To make use of the PCI core layer's generic code for runtime suspend;
    otherwise, we would need to do all the handling here that is present
    in pci_pm_runtime_suspend().

>> 5. For most of the systems, the D3cold is supported at the root
>>    port level. So, when the root port transitions to D3cold state, the
>>    vfio PCI device will go from D3hot to D3cold state during its
>>    runtime suspend. If the root port does not support D3cold, then the
>>    root port will go into D3hot state.
>>
>> 6. The runtime suspend callback can now happen in 2 cases: there
>>    are no users of the vfio device, and the case where the user has initiated
>>    D3cold. The 'platform_pm_engaged' flag can help to distinguish
>>    between these 2 cases.
> 
> If this were the only use case we could rely on vfio_device.open_count
> instead.  I don't think it is though.  

 platform_pm_engaged is mainly to track the user-initiated
 low power entry with the IOCTL. So even if we use vfio_device.open_count
 here, we will still require platform_pm_engaged.

>> 7. In D3cold, all kind of BAR related access needs to be disabled
>>    like D3hot. Additionally, the config space will also be disabled in
>>    D3cold state. To prevent access of config space in D3cold state, do
>>    increment the runtime PM usage count before doing any config space
>>    access.
> 
> Or we could actually prevent access to config space rather than waking
> the device for the access.  Addressed in further comment below.
>  
>> 8. If user has engaged low power entry through IOCTL, then user should
>>    do low power exit first. The user can issue config access or IOCTL
>>    after low power entry. We can add an explicit error check but since
>>    we are already waking-up device, so IOCTL and config access can be
>>    fulfilled. But 'power_state_d3' won't be cleared without issuing
>>    low power exit so all BAR related access will still return error till
>>    user do low power exit.
> 
> The fact that power_state_d3 no longer tracks the device power state
> when platform_pm_engaged is set is a confusing discontinuity.
> 

 If we refer to the power management steps (as mentioned above),
 then these 2 variables track different things.

 1. power_state_d3 tracks the config space write.
 2. platform_pm_engaged tracks the IOCTL call. In the IOCTL, we decrement
    the runtime usage count, so we need to track that we have decremented
    it.
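
 As a toy model (plain user-space C, not kernel code; pm_runtime is
 reduced to a bare counter, and the function names in the comments refer
 to the flow quoted below), the usage count and flag transitions for the
 entry/exit paths look like this:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the runtime PM usage count; the device starts
 * with count 1 while the vfio user holds it open. */
struct dev_model {
	int usage_count;
	bool platform_pm_engaged;
	bool suspended;
};

static void pm_get(struct dev_model *d)
{
	d->usage_count++;		/* pm_runtime_resume_and_get() */
	d->suspended = false;
}

static void pm_put(struct dev_model *d)
{
	if (--d->usage_count == 0)
		d->suspended = true;	/* runtime suspend may now run */
}

/* VFIO_DEVICE_LOW_POWER_STATE_ENTER as seen from the ioctl path. */
static void low_power_enter(struct dev_model *d)
{
	pm_get(d);			/* vfio_device_fops_unl_ioctl(): 1 -> 2 */
	d->platform_pm_engaged = true;	/* vfio_pci_core_feature_pm() */
	d->usage_count--;		/* pm_runtime_put_noidle(): 2 -> 1 */
	pm_put(d);			/* ioctl return: 1 -> 0, suspend runs */
}

/* VFIO_DEVICE_LOW_POWER_STATE_EXIT. */
static void low_power_exit(struct dev_model *d)
{
	pm_get(d);			/* wakes the device: 0 -> 1 */
	d->platform_pm_engaged = false;
	d->usage_count++;		/* pm_runtime_get_noresume(): 1 -> 2 */
	pm_put(d);			/* ioctl return: 2 -> 1, stays in D0 */
}
```

 The asymmetric put_noidle/get_noresume pair is exactly why
 platform_pm_engaged must record that the count was handed over to the
 user: the exit path has to give that reference back.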

>> 9. Since multiple layers are involved, so following is the high level
>>    code flow for D3cold entry and exit.
>>
>> D3cold entry:
>>
>> a. User put the PCI device into D3hot by writing into standard config
>>    register (vfio_pm_config_write() -> vfio_lock_and_set_power_state() ->
>>    vfio_pci_set_power_state()). The device power state will be D3hot and
>>    power_state_d3 will be true.
>> b. Set vfio_device_feature_power_management::low_power_state =
>>    VFIO_DEVICE_LOW_POWER_STATE_ENTER and call VFIO_DEVICE_FEATURE IOCTL.
>> c. Inside vfio_device_fops_unl_ioctl(), pm_runtime_resume_and_get()
>>    will be called first which will make the usage count as 2 and then
>>    vfio_pci_core_ioctl_feature() will be invoked.
>> d. vfio_pci_core_feature_pm() will be called and it will go inside
>>    VFIO_DEVICE_LOW_POWER_STATE_ENTER switch case. platform_pm_engaged will
>>    be true and pm_runtime_put_noidle() will decrement the usage count
>>    to 1.
>> e. Inside vfio_device_fops_unl_ioctl() while returning the
>>    pm_runtime_put() will make the usage count to 0 and the runtime PM
>>    framework will engage the runtime suspend entry.
>> f. pci_pm_runtime_suspend() will be called and invokes driver runtime
>>    suspend callback.
>> g. vfio_pci_core_runtime_suspend() will change the power state to D0
>>    and do the INTx mask related handling.
>> h. pci_pm_runtime_suspend() will take care of saving the PCI state and
>>    all power management handling for D3cold.
>>
>> D3cold exit:
>>
>> a. Set vfio_device_feature_power_management::low_power_state =
>>    VFIO_DEVICE_LOW_POWER_STATE_EXIT and call VFIO_DEVICE_FEATURE IOCTL.
>> b. Inside vfio_device_fops_unl_ioctl(), pm_runtime_resume_and_get()
>>    will be called first which will make the usage count as 1.
>> c. pci_pm_runtime_resume() will take care of moving the device into D0
>>    state again and then vfio_pci_core_runtime_resume() will be called.
>> d. vfio_pci_core_runtime_resume() will do the INTx unmask related
>>    handling.
>> e. vfio_pci_core_ioctl_feature() will be invoked.
>> f. vfio_pci_core_feature_pm() will be called and it will go inside
>>    VFIO_DEVICE_LOW_POWER_STATE_EXIT switch case. platform_pm_engaged and
>>    power_state_d3 will be cleared and pm_runtime_get_noresume() will make
>>    the usage count as 2.
>> g. Inside vfio_device_fops_unl_ioctl() while returning the
>>    pm_runtime_put() will make the usage count to 1 and the device will
>>    be in D0 state only.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_config.c |  11 ++-
>>  drivers/vfio/pci/vfio_pci_core.c   | 131 ++++++++++++++++++++++++++++-
>>  include/linux/vfio_pci_core.h      |   1 +
>>  include/uapi/linux/vfio.h          |  18 ++++
>>  4 files changed, 159 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>> index af0ae80ef324..65b1bc9586ab 100644
>> --- a/drivers/vfio/pci/vfio_pci_config.c
>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>> @@ -25,6 +25,7 @@
>>  #include <linux/uaccess.h>
>>  #include <linux/vfio.h>
>>  #include <linux/slab.h>
>> +#include <linux/pm_runtime.h>
>>  
>>  #include <linux/vfio_pci_core.h>
>>  
>> @@ -1936,16 +1937,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>>  ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>  			   size_t count, loff_t *ppos, bool iswrite)
>>  {
>> +	struct device *dev = &vdev->pdev->dev;
>>  	size_t done = 0;
>>  	int ret = 0;
>>  	loff_t pos = *ppos;
>>  
>>  	pos &= VFIO_PCI_OFFSET_MASK;
>>  
>> +	ret = pm_runtime_resume_and_get(dev);
>> +	if (ret < 0)
>> +		return ret;
> 
> Alternatively we could just check platform_pm_engaged here and return
> -EINVAL, right?  Why is waking the device the better option?
> 

 This is mainly to prevent a race condition where a config space access
 happens in parallel with the IOCTL. So, let's consider the following case:

 1. Config space access happens and vfio_pci_config_rw() will be called.
 2. The IOCTL to move into low power state is called.
 3. The IOCTL will move the device into d3cold.
 4. Exit from vfio_pci_config_rw() happened.

 Now, if we just check platform_pm_engaged, then the above
 sequence won't be handled. I checked this parallel access by writing
 a small program where I opened 2 instances and then
 created 2 threads, one for config space access and one for the IOCTL.
 In my case, I got the above sequence.

 The pm_runtime_resume_and_get() will make sure that the device
 usage count stays incremented throughout the config space
 access (or the IOCTL access in the previous patch), so the
 runtime PM framework will not move the device into the suspended
 state.
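
 The protection can be replayed deterministically with the same kind of
 bare-counter model (hypothetical user-space C; the real code uses the
 pm_runtime_*() helpers): because the config path holds its own
 reference across steps 1-4, the count cannot reach zero until the
 config access exits.

```c
#include <assert.h>
#include <stdbool.h>

struct pm_count {
	int count;		/* usage count, 1 while the device is open */
	bool suspended;
};

static void get(struct pm_count *d) { d->count++; d->suspended = false; }
static void put(struct pm_count *d) { if (--d->count == 0) d->suspended = true; }

/* Replay of the problematic interleaving listed above; returns whether
 * the device managed to suspend while the config access was in flight. */
static bool replay(void)
{
	struct pm_count d = { .count = 1 };
	bool suspended_during_config;

	get(&d);	/* 1. vfio_pci_config_rw() entry: 1 -> 2 */
	get(&d);	/* 2. low power ioctl entry: 2 -> 3 */
	d.count--;	/*    pm_runtime_put_noidle(): 3 -> 2 */
	put(&d);	/* 3. ioctl exit: 2 -> 1, no suspend yet */
	suspended_during_config = d.suspended;
	put(&d);	/* 4. config access exit: 1 -> 0, suspend runs now */
	assert(!suspended_during_config && d.suspended);
	return suspended_during_config;
}
```

 The suspend is merely deferred to step 4 rather than lost, which is the
 behavior the pm_runtime_resume_and_get() in vfio_pci_config_rw() buys.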

>> +
>>  	while (count) {
>>  		ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
>> -		if (ret < 0)
>> +		if (ret < 0) {
>> +			pm_runtime_put(dev);
>>  			return ret;
>> +		}
>>  
>>  		count -= ret;
>>  		done += ret;
>> @@ -1953,6 +1961,7 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>  		pos += ret;
>>  	}
>>  
>> +	pm_runtime_put(dev);
>>  	*ppos += done;
>>  
>>  	return done;
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 05a68ca9d9e7..beac6e05f97f 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -234,7 +234,14 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>  	ret = pci_set_power_state(pdev, state);
>>  
>>  	if (!ret) {
>> -		vdev->power_state_d3 = (pdev->current_state >= PCI_D3hot);
>> +		/*
>> +		 * If 'platform_pm_engaged' is true then 'power_state_d3' can
>> +		 * be cleared only when user makes the explicit request to
>> +		 * move out of low power state by using power management ioctl.
>> +		 */
>> +		if (!vdev->platform_pm_engaged)
>> +			vdev->power_state_d3 =
>> +				(pdev->current_state >= PCI_D3hot);
> 
> power_state_d3 is essentially only used as a secondary test to
> __vfio_pci_memory_enabled() to block r/w access to device regions and
> generate a fault on mmap access.  Its existence already seems a little
> questionable when we could just look at vdev->pdev->current_state, and
> we could incorporate that into __vfio_pci_memory_enabled().  So rather
> than creating this inconsistency, couldn't we just make that function
> return:
> 
> !vdev->platform_pm_enagaged && pdev->current_state < PCI_D3hot &&
> (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY))
> 

 The main reason for power_state_d3 is to get it under the
 memory_lock semaphore. But pdev->current_state is not
 protected by any lock. So, will the use of pdev->current_state
 here be safe?
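
 For clarity, the predicate Alex suggests can be modeled in isolation
 (user-space C with the relevant PCI constants inlined; the locking
 question above is a separate concern this sketch does not address):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values from include/uapi/linux/pci_regs.h and linux/pci.h. */
#define PCI_D0			0
#define PCI_D3hot		3
#define PCI_COMMAND_MEMORY	0x2

/* Alex's proposed __vfio_pci_memory_enabled() logic as a pure
 * predicate over the inputs it would read from vdev/pdev. */
static bool memory_enabled(bool platform_pm_engaged, int current_state,
			   bool no_command_memory, uint16_t cmd)
{
	return !platform_pm_engaged && current_state < PCI_D3hot &&
	       (no_command_memory || (cmd & PCI_COMMAND_MEMORY));
}
```

 Any of user-initiated low power, a D3hot-or-deeper power state, or a
 cleared memory enable bit (for devices without no_command_memory) is
 then enough to fault BAR access.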
 
> 
>>  
>>  		/* D3 might be unsupported via quirk, skip unless in D3 */
>>  		if (needs_save && pdev->current_state >= PCI_D3hot) {
>> @@ -266,6 +273,25 @@ static int vfio_pci_core_runtime_suspend(struct device *dev)
>>  {
>>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>>  
>> +	down_read(&vdev->memory_lock);
>> +
>> +	/* 'platform_pm_engaged' will be false if there are no users. */
>> +	if (!vdev->platform_pm_engaged) {
>> +		up_read(&vdev->memory_lock);
>> +		return 0;
>> +	}
>> +
>> +	/*
>> +	 * The user will move the device into D3hot state first before invoking
>> +	 * power management ioctl. Move the device into D0 state here and then
>> +	 * the pci-driver core runtime PM suspend will move the device into
>> +	 * low power state. Also, for the devices which have NoSoftRst-,
>> +	 * it will help in restoring the original state (saved locally in
>> +	 * 'vdev->pm_save').
>> +	 */
>> +	vfio_pci_set_power_state(vdev, PCI_D0);
>> +	up_read(&vdev->memory_lock);
>> +
>>  	/*
>>  	 * If INTx is enabled, then mask INTx before going into runtime
>>  	 * suspended state and unmask the same in the runtime resume.
>> @@ -395,6 +421,19 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>  
>>  	/*
>>  	 * This function can be invoked while the power state is non-D0.
>> +	 * This non-D0 power state can be with or without runtime PM.
>> +	 * Increment the usage count corresponding to pm_runtime_put()
>> +	 * called during setting of 'platform_pm_engaged'. The device will
>> +	 * wake up if it has already went into suspended state. Otherwise,
>> +	 * the next vfio_pci_set_power_state() will change the
>> +	 * device power state to D0.
>> +	 */
>> +	if (vdev->platform_pm_engaged) {
>> +		pm_runtime_resume_and_get(&pdev->dev);
>> +		vdev->platform_pm_engaged = false;
>> +	}
>> +
>> +	/*
>>  	 * This function calls __pci_reset_function_locked() which internally
>>  	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
>>  	 * fail if the power state is non-D0. Also, for the devices which
>> @@ -1192,6 +1231,80 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>>  
>> +#ifdef CONFIG_PM
>> +static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
>> +				    void __user *arg, size_t argsz)
>> +{
>> +	struct vfio_pci_core_device *vdev =
>> +		container_of(device, struct vfio_pci_core_device, vdev);
>> +	struct pci_dev *pdev = vdev->pdev;
>> +	struct vfio_device_feature_power_management vfio_pm = { 0 };
>> +	int ret = 0;
>> +
>> +	ret = vfio_check_feature(flags, argsz,
>> +				 VFIO_DEVICE_FEATURE_SET |
>> +				 VFIO_DEVICE_FEATURE_GET,
>> +				 sizeof(vfio_pm));
>> +	if (ret != 1)
>> +		return ret;
>> +
>> +	if (flags & VFIO_DEVICE_FEATURE_GET) {
>> +		down_read(&vdev->memory_lock);
>> +		vfio_pm.low_power_state = vdev->platform_pm_engaged ?
>> +				VFIO_DEVICE_LOW_POWER_STATE_ENTER :
>> +				VFIO_DEVICE_LOW_POWER_STATE_EXIT;
>> +		up_read(&vdev->memory_lock);
>> +		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
>> +			return -EFAULT;
>> +		return 0;
>> +	}
>> +
>> +	if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm)))
>> +		return -EFAULT;
>> +
>> +	/*
>> +	 * The vdev power related fields are protected with memory_lock
>> +	 * semaphore.
>> +	 */
>> +	down_write(&vdev->memory_lock);
>> +	switch (vfio_pm.low_power_state) {
>> +	case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
>> +		if (!vdev->power_state_d3 || vdev->platform_pm_engaged) {
>> +			ret = -EINVAL;
>> +			break;
>> +		}
>> +
>> +		vdev->platform_pm_engaged = true;
>> +
>> +		/*
>> +		 * The pm_runtime_put() will be called again while returning
>> +		 * from ioctl after which the device can go into runtime
>> +		 * suspended.
>> +		 */
>> +		pm_runtime_put_noidle(&pdev->dev);
>> +		break;
>> +
>> +	case VFIO_DEVICE_LOW_POWER_STATE_EXIT:
>> +		if (!vdev->platform_pm_engaged) {
>> +			ret = -EINVAL;
>> +			break;
>> +		}
>> +
>> +		vdev->platform_pm_engaged = false;
>> +		vdev->power_state_d3 = false;
>> +		pm_runtime_get_noresume(&pdev->dev);
>> +		break;
>> +
>> +	default:
>> +		ret = -EINVAL;
>> +		break;
>> +	}
>> +
>> +	up_write(&vdev->memory_lock);
>> +	return ret;
>> +}
>> +#endif
>> +
>>  static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>>  				       void __user *arg, size_t argsz)
>>  {
>> @@ -1226,6 +1339,10 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
>> +#ifdef CONFIG_PM
>> +	case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT:
>> +		return vfio_pci_core_feature_pm(device, flags, arg, argsz);
>> +#endif
>>  	default:
>>  		return -ENOTTY;
>>  	}
>> @@ -2189,6 +2306,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>  		goto err_unlock;
>>  	}
>>  
>> +	/*
>> +	 * Some of the devices in the dev_set can be in the runtime suspended
>> +	 * state. Increment the usage count for all the devices in the dev_set
>> +	 * before reset and decrement the same after reset.
>> +	 */
>> +	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
>> +	if (ret)
>> +		goto err_unlock;
>> +
>>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>>  		/*
>>  		 * Test whether all the affected devices are contained by the
>> @@ -2244,6 +2370,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>  		else
>>  			mutex_unlock(&cur->vma_lock);
>>  	}
>> +
>> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
>> +		pm_runtime_put(&cur->pdev->dev);
>>  err_unlock:
>>  	mutex_unlock(&dev_set->lock);
>>  	return ret;
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index e84f31e44238..337983a877d6 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -126,6 +126,7 @@ struct vfio_pci_core_device {
>>  	bool			needs_pm_restore;
>>  	bool			power_state_d3;
>>  	bool			pm_intx_masked;
>> +	bool			platform_pm_engaged;
>>  	struct pci_saved_state	*pci_saved_state;
>>  	struct pci_saved_state	*pm_save;
>>  	int			ioeventfds_nr;
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index fea86061b44e..53ff890dbd27 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -986,6 +986,24 @@ enum vfio_device_mig_state {
>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>  };
>>  
>> +/*
>> + * Use platform-based power management for moving the device into low power
>> + * state.  This low power state is device specific.
>> + *
>> + * For PCI, this low power state is D3cold.  The native PCI power management
>> + * does not support the D3cold power state.  For moving the device into D3cold
>> + * state, change the PCI state to D3hot with standard configuration registers
>> + * and then call this IOCTL to set the D3cold state.  Similarly, if the
>> + * device is in the D3cold state, then call this IOCTL to exit from D3cold.
>> + */
>> +struct vfio_device_feature_power_management {
>> +#define VFIO_DEVICE_LOW_POWER_STATE_EXIT	0x0
>> +#define VFIO_DEVICE_LOW_POWER_STATE_ENTER	0x1
>> +	__u64	low_power_state;
>> +};
>> +
>> +#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3
> 
> __u8 seems more than sufficient here.  Thanks,
> 
> Alex
>

 I have used __u64 mainly to get this structure 64-bit aligned.
 I was under the impression that the ioctl structure should be 64-bit
 aligned, but in this case, since we will just have a __u8 member,
 alignment should not be required?
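
 The alignment question can be checked mechanically (plain C, using
 uintN_t stand-ins for the __uN types; the header stand-in below is an
 assumption mirroring the existing argsz/flags layout): the 8-byte
 feature header plus a __u64 payload keeps the payload naturally
 aligned, while a lone __u8 member carries no alignment requirement of
 its own.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Stand-in for the existing vfio_device_feature header (argsz + flags);
 * the real uapi struct uses __u32 fields. */
struct feature_hdr {
	uint32_t argsz;
	uint32_t flags;
};

/* The two candidate payload layouts under discussion. */
struct pm_u64 { uint64_t low_power_state; };
struct pm_u8  { uint8_t  low_power_state; };

_Static_assert(sizeof(struct feature_hdr) == 8,
	       "8-byte header keeps a following __u64 payload aligned");
```

 So a __u8 payload would indeed be well-formed on its own; the __u64
 choice just keeps the trailing data 8-byte aligned for free.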
 
 Regards,
 Abhishek

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-05-05 12:16     ` Abhishek Sahu
@ 2022-05-09 21:48       ` Alex Williamson
  2022-05-10 13:26         ` Abhishek Sahu
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2022-05-09 21:48 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Thu, 5 May 2022 17:46:20 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 5/5/2022 1:15 AM, Alex Williamson wrote:
> > On Mon, 25 Apr 2022 14:56:15 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> Currently, if the runtime power management is enabled for vfio pci
> >> based device in the guest OS, then guest OS will do the register
> >> write for PCI_PM_CTRL register. This write request will be handled in
> >> vfio_pm_config_write() where it will do the actual register write
> >> of PCI_PM_CTRL register. With this, the maximum D3hot state can be
> >> achieved for low power. If we can use the runtime PM framework,
> >> then we can achieve the D3cold state which will help in saving
> >> maximum power.
> >>
> >> 1. Since D3cold state can't be achieved by writing PCI standard
> >>    PM config registers, so this patch adds a new feature in the
> >>    existing VFIO_DEVICE_FEATURE IOCTL. This IOCTL can be used
> >>    to change the PCI device from D3hot to D3cold state and
> >>    then D3cold to D0 state. The device feature uses low power term
> >>    instead of D3cold so that if other vfio driver wants to implement
> >>    low power support, then the same IOCTL can be used.  
> > 
> > How does this enable you to handle the full-off vs memory-refresh modes
> > for NVIDIA GPUs?
> >   
>  
>  Thanks Alex.
> 
>  This patch series will just enable the full-off mode for NVIDIA GPUs.
>  The self-refresh mode won't work.
> 
>  The self-refresh case is NVIDIA-specific and needs driver
>  involvement each time before going into d3cold. We are evaluating
>  internally if we have enough use cases for self-refresh mode, and then
>  I will plan a separate patch series to support the self-refresh use
>  case, if required. But that will be independent of this patch series.
> 
>  At the high level, if we want to support the NVIDIA self-refresh use
>  case inside a VM, we need some way to disable PCI device access from
>  the host side, or to forward an event to the VM for every access on
>  the host side. Otherwise, from the driver side, we can disable
>  self-refresh mode if the driver is running inside a VM. In that case,
>  if memory usage is higher than the threshold, then we don't engage
>  RTD3 at all.

Disabling PCI access on the host seems impractical to me, but PM and
PCI folks are welcome to weigh in.

We've also discussed that the GPU memory could exceed RAM + swap for a
VM, leaving them with no practical means to make use of d3cold if we
don't support this capability.  Also, existing drivers expect to have
this capability and it's not uncommon for those in the gaming community
making use of GPU assignment to attempt to hide the fact that they're
running in a VM to avoid falsely triggering anti-cheat detection, DRM,
or working around certain GPU vendors who previously restricted use of
consumer GPUs in VMs.

That seems to suggest to me that our only option is along the lines of
notifying the VM when the device returns to D0 and by default only
re-entering d3cold under the direction of the VM.  We might also do some
sort of negotiation based on device vendor and class code where we
could enable the kernel to perform the transition back to d3cold.
There's a fair chance that an AMD GPU might have similar requirements,
do we know if they do?

I'd suggest perhaps splitting this patch series so that we can start
taking advantage of using d3cold for idle devices while we figure out
how to make use of VM directed d3cold without creating scenarios that
don't break existing drivers.
 
> > The feature ioctl supports a probe, but here the probe only indicates
> > that the ioctl is available, not what degree of low power support
> > available.  Even if the host doesn't support d3cold for the device, we
> > can still achieve root port d3hot, but can we provide further
> > capability info to the user?
> >  
> 
>  I wanted to add more information here but was not sure which
>  information would be helpful for the user. There is no certain way to
>  predict that runtime suspend will use the D3cold state, even
>  on supported systems. The user can disable runtime power management from
> 
>  /sys/bus/pci/devices/…/power/control
> 
>  or disable d3cold itself via
> 
>  /sys/bus/pci/devices/…/d3cold_allowed
> 
>  Even if all of these are allowed, platform_pci_choose_state()
>  is the main function where the target low power state is selected
>  at runtime.
> 
>  Probably we can expose the pci_pr3_present() status to the user, which
>  gives a hint that the required ACPI methods for D3cold are present in
>  the platform.

I expected that might be the answer.  The proposed interface name also
avoids tying us directly to an ACPI implementation, so I imagine there
could be a variety of backends supporting runtime power management in
the host kernel.

In the VM I think the ACPI controls are at the root port, so we
probably need to add power control to each root port regardless of what
happens to be plugged into it at the time.  Maybe that means we can't
really take advantage of knowing the degree of device support, we just
need to wire it up as if it works regardless.

We might also want to consider parallels to device hotplug here.  For
example, if QEMU could know that a device does not retain state in
d3cold, it might choose to unplug the device backend so that the device
could be used elsewhere in the interim, or simply use the idle device
handling for d3cold in vfio-pci.  That opens up a lot of questions
regarding SLA contracts with management tools to be able to replace the
device with a fungible substitute on demand, but I can imagine data
center logistics might rather have that problem than VMs sitting on
powered-off devices.

> >> 2. The hypervisors can implement virtual ACPI methods. For
> >>    example, in guest linux OS if PCI device ACPI node has _PR3 and _PR0
> >>    power resources with _ON/_OFF method, then guest linux OS makes the
> >>    _OFF call during D3cold transition and then _ON during D0 transition.
> >>    The hypervisor can tap these virtual ACPI calls and then do the D3cold
> >>    related IOCTL in the vfio driver.
> >>
> >> 3. The vfio driver uses runtime PM framework to achieve the
> >>    D3cold state. For the D3cold transition, decrement the usage count and
> >>    for the D0 transition, increment the usage count.
> >>
> >> 4. For D3cold, the device current power state should be D3hot.
> >>    Then during runtime suspend, the pci_platform_power_transition() is
> >>    required for D3cold state. If the D3cold state is not supported, then
> >>    the device will still be in D3hot state. But with the runtime PM, the
> >>    root port can now also go into suspended state.  
> > 
> > Why do we create this requirement for the device to be in d3hot prior
> > to entering low power   
> 
>  This is mainly to make the integration in the hypervisor follow
>  the existing PCI power management code flow.
> 
>  If we look at the power management steps, then the following 2 steps
>  are involved:
> 
>  1. First move the device from D0 to D3hot state by writing
>     into config register.
>  2. Then invoke ACPI routines (mainly _PR3 OFF method) to
>     move from D3hot to D3cold.
> 
>  So, in the guest side, we can follow the same steps. The guest can
>  do the config register write and then for step 2, the hypervisor
>  can implement the virtual ACPI with _PR3/_PR0 power resources.
>  Inside this virtual ACPI implementation, the hypervisor can invoke
>  the power management IOCTL.
> 
>  Also, if runtime PM has been disabled from the host side,
>  then the device will remain in the D3hot state.

That's true regardless of us making it a requirement.  I don't see what
it buys us to make this a requirement though.  If I trigger the _PR3
method on bare metal, does ACPI care if the device is in D3hot first?
At best that seems dependent on the ACPI implementation.
 
> > when our pm ops suspend function wakes the device to d0?  
> 
>  The change to D0 here happens for 2 reasons:
> 
>  1. First, to preserve the device state for devices with NoSoftRst-.
>  2. To make use of the generic PCI core layer code for runtime suspend;
>     otherwise, we would need to do all of the handling here that is
>     present in pci_pm_runtime_suspend().

What problem do we cause if we allow the user to trigger this ioctl
from D0?  The restriction follows the expected use case, but otherwise
imposing the restriction is arbitrary.

 
> >> 5. For most of the systems, the D3cold is supported at the root
> >>    port level. So, when root port will transition to D3cold state, then
> >>    the vfio PCI device will go from D3hot to D3cold state during its
> >>    runtime suspend. If the root port does not support D3cold, then the
> >>    root port will go into D3hot state.
> >>
> >> 6. The runtime suspend callback can now happen for 2 cases: there
> >>    are no users of vfio device and the case where user has initiated
> >>    D3cold. The 'platform_pm_engaged' flag can help to distinguish
> >>    between these 2 cases.  
> > 
> > If this were the only use case we could rely on vfio_device.open_count
> > instead.  I don't think it is though.    
> 
>  platform_pm_engaged is mainly to track the user-initiated
>  low power entry with the IOCTL. So even if we use vfio_device.open_count
>  here, we will still require platform_pm_engaged.
> 
> >> 7. In D3cold, all kind of BAR related access needs to be disabled
> >>    like D3hot. Additionally, the config space will also be disabled in
> >>    D3cold state. To prevent access of config space in D3cold state, do
> >>    increment the runtime PM usage count before doing any config space
> >>    access.  
> > 
> > Or we could actually prevent access to config space rather than waking
> > the device for the access.  Addressed in further comment below.
> >    
> >> 8. If user has engaged low power entry through IOCTL, then user should
> >>    do low power exit first. The user can issue config access or IOCTL
> >>    after low power entry. We can add an explicit error check but since
> >>    we are already waking-up device, so IOCTL and config access can be
> >>    fulfilled. But 'power_state_d3' won't be cleared without issuing
> >>    low power exit so all BAR related access will still return error till
> >>    user do low power exit.  
> > 
> > The fact that power_state_d3 no longer tracks the device power state
> > when platform_pm_engaged is set is a confusing discontinuity.
> >   
> 
>  If we refer to the power management steps (as mentioned above),
>  then these 2 variables track different things.
> 
>  1. power_state_d3 tracks the config space write.
>  2. platform_pm_engaged tracks the IOCTL call. In the IOCTL, we decrement
>     the runtime usage count, so we need to track that we have decremented
>     it.
> 
> >> 9. Since multiple layers are involved, so following is the high level
> >>    code flow for D3cold entry and exit.
> >>
> >> D3cold entry:
> >>
> >> a. User put the PCI device into D3hot by writing into standard config
> >>    register (vfio_pm_config_write() -> vfio_lock_and_set_power_state() ->
> >>    vfio_pci_set_power_state()). The device power state will be D3hot and
> >>    power_state_d3 will be true.
> >> b. Set vfio_device_feature_power_management::low_power_state =
> >>    VFIO_DEVICE_LOW_POWER_STATE_ENTER and call VFIO_DEVICE_FEATURE IOCTL.
> >> c. Inside vfio_device_fops_unl_ioctl(), pm_runtime_resume_and_get()
> >>    will be called first which will make the usage count as 2 and then
> >>    vfio_pci_core_ioctl_feature() will be invoked.
> >> d. vfio_pci_core_feature_pm() will be called and it will go inside
> >>    VFIO_DEVICE_LOW_POWER_STATE_ENTER switch case. platform_pm_engaged will
> >>    be true and pm_runtime_put_noidle() will decrement the usage count
> >>    to 1.
> >> e. Inside vfio_device_fops_unl_ioctl() while returning the
> >>    pm_runtime_put() will make the usage count to 0 and the runtime PM
> >>    framework will engage the runtime suspend entry.
> >> f. pci_pm_runtime_suspend() will be called and invokes driver runtime
> >>    suspend callback.
> >> g. vfio_pci_core_runtime_suspend() will change the power state to D0
> >>    and do the INTx mask related handling.
> >> h. pci_pm_runtime_suspend() will take care of saving the PCI state and
> >>    all power management handling for D3cold.
> >>
> >> D3cold exit:
> >>
> >> a. Set vfio_device_feature_power_management::low_power_state =
> >>    VFIO_DEVICE_LOW_POWER_STATE_EXIT and call VFIO_DEVICE_FEATURE IOCTL.
> >> b. Inside vfio_device_fops_unl_ioctl(), pm_runtime_resume_and_get()
> >>    will be called first which will make the usage count as 1.
> >> c. pci_pm_runtime_resume() will take care of moving the device into D0
> >>    state again and then vfio_pci_core_runtime_resume() will be called.
> >> d. vfio_pci_core_runtime_resume() will do the INTx unmask related
> >>    handling.
> >> e. vfio_pci_core_ioctl_feature() will be invoked.
> >> f. vfio_pci_core_feature_pm() will be called and it will go inside
> >>    VFIO_DEVICE_LOW_POWER_STATE_EXIT switch case. platform_pm_engaged and
> >>    power_state_d3 will be cleared and pm_runtime_get_noresume() will make
> >>    the usage count as 2.
> >> g. Inside vfio_device_fops_unl_ioctl() while returning the
> >>    pm_runtime_put() will make the usage count to 1 and the device will
> >>    be in D0 state only.
> >>
> >> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >> ---
> >>  drivers/vfio/pci/vfio_pci_config.c |  11 ++-
> >>  drivers/vfio/pci/vfio_pci_core.c   | 131 ++++++++++++++++++++++++++++-
> >>  include/linux/vfio_pci_core.h      |   1 +
> >>  include/uapi/linux/vfio.h          |  18 ++++
> >>  4 files changed, 159 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> >> index af0ae80ef324..65b1bc9586ab 100644
> >> --- a/drivers/vfio/pci/vfio_pci_config.c
> >> +++ b/drivers/vfio/pci/vfio_pci_config.c
> >> @@ -25,6 +25,7 @@
> >>  #include <linux/uaccess.h>
> >>  #include <linux/vfio.h>
> >>  #include <linux/slab.h>
> >> +#include <linux/pm_runtime.h>
> >>  
> >>  #include <linux/vfio_pci_core.h>
> >>  
> >> @@ -1936,16 +1937,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
> >>  ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
> >>  			   size_t count, loff_t *ppos, bool iswrite)
> >>  {
> >> +	struct device *dev = &vdev->pdev->dev;
> >>  	size_t done = 0;
> >>  	int ret = 0;
> >>  	loff_t pos = *ppos;
> >>  
> >>  	pos &= VFIO_PCI_OFFSET_MASK;
> >>  
> >> +	ret = pm_runtime_resume_and_get(dev);
> >> +	if (ret < 0)
> >> +		return ret;  
> > 
> > Alternatively we could just check platform_pm_engaged here and return
> > -EINVAL, right?  Why is waking the device the better option?
> >   
> 
>  This is mainly to prevent a race condition where a config space access
>  happens in parallel with the IOCTL. So, let's consider the following case:
> 
>  1. Config space access happens and vfio_pci_config_rw() will be called.
>  2. The IOCTL to move into low power state is called.
>  3. The IOCTL will move the device into d3cold.
>  4. Exit from vfio_pci_config_rw() happened.
> 
>  Now, if we just check platform_pm_engaged, then the above
>  sequence won't be handled. I checked this parallel access by writing
>  a small program where I opened 2 instances and then
>  created 2 threads, one for config space access and one for the IOCTL.
>  In my case, I got the above sequence.
> 
>  The pm_runtime_resume_and_get() call makes sure that the device
>  usage count stays incremented throughout the config space
>  access (or the IOCTL access in the previous patch), so the
>  runtime PM framework will not move the device into the
>  suspended state.

I think we're inventing problems here.  If we define that config space
is not accessible while the device is in low power and the only way to
get the device out of low power is via ioctl, then we should be denying
access to the device while in low power.  If the user races exiting the
device from low power and a config space access, that's their problem.

> >> +
> >>  	while (count) {
> >>  		ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
> >> -		if (ret < 0)
> >> +		if (ret < 0) {
> >> +			pm_runtime_put(dev);
> >>  			return ret;
> >> +		}
> >>  
> >>  		count -= ret;
> >>  		done += ret;
> >> @@ -1953,6 +1961,7 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
> >>  		pos += ret;
> >>  	}
> >>  
> >> +	pm_runtime_put(dev);
> >>  	*ppos += done;
> >>  
> >>  	return done;
> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >> index 05a68ca9d9e7..beac6e05f97f 100644
> >> --- a/drivers/vfio/pci/vfio_pci_core.c
> >> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >> @@ -234,7 +234,14 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
> >>  	ret = pci_set_power_state(pdev, state);
> >>  
> >>  	if (!ret) {
> >> -		vdev->power_state_d3 = (pdev->current_state >= PCI_D3hot);
> >> +		/*
> >> +		 * If 'platform_pm_engaged' is true then 'power_state_d3' can
> >> +		 * be cleared only when user makes the explicit request to
> >> +		 * move out of low power state by using power management ioctl.
> >> +		 */
> >> +		if (!vdev->platform_pm_engaged)
> >> +			vdev->power_state_d3 =
> >> +				(pdev->current_state >= PCI_D3hot);  
> > 
> > power_state_d3 is essentially only used as a secondary test to
> > __vfio_pci_memory_enabled() to block r/w access to device regions and
> > generate a fault on mmap access.  Its existence already seems a little
> > questionable when we could just look at vdev->pdev->current_state, and
> > we could incorporate that into __vfio_pci_memory_enabled().  So rather
> > than creating this inconsistency, couldn't we just make that function
> > return:
> > 
> > !vdev->platform_pm_enagaged && pdev->current_state < PCI_D3hot &&
> > (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY))
> >   
> 
>  The main reason for power_state_d3 is to get it under the
>  memory_lock semaphore. But pdev->current_state is not
>  protected by any lock. So, will using pdev->current_state
>  here be safe?

If we're only testing and modifying pdev->current_state under
memory_lock, isn't it equivalent?
 
> >>  
> >>  		/* D3 might be unsupported via quirk, skip unless in D3 */
> >>  		if (needs_save && pdev->current_state >= PCI_D3hot) {
> >> @@ -266,6 +273,25 @@ static int vfio_pci_core_runtime_suspend(struct device *dev)
> >>  {
> >>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> >>  
> >> +	down_read(&vdev->memory_lock);
> >> +
> >> +	/* 'platform_pm_engaged' will be false if there are no users. */
> >> +	if (!vdev->platform_pm_engaged) {
> >> +		up_read(&vdev->memory_lock);
> >> +		return 0;
> >> +	}
> >> +
> >> +	/*
> >> +	 * The user will move the device into D3hot state first before invoking
> >> +	 * power management ioctl. Move the device into D0 state here and then
> >> +	 * the pci-driver core runtime PM suspend will move the device into
> >> +	 * low power state. Also, for the devices which have NoSoftRst-,
> >> +	 * it will help in restoring the original state (saved locally in
> >> +	 * 'vdev->pm_save').
> >> +	 */
> >> +	vfio_pci_set_power_state(vdev, PCI_D0);
> >> +	up_read(&vdev->memory_lock);
> >> +
> >>  	/*
> >>  	 * If INTx is enabled, then mask INTx before going into runtime
> >>  	 * suspended state and unmask the same in the runtime resume.
> >> @@ -395,6 +421,19 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
> >>  
> >>  	/*
> >>  	 * This function can be invoked while the power state is non-D0.
> >> +	 * This non-D0 power state can be with or without runtime PM.
> >> +	 * Increment the usage count corresponding to pm_runtime_put()
> >> +	 * called during setting of 'platform_pm_engaged'. The device will
> >> +	 * wake up if it has already gone into the suspended state. Otherwise,
> >> +	 * the next vfio_pci_set_power_state() will change the
> >> +	 * device power state to D0.
> >> +	 */
> >> +	if (vdev->platform_pm_engaged) {
> >> +		pm_runtime_resume_and_get(&pdev->dev);
> >> +		vdev->platform_pm_engaged = false;
> >> +	}
> >> +
> >> +	/*
> >>  	 * This function calls __pci_reset_function_locked() which internally
> >>  	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
> >>  	 * fail if the power state is non-D0. Also, for the devices which
> >> @@ -1192,6 +1231,80 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
> >>  }
> >>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
> >>  
> >> +#ifdef CONFIG_PM
> >> +static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
> >> +				    void __user *arg, size_t argsz)
> >> +{
> >> +	struct vfio_pci_core_device *vdev =
> >> +		container_of(device, struct vfio_pci_core_device, vdev);
> >> +	struct pci_dev *pdev = vdev->pdev;
> >> +	struct vfio_device_feature_power_management vfio_pm = { 0 };
> >> +	int ret = 0;
> >> +
> >> +	ret = vfio_check_feature(flags, argsz,
> >> +				 VFIO_DEVICE_FEATURE_SET |
> >> +				 VFIO_DEVICE_FEATURE_GET,
> >> +				 sizeof(vfio_pm));
> >> +	if (ret != 1)
> >> +		return ret;
> >> +
> >> +	if (flags & VFIO_DEVICE_FEATURE_GET) {
> >> +		down_read(&vdev->memory_lock);
> >> +		vfio_pm.low_power_state = vdev->platform_pm_engaged ?
> >> +				VFIO_DEVICE_LOW_POWER_STATE_ENTER :
> >> +				VFIO_DEVICE_LOW_POWER_STATE_EXIT;
> >> +		up_read(&vdev->memory_lock);
> >> +		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
> >> +			return -EFAULT;
> >> +		return 0;
> >> +	}
> >> +
> >> +	if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm)))
> >> +		return -EFAULT;
> >> +
> >> +	/*
> >> +	 * The vdev power related fields are protected with memory_lock
> >> +	 * semaphore.
> >> +	 */
> >> +	down_write(&vdev->memory_lock);
> >> +	switch (vfio_pm.low_power_state) {
> >> +	case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
> >> +		if (!vdev->power_state_d3 || vdev->platform_pm_engaged) {
> >> +			ret = -EINVAL;
> >> +			break;
> >> +		}
> >> +
> >> +		vdev->platform_pm_engaged = true;
> >> +
> >> +		/*
> >> +		 * The pm_runtime_put() will be called again while returning
> >> +		 * from ioctl after which the device can go into runtime
> >> +		 * suspended.
> >> +		 */
> >> +		pm_runtime_put_noidle(&pdev->dev);
> >> +		break;
> >> +
> >> +	case VFIO_DEVICE_LOW_POWER_STATE_EXIT:
> >> +		if (!vdev->platform_pm_engaged) {
> >> +			ret = -EINVAL;
> >> +			break;
> >> +		}
> >> +
> >> +		vdev->platform_pm_engaged = false;
> >> +		vdev->power_state_d3 = false;
> >> +		pm_runtime_get_noresume(&pdev->dev);
> >> +		break;
> >> +
> >> +	default:
> >> +		ret = -EINVAL;
> >> +		break;
> >> +	}
> >> +
> >> +	up_write(&vdev->memory_lock);
> >> +	return ret;
> >> +}
> >> +#endif
> >> +
> >>  static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
> >>  				       void __user *arg, size_t argsz)
> >>  {
> >> @@ -1226,6 +1339,10 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
> >>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
> >>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
> >>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> >> +#ifdef CONFIG_PM
> >> +	case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT:
> >> +		return vfio_pci_core_feature_pm(device, flags, arg, argsz);
> >> +#endif
> >>  	default:
> >>  		return -ENOTTY;
> >>  	}
> >> @@ -2189,6 +2306,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> >>  		goto err_unlock;
> >>  	}
> >>  
> >> +	/*
> >> +	 * Some of the devices in the dev_set can be in the runtime suspended
> >> +	 * state. Increment the usage count for all the devices in the dev_set
> >> +	 * before reset and decrement the same after reset.
> >> +	 */
> >> +	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
> >> +	if (ret)
> >> +		goto err_unlock;
> >> +
> >>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
> >>  		/*
> >>  		 * Test whether all the affected devices are contained by the
> >> @@ -2244,6 +2370,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
> >>  		else
> >>  			mutex_unlock(&cur->vma_lock);
> >>  	}
> >> +
> >> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> >> +		pm_runtime_put(&cur->pdev->dev);
> >>  err_unlock:
> >>  	mutex_unlock(&dev_set->lock);
> >>  	return ret;
> >> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> >> index e84f31e44238..337983a877d6 100644
> >> --- a/include/linux/vfio_pci_core.h
> >> +++ b/include/linux/vfio_pci_core.h
> >> @@ -126,6 +126,7 @@ struct vfio_pci_core_device {
> >>  	bool			needs_pm_restore;
> >>  	bool			power_state_d3;
> >>  	bool			pm_intx_masked;
> >> +	bool			platform_pm_engaged;
> >>  	struct pci_saved_state	*pci_saved_state;
> >>  	struct pci_saved_state	*pm_save;
> >>  	int			ioeventfds_nr;
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index fea86061b44e..53ff890dbd27 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -986,6 +986,24 @@ enum vfio_device_mig_state {
> >>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> >>  };
> >>  
> >> +/*
> >> + * Use platform-based power management for moving the device into low power
> >> + * state.  This low power state is device specific.
> >> + *
> >> + * For PCI, this low power state is D3cold.  The native PCI power management
> >> + * does not support the D3cold power state.  For moving the device into D3cold
> >> + * state, change the PCI state to D3hot with standard configuration registers
> >> + * and then call this IOCTL to set the D3cold state.  Similarly, if the
> >> + * device is in the D3cold state, then call this IOCTL to exit from it.
> >> + */
> >> +struct vfio_device_feature_power_management {
> >> +#define VFIO_DEVICE_LOW_POWER_STATE_EXIT	0x0
> >> +#define VFIO_DEVICE_LOW_POWER_STATE_ENTER	0x1
> >> +	__u64	low_power_state;
> >> +};
> >> +
> >> +#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3  
> > 
> > __u8 seems more than sufficient here.  Thanks,
> > 
> > Alex
> >  
> 
>  I used __u64 mainly to get this structure 64-bit aligned.
>  I was under the impression that the ioctl structure should be
>  64-bit aligned, but in this case, since we will have just a
>  __u8 member, the alignment should not be required?

We can add a directive to enforce an alignment regardless of the field
size.  I believe the feature ioctl header is already going to be eight
byte aligned, so it's probably not strictly necessary, but Jason seems
to be adding more of these directives elsewhere, so probably a good
idea regardless.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 6/8] vfio: Invoke runtime PM API for IOCTL request
  2022-05-05  9:40     ` Abhishek Sahu
@ 2022-05-09 22:30       ` Alex Williamson
  0 siblings, 0 replies; 41+ messages in thread
From: Alex Williamson @ 2022-05-09 22:30 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Thu, 5 May 2022 15:10:43 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 5/5/2022 1:12 AM, Alex Williamson wrote:
> > On Mon, 25 Apr 2022 14:56:13 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> The vfio/pci driver will have runtime power management support where the
> >> user can put the device into a low power state and then PCI devices can
> >> go into the D3cold state. If the device is in a low power state and the
> >> user issues any IOCTL, then the device should be moved out of the low
> >> power state first. Once the IOCTL is serviced, it can go into the low
> >> power state again. The runtime PM framework manages this with the help
> >> of a usage count. One option was to add the runtime PM related API's
> >> inside the vfio/pci driver, but some IOCTLs (like VFIO_DEVICE_FEATURE)
> >> can follow a different path and more IOCTLs can be added in the future.
> >> Also, runtime PM will currently be added for the vfio/pci based driver
> >> variants, but the other vfio based drivers can use the same in the
> >> future. So, this patch adds the runtime PM related API calls in the
> >> top-level IOCTL function itself.
> >>
> >> For the vfio drivers which do not currently have runtime power management
> >> support, the runtime PM API's won't be invoked. Currently, only for
> >> vfio/pci based drivers will the runtime PM API's be invoked to increment
> >> and decrement the usage count. Keeping this usage count incremented while
> >> servicing an IOCTL makes sure that the user won't put the device into a
> >> low power state while any other IOCTL is being serviced in parallel.
> >>
> >> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >> ---
> >>  drivers/vfio/vfio.c | 44 +++++++++++++++++++++++++++++++++++++++++---
> >>  1 file changed, 41 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> >> index a4555014bd1e..4e65a127744e 100644
> >> --- a/drivers/vfio/vfio.c
> >> +++ b/drivers/vfio/vfio.c
> >> @@ -32,6 +32,7 @@
> >>  #include <linux/vfio.h>
> >>  #include <linux/wait.h>
> >>  #include <linux/sched/signal.h>
> >> +#include <linux/pm_runtime.h>
> >>  #include "vfio.h"
> >>  
> >>  #define DRIVER_VERSION	"0.3"
> >> @@ -1536,6 +1537,30 @@ static const struct file_operations vfio_group_fops = {
> >>  	.release	= vfio_group_fops_release,
> >>  };
> >>  
> >> +/*
> >> + * Wrapper around pm_runtime_resume_and_get().
> >> + * Return 0, if driver power management callbacks are not present i.e. the driver is not  
> > 
> > Mind the gratuitous long comment line here.
> >   
>  
>  Thanks Alex.
>  
>  That was a miss. I will fix this.
>  
> >> + * using runtime power management.
> >> + * Return 1 upon success, otherwise -errno  
> > 
> > Changing semantics vs the thing we're wrapping, why not provide a
> > wrapper for the `put` as well to avoid?  The only cases where we return
> > zero are just as easy to detect on the other side.
> >   
> 
>  Yes. Using a wrapper function for put is the better option.
>  I will make the changes.
> 
> >> + */
> >> +static inline int vfio_device_pm_runtime_get(struct device *dev)  
> > 
> > Given some of Jason's recent series, this should probably just accept a
> > vfio_device.
> >   
> 
>  Sorry. I didn't get this part.
> 
>  Do I need to change it to
> 
>  static inline int vfio_device_pm_runtime_get(struct vfio_device *device)
>  {
>     struct device *dev = device->dev;
>     ...
>  }

Yes.

> >> +{
> >> +#ifdef CONFIG_PM
> >> +	int ret;
> >> +
> >> +	if (!dev->driver || !dev->driver->pm)
> >> +		return 0;

I'm also wondering how we could ever get here with dev->driver == NULL.
If that were actually possible, the above would at best be racy.  It
also really seems like there ought to be a better test than the
driver->pm pointer to check if runtime pm is enabled, but I haven't
spotted it yet.

> >> +
> >> +	ret = pm_runtime_resume_and_get(dev);
> >> +	if (ret < 0)
> >> +		return ret;
> >> +
> >> +	return 1;
> >> +#else
> >> +	return 0;
> >> +#endif
> >> +}
> >> +
> >>  /*
> >>   * VFIO Device fd
> >>   */
> >> @@ -1845,15 +1870,28 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
> >>  				       unsigned int cmd, unsigned long arg)
> >>  {
> >>  	struct vfio_device *device = filep->private_data;
> >> +	int pm_ret, ret = 0;
> >> +
> >> +	pm_ret = vfio_device_pm_runtime_get(device->dev);
> >> +	if (pm_ret < 0)
> >> +		return pm_ret;  
> > 
> > I wonder if we might simply want to mask pm errors behind -EIO, maybe
> > with a rate limited dev_info().  My concern would be that we might mask
> > errnos that userspace has come to expect for certain ioctls.  Thanks,
> > 
> > Alex
> >   
> 
>   I need to do something like the following. Correct?
> 
>   ret = vfio_device_pm_runtime_get(device);
>   if (ret < 0) {
>      dev_info_ratelimited(device->dev, "vfio: runtime resume failed %d\n", ret);
>      return -EIO;
>   }

Yeah, though I'd welcome other thoughts here.  I don't necessarily like
the idea of squashing the errno, but at the same time, if
pm_runtime_resume_and_get() returns -EINVAL on user ioctl, that's not
really describing an invalid parameter relative to the ioctl itself.
Thanks,

Alex


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-05-09 21:48       ` Alex Williamson
@ 2022-05-10 13:26         ` Abhishek Sahu
  2022-05-10 13:30           ` Jason Gunthorpe
  2022-05-30 11:15           ` Abhishek Sahu
  0 siblings, 2 replies; 41+ messages in thread
From: Abhishek Sahu @ 2022-05-10 13:26 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 5/10/2022 3:18 AM, Alex Williamson wrote:
> On Thu, 5 May 2022 17:46:20 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 5/5/2022 1:15 AM, Alex Williamson wrote:
>>> On Mon, 25 Apr 2022 14:56:15 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> Currently, if runtime power management is enabled for a vfio-pci
>>>> based device in the guest OS, then the guest OS will do a register
>>>> write to the PCI_PM_CTRL register. This write request will be handled
>>>> in vfio_pm_config_write(), where it will do the actual register write
>>>> of the PCI_PM_CTRL register. With this, at most the D3hot state can be
>>>> achieved for low power. If we can use the runtime PM framework,
>>>> then we can achieve the D3cold state, which will help in saving
>>>> maximum power.
>>>>
>>>> 1. Since D3cold state can't be achieved by writing PCI standard
>>>>    PM config registers, so this patch adds a new feature in the
>>>>    existing VFIO_DEVICE_FEATURE IOCTL. This IOCTL can be used
>>>>    to change the PCI device from D3hot to D3cold state and
>>>>    then D3cold to D0 state. The device feature uses low power term
>>>>    instead of D3cold so that if other vfio driver wants to implement
>>>>    low power support, then the same IOCTL can be used.  
>>>
>>> How does this enable you to handle the full-off vs memory-refresh modes
>>> for NVIDIA GPUs?
>>>   
>>  
>>  Thanks Alex.
>>
>>  This patch series will just enable the full-off for nvidia GPU.
>>  The self-refresh mode won't work.
>>
>>  The self-refresh case is NVIDIA specific and needs driver
>>  involvement each time before going into D3cold. We are evaluating
>>  internally whether we have enough use cases for self-refresh mode,
>>  and then I will plan a separate patch series to support the
>>  self-refresh use case, if required. But that will be independent of
>>  this patch series.
>>
>>  At a high level, if we want to support the NVIDIA self-refresh use
>>  case inside a VM, we need some way to disable PCI device access from
>>  the host side, or to forward an event to the VM for every access on
>>  the host side. Otherwise, from the driver side, we can disable
>>  self-refresh mode if the driver is running inside a VM. In that case,
>>  if memory usage is higher than the threshold, then we don't engage RTD3 at all.
> 
> Disabling PCI access on the host seems impractical to me, but PM and
> PCI folks are welcome to weigh in.
> 
> We've also discussed that the GPU memory could exceed RAM + swap for a
> VM, leaving them with no practical means to make use of d3cold if we
> don't support this capability.  Also, existing drivers expect to have
> this capability and it's not uncommon for those in the gaming community
> making use of GPU assignment to attempt to hide the fact that they're
> running in a VM to avoid falsely triggering anti-cheat detection, DRM,
> or working around certain GPU vendors who previously restricted use of
> consumer GPUs in VMs.
> 
> That seems to suggest to me that our only option is along the lines of
> notifying the VM when the device returns to D0 and by default only
> re-entering d3cold under the direction of the VM.  We might also do some
> sort of negotiation based on device vendor and class code where we
> could enable the kernel to perform the transition back to d3cold.
> There's a fair chance that an AMD GPU might have similar requirements,
> do we know if they do?
> 

 Similar SW involvement before going into D3cold may be possible for
 other devices as well, although I am not sure about the current
 AMD GPU implementation. For NVIDIA GPUs, the firmware running on the
 GPU listens for PME_Turn_Off and then does the handling for self-refresh.
 For other devices too, if they have firmware involvement before
 D3cold entry, then a similar issue can arise there as well.

> I'd suggest perhaps splitting this patch series so that we can start
> taking advantage of using d3cold for idle devices while we figure out
> how to make use of VM directed d3cold without creating scenarios that
> don't break existing drivers.
>  

 Sure. I can split this patch series and move the last 3
 patches into a separate series, along with the VM notification
 support for the wake-up triggered by the host.

>>> The feature ioctl supports a probe, but here the probe only indicates
>>> that the ioctl is available, not what degree of low power support
>>> available.  Even if the host doesn't support d3cold for the device, we
>>> can still achieve root port d3hot, but can we provide further
>>> capability info to the user?
>>>  
>>
>>  I wanted to add more information here but was not sure which
>>  information would be helpful for the user. There is no certain way to
>>  predict that runtime suspend will use the D3cold state, even
>>  on supported systems. The user can disable runtime power management from 
>>
>>  /sys/bus/pci/devices/…/power/control
>>
>>  Or disable d3cold itself 
>>
>>  /sys/bus/pci/devices/…/d3cold_allowed
>>
>>
>>  Even if all of these are allowed, platform_pci_choose_state()
>>  is the main function where the target low power state is selected
>>  at runtime.
>>
>>  Probably we can report the pci_pr3_present() status to the user,
>>  which gives a hint that the ACPI methods required for D3cold are
>>  present in the platform. 
> 
> I expected that might be the answer.  The proposed interface name also
> avoids tying us directly to an ACPI implementation, so I imagine there
> could be a variety of backends supporting runtime power management in
> the host kernel.
> 
> In the VM I think the ACPI controls are at the root port, so we
> probably need to add power control to each root port regardless of what
> happens to be plugged into it at the time.  Maybe that means we can't
> really take advantage of knowing the degree of device support, we just
> need to wire it up as if it works regardless.
> 

 In the host-side ACPI, the power resources will mostly be associated
 with the root port, but per the ACPI specification, the power resources
 can also be associated with the device itself. On the guest side,
 we need a virtual implementation, so it can be associated either
 with the virtual root port or with the device itself.

 But even with that, the host-level degree-of-support information
 won’t help much.

> We might also want to consider parallels to device hotplug here.  For
> example, if QEMU could know that a device does not retain state in
> d3cold, it might choose to unplug the device backend so that the device
> could be used elsewhere in the interim, or simply use the idle device
> handling for d3cold in vfio-pci.  That opens up a lot of questions
> regarding SLA contracts with management tools to be able to replace the
> device with a fungible substitute on demand, but I can imagine data
> center logistics might rather have that problem than VMs sitting on
> powered-off devices.
> 
>>>> 2. The hypervisors can implement virtual ACPI methods. For
>>>>    example, in guest linux OS if PCI device ACPI node has _PR3 and _PR0
>>>>    power resources with _ON/_OFF method, then guest linux OS makes the
>>>>    _OFF call during D3cold transition and then _ON during D0 transition.
>>>>    The hypervisor can tap these virtual ACPI calls and then do the D3cold
>>>>    related IOCTL in the vfio driver.
>>>>
>>>> 3. The vfio driver uses runtime PM framework to achieve the
>>>>    D3cold state. For the D3cold transition, decrement the usage count and
>>>>    for the D0 transition, increment the usage count.
>>>>
>>>> 4. For D3cold, the device current power state should be D3hot.
>>>>    Then during runtime suspend, the pci_platform_power_transition() is
>>>>    required for D3cold state. If the D3cold state is not supported, then
>>>>    the device will still be in D3hot state. But with the runtime PM, the
>>>>    root port can now also go into suspended state.  
>>>
>>> Why do we create this requirement for the device to be in d3hot prior
>>> to entering low power   
>>
>>  This is mainly to ease integration of the hypervisor with
>>  the PCI power management code flow.
>>
>>  If we look at the power management steps, the following 2 steps
>>  are involved:
>>
>>  1. First move the device from D0 to the D3hot state by writing
>>     into the config register.
>>  2. Then invoke ACPI routines (mainly the _PR3 OFF method) to
>>     move from D3hot to D3cold.
>>
>>  So, on the guest side, we can follow the same steps. The guest can
>>  do the config register write, and then for step 2, the hypervisor
>>  can implement virtual ACPI with _PR3/_PR0 power resources.
>>  Inside this virtual ACPI implementation, the hypervisor can invoke
>>  the power management IOCTL.
>>
>>  Also, if runtime PM has been disabled from the host side,
>>  then the device will still be in the D3hot state. 
> 
> That's true regardless of us making it a requirement.  I don't see what
> it buys us to make this a requirement though.  If I trigger the _PR3
> method on bare metal, does ACPI care if the device is in D3hot first?
> At best that seems dependent on the ACPI implementation.
>

 Yes. That depends upon the ACPI implementation. 

>>> when our pm ops suspend function wakes the device do d0?  
>>
>>  The change to D0 here happens for 2 reasons:
>>
>>  1. First, to preserve device state for NoSoftRst- devices.
>>  2. To make use of the PCI core layer's generic code for runtime
>>     suspend; otherwise we would need to do here all the handling
>>     that is present in pci_pm_runtime_suspend().
> 
> What problem do we cause if we allow the user to trigger this ioctl
> from D0?  The restriction follows the expected use case, but otherwise
> imposing the restriction is arbitrary.
> 

 It seems we can remove this restriction then. It should be fine
 if the user triggers this IOCTL from D0; the runtime power
 management will then take care of the device state itself.

>  
>>>> 5. For most of the systems, the D3cold is supported at the root
>>>>    port level. So, when root port will transition to D3cold state, then
>>>>    the vfio PCI device will go from D3hot to D3cold state during its
>>>>    runtime suspend. If root port does not support D3cold, then the root
>>>>    will go into D3hot state.
>>>>
>>>> 6. The runtime suspend callback can now happen for 2 cases: there
>>>>    are no users of vfio device and the case where user has initiated
>>>>    D3cold. The 'platform_pm_engaged' flag can help to distinguish
>>>>    between these 2 cases.  
>>>
>>> If this were the only use case we could rely on vfio_device.open_count
>>> instead.  I don't think it is though.    
>>
>>  platform_pm_engaged is mainly there to track user-initiated
>>  low power entry via the IOCTL. So even if we use vfio_device.open_count
>>  here, we will still require platform_pm_engaged.
>>
>>>> 7. In D3cold, all kind of BAR related access needs to be disabled
>>>>    like D3hot. Additionally, the config space will also be disabled in
>>>>    D3cold state. To prevent access of config space in D3cold state, do
>>>>    increment the runtime PM usage count before doing any config space
>>>>    access.  
>>>
>>> Or we could actually prevent access to config space rather than waking
>>> the device for the access.  Addressed in further comment below.
>>>    
>>>> 8. If user has engaged low power entry through IOCTL, then user should
>>>>    do low power exit first. The user can issue config access or IOCTL
>>>>    after low power entry. We can add an explicit error check but since
>>>>    we are already waking-up device, so IOCTL and config access can be
>>>>    fulfilled. But 'power_state_d3' won't be cleared without issuing
>>>>    low power exit so all BAR related access will still return error till
>>>>    user do low power exit.  
>>>
>>> The fact that power_state_d3 no longer tracks the device power state
>>> when platform_pm_engaged is set is a confusing discontinuity.
>>>   
>>
>>  If we refer to the power management steps (as mentioned above),
>>  then these 2 variables track different things.
>>
>>  1. power_state_d3 tracks the config space write.
>>  2. platform_pm_engaged tracks the IOCTL call. In the IOCTL, we decrement
>>     the runtime usage count, so we need to track that we have decremented
>>     it. 
>>
>>>> 9. Since multiple layers are involved, so following is the high level
>>>>    code flow for D3cold entry and exit.
>>>>
>>>> D3cold entry:
>>>>
>>>> a. User put the PCI device into D3hot by writing into standard config
>>>>    register (vfio_pm_config_write() -> vfio_lock_and_set_power_state() ->
>>>>    vfio_pci_set_power_state()). The device power state will be D3hot and
>>>>    power_state_d3 will be true.
>>>> b. Set vfio_device_feature_power_management::low_power_state =
>>>>    VFIO_DEVICE_LOW_POWER_STATE_ENTER and call VFIO_DEVICE_FEATURE IOCTL.
>>>> c. Inside vfio_device_fops_unl_ioctl(), pm_runtime_resume_and_get()
>>>>    will be called first which will make the usage count as 2 and then
>>>>    vfio_pci_core_ioctl_feature() will be invoked.
>>>> d. vfio_pci_core_feature_pm() will be called and it will go inside
>>>>    VFIO_DEVICE_LOW_POWER_STATE_ENTER switch case. platform_pm_engaged will
>>>>    be true and pm_runtime_put_noidle() will decrement the usage count
>>>>    to 1.
>>>> e. Inside vfio_device_fops_unl_ioctl() while returning the
>>>>    pm_runtime_put() will make the usage count to 0 and the runtime PM
>>>>    framework will engage the runtime suspend entry.
>>>> f. pci_pm_runtime_suspend() will be called and invokes driver runtime
>>>>    suspend callback.
>>>> g. vfio_pci_core_runtime_suspend() will change the power state to D0
>>>>    and do the INTx mask related handling.
>>>> h. pci_pm_runtime_suspend() will take care of saving the PCI state and
>>>>    all power management handling for D3cold.
>>>>
>>>> D3cold exit:
>>>>
>>>> a. Set vfio_device_feature_power_management::low_power_state =
>>>>    VFIO_DEVICE_LOW_POWER_STATE_EXIT and call VFIO_DEVICE_FEATURE IOCTL.
>>>> b. Inside vfio_device_fops_unl_ioctl(), pm_runtime_resume_and_get()
>>>>    will be called first which will make the usage count as 1.
>>>> c. pci_pm_runtime_resume() will take care of moving the device into D0
>>>>    state again and then vfio_pci_core_runtime_resume() will be called.
>>>> d. vfio_pci_core_runtime_resume() will do the INTx unmask related
>>>>    handling.
>>>> e. vfio_pci_core_ioctl_feature() will be invoked.
>>>> f. vfio_pci_core_feature_pm() will be called and it will go inside
>>>>    VFIO_DEVICE_LOW_POWER_STATE_EXIT switch case. platform_pm_engaged and
>>>>    power_state_d3 will be cleared and pm_runtime_get_noresume() will make
>>>>    the usage count as 2.
>>>> g. Inside vfio_device_fops_unl_ioctl() while returning the
>>>>    pm_runtime_put() will make the usage count to 1 and the device will
>>>>    be in D0 state only.
>>>>
>>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>>>> ---
>>>>  drivers/vfio/pci/vfio_pci_config.c |  11 ++-
>>>>  drivers/vfio/pci/vfio_pci_core.c   | 131 ++++++++++++++++++++++++++++-
>>>>  include/linux/vfio_pci_core.h      |   1 +
>>>>  include/uapi/linux/vfio.h          |  18 ++++
>>>>  4 files changed, 159 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>>>> index af0ae80ef324..65b1bc9586ab 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_config.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>>>> @@ -25,6 +25,7 @@
>>>>  #include <linux/uaccess.h>
>>>>  #include <linux/vfio.h>
>>>>  #include <linux/slab.h>
>>>> +#include <linux/pm_runtime.h>
>>>>  
>>>>  #include <linux/vfio_pci_core.h>
>>>>  
>>>> @@ -1936,16 +1937,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>>>>  ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>>>  			   size_t count, loff_t *ppos, bool iswrite)
>>>>  {
>>>> +	struct device *dev = &vdev->pdev->dev;
>>>>  	size_t done = 0;
>>>>  	int ret = 0;
>>>>  	loff_t pos = *ppos;
>>>>  
>>>>  	pos &= VFIO_PCI_OFFSET_MASK;
>>>>  
>>>> +	ret = pm_runtime_resume_and_get(dev);
>>>> +	if (ret < 0)
>>>> +		return ret;  
>>>
>>> Alternatively we could just check platform_pm_engaged here and return
>>> -EINVAL, right?  Why is waking the device the better option?
>>>   
>>
>>  This is mainly to prevent race condition where config space access
>>  happens parallelly with IOCTL access. So, lets consider the following case.
>>
>>  1. Config space access happens and vfio_pci_config_rw() will be called.
>>  2. The IOCTL to move into low power state is called.
>>  3. The IOCTL will move the device into d3cold.
>>  4. Exit from vfio_pci_config_rw() happened.
>>
>>  Now, if we just check platform_pm_engaged, then in the above
>>  sequence it won’t work. I checked this parallel access by writing
>>  a small program where I opened the 2 instances and then
>>  created 2 threads for config space and IOCTL.
>>  In my case, I got the above sequence.
>>
>>  The pm_runtime_resume_and_get() will make sure that device
>>  usage count keep incremented throughout the config space
>>  access (or IOCTL access in the previous patch) and the
>>  runtime PM framework will not move the device into suspended
>>  state.
> 
> I think we're inventing problems here.  If we define that config space
> is not accessible while the device is in low power and the only way to
> get the device out of low power is via ioctl, then we should be denying
> access to the device while in low power.  If the user races exiting the
> device from low power and a config space access, that's their problem.
> 

 But what about a malicious user who intentionally tries to create
 this sequence? If the platform_pm_engaged check passes and the
 user then puts the device into the low power state, a config read
 may still happen while the device is in the low power state.

 Can we prevent this concurrent access somehow or make sure
 that nothing else is running when the low power ioctl runs?
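
 The window can be illustrated with a small user-space model of the
 usage counting (plain C with stdatomic). The names access_begin() and
 access_end() are illustrative stand-ins for
 pm_runtime_resume_and_get() and pm_runtime_put(), not the kernel API:

```c
/* User-space model of why pinning a usage count across the whole
 * access closes the window: the device can only "suspend" when the
 * count drops to zero, so an in-flight access blocks suspend.
 * access_begin()/access_end() stand in for
 * pm_runtime_resume_and_get()/pm_runtime_put(); not the kernel API. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int usage_count = 1;      /* one reference held by the ioctl path */
static atomic_bool suspended = false;

static void access_begin(void)
{
	atomic_fetch_add(&usage_count, 1);  /* pin the device awake */
	atomic_store(&suspended, false);    /* resume if it was suspended */
}

static void access_end(void)
{
	/* last reference dropped: runtime suspend may now engage */
	if (atomic_fetch_sub(&usage_count, 1) == 1)
		atomic_store(&suspended, true);
}

static bool device_accessible(void)
{
	return !atomic_load(&suspended);
}
```

 A bare flag check, by contrast, gives the ioctl thread a window to
 suspend the device between the check and the actual access.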

>>>> +
>>>>  	while (count) {
>>>>  		ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
>>>> -		if (ret < 0)
>>>> +		if (ret < 0) {
>>>> +			pm_runtime_put(dev);
>>>>  			return ret;
>>>> +		}
>>>>  
>>>>  		count -= ret;
>>>>  		done += ret;
>>>> @@ -1953,6 +1961,7 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>>>  		pos += ret;
>>>>  	}
>>>>  
>>>> +	pm_runtime_put(dev);
>>>>  	*ppos += done;
>>>>  
>>>>  	return done;
>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>>>> index 05a68ca9d9e7..beac6e05f97f 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_core.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>>> @@ -234,7 +234,14 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>>>  	ret = pci_set_power_state(pdev, state);
>>>>  
>>>>  	if (!ret) {
>>>> -		vdev->power_state_d3 = (pdev->current_state >= PCI_D3hot);
>>>> +		/*
>>>> +		 * If 'platform_pm_engaged' is true then 'power_state_d3' can
>>>> +		 * be cleared only when user makes the explicit request to
>>>> +		 * move out of low power state by using power management ioctl.
>>>> +		 */
>>>> +		if (!vdev->platform_pm_engaged)
>>>> +			vdev->power_state_d3 =
>>>> +				(pdev->current_state >= PCI_D3hot);  
>>>
>>> power_state_d3 is essentially only used as a secondary test to
>>> __vfio_pci_memory_enabled() to block r/w access to device regions and
>>> generate a fault on mmap access.  Its existence already seems a little
>>> questionable when we could just look at vdev->pdev->current_state, and
>>> we could incorporate that into __vfio_pci_memory_enabled().  So rather
>>> than creating this inconsistency, couldn't we just make that function
>>> return:
>>>
>>> !vdev->platform_pm_enagaged && pdev->current_state < PCI_D3hot &&
>>> (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY))
>>>   
>>
>>  The main reason for power_state_d3 is to get it under
>>  memory_lock semaphore. But pdev->current_state is not
>>  protected with any lock. So, will use of pdev->current_state
>>  here be safe?
> 
> If we're only testing and modifying pdev->current_state under
> memory_lock, isn't it equivalent?
>

 pdev->current_state can be modified by PCI runtime PM core
 layer itself like when user invokes lspci, config dump command
 but in that case, platform_pm_enagaged should block this access.
 While for config space writes, the PM core layer code should not
 touch the pdev->current_state. So, yes we can use pdev->current_state.
 I will make this change and update the other patch in this series.

>>>>  
>>>>  		/* D3 might be unsupported via quirk, skip unless in D3 */
>>>>  		if (needs_save && pdev->current_state >= PCI_D3hot) {
>>>> @@ -266,6 +273,25 @@ static int vfio_pci_core_runtime_suspend(struct device *dev)
>>>>  {
>>>>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>>>>  
>>>> +	down_read(&vdev->memory_lock);
>>>> +
>>>> +	/* 'platform_pm_engaged' will be false if there are no users. */
>>>> +	if (!vdev->platform_pm_engaged) {
>>>> +		up_read(&vdev->memory_lock);
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * The user will move the device into D3hot state first before invoking
>>>> +	 * power management ioctl. Move the device into D0 state here and then
>>>> +	 * the pci-driver core runtime PM suspend will move the device into
>>>> +	 * low power state. Also, for the devices which have NoSoftRst-,
>>>> +	 * it will help in restoring the original state (saved locally in
>>>> +	 * 'vdev->pm_save').
>>>> +	 */
>>>> +	vfio_pci_set_power_state(vdev, PCI_D0);
>>>> +	up_read(&vdev->memory_lock);
>>>> +
>>>>  	/*
>>>>  	 * If INTx is enabled, then mask INTx before going into runtime
>>>>  	 * suspended state and unmask the same in the runtime resume.
>>>> @@ -395,6 +421,19 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>>>  
>>>>  	/*
>>>>  	 * This function can be invoked while the power state is non-D0.
>>>> +	 * This non-D0 power state can be with or without runtime PM.
>>>> +	 * Increment the usage count corresponding to pm_runtime_put()
>>>> +	 * called during setting of 'platform_pm_engaged'. The device will
>>>> +	 * wake up if it has already gone into the suspended state. Otherwise,
>>>> +	 * the next vfio_pci_set_power_state() will change the
>>>> +	 * device power state to D0.
>>>> +	 */
>>>> +	if (vdev->platform_pm_engaged) {
>>>> +		pm_runtime_resume_and_get(&pdev->dev);
>>>> +		vdev->platform_pm_engaged = false;
>>>> +	}
>>>> +
>>>> +	/*
>>>>  	 * This function calls __pci_reset_function_locked() which internally
>>>>  	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
>>>>  	 * fail if the power state is non-D0. Also, for the devices which
>>>> @@ -1192,6 +1231,80 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>>>>  
>>>> +#ifdef CONFIG_PM
>>>> +static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
>>>> +				    void __user *arg, size_t argsz)
>>>> +{
>>>> +	struct vfio_pci_core_device *vdev =
>>>> +		container_of(device, struct vfio_pci_core_device, vdev);
>>>> +	struct pci_dev *pdev = vdev->pdev;
>>>> +	struct vfio_device_feature_power_management vfio_pm = { 0 };
>>>> +	int ret = 0;
>>>> +
>>>> +	ret = vfio_check_feature(flags, argsz,
>>>> +				 VFIO_DEVICE_FEATURE_SET |
>>>> +				 VFIO_DEVICE_FEATURE_GET,
>>>> +				 sizeof(vfio_pm));
>>>> +	if (ret != 1)
>>>> +		return ret;
>>>> +
>>>> +	if (flags & VFIO_DEVICE_FEATURE_GET) {
>>>> +		down_read(&vdev->memory_lock);
>>>> +		vfio_pm.low_power_state = vdev->platform_pm_engaged ?
>>>> +				VFIO_DEVICE_LOW_POWER_STATE_ENTER :
>>>> +				VFIO_DEVICE_LOW_POWER_STATE_EXIT;
>>>> +		up_read(&vdev->memory_lock);
>>>> +		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
>>>> +			return -EFAULT;
>>>> +		return 0;
>>>> +	}
>>>> +
>>>> +	if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm)))
>>>> +		return -EFAULT;
>>>> +
>>>> +	/*
>>>> +	 * The vdev power related fields are protected with memory_lock
>>>> +	 * semaphore.
>>>> +	 */
>>>> +	down_write(&vdev->memory_lock);
>>>> +	switch (vfio_pm.low_power_state) {
>>>> +	case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
>>>> +		if (!vdev->power_state_d3 || vdev->platform_pm_engaged) {
>>>> +			ret = -EINVAL;
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		vdev->platform_pm_engaged = true;
>>>> +
>>>> +		/*
>>>> +		 * The pm_runtime_put() will be called again while returning
>>>> +		 * from ioctl after which the device can go into runtime
>>>> +		 * suspended.
>>>> +		 */
>>>> +		pm_runtime_put_noidle(&pdev->dev);
>>>> +		break;
>>>> +
>>>> +	case VFIO_DEVICE_LOW_POWER_STATE_EXIT:
>>>> +		if (!vdev->platform_pm_engaged) {
>>>> +			ret = -EINVAL;
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		vdev->platform_pm_engaged = false;
>>>> +		vdev->power_state_d3 = false;
>>>> +		pm_runtime_get_noresume(&pdev->dev);
>>>> +		break;
>>>> +
>>>> +	default:
>>>> +		ret = -EINVAL;
>>>> +		break;
>>>> +	}
>>>> +
>>>> +	up_write(&vdev->memory_lock);
>>>> +	return ret;
>>>> +}
>>>> +#endif
>>>> +
>>>>  static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>>>>  				       void __user *arg, size_t argsz)
>>>>  {
>>>> @@ -1226,6 +1339,10 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>>>>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>>>>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>>>>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
>>>> +#ifdef CONFIG_PM
>>>> +	case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT:
>>>> +		return vfio_pci_core_feature_pm(device, flags, arg, argsz);
>>>> +#endif
>>>>  	default:
>>>>  		return -ENOTTY;
>>>>  	}
>>>> @@ -2189,6 +2306,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>>>  		goto err_unlock;
>>>>  	}
>>>>  
>>>> +	/*
>>>> +	 * Some of the devices in the dev_set can be in the runtime suspended
>>>> +	 * state. Increment the usage count for all the devices in the dev_set
>>>> +	 * before reset and decrement the same after reset.
>>>> +	 */
>>>> +	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
>>>> +	if (ret)
>>>> +		goto err_unlock;
>>>> +
>>>>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>>>>  		/*
>>>>  		 * Test whether all the affected devices are contained by the
>>>> @@ -2244,6 +2370,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>>>  		else
>>>>  			mutex_unlock(&cur->vma_lock);
>>>>  	}
>>>> +
>>>> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
>>>> +		pm_runtime_put(&cur->pdev->dev);
>>>>  err_unlock:
>>>>  	mutex_unlock(&dev_set->lock);
>>>>  	return ret;
>>>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>>>> index e84f31e44238..337983a877d6 100644
>>>> --- a/include/linux/vfio_pci_core.h
>>>> +++ b/include/linux/vfio_pci_core.h
>>>> @@ -126,6 +126,7 @@ struct vfio_pci_core_device {
>>>>  	bool			needs_pm_restore;
>>>>  	bool			power_state_d3;
>>>>  	bool			pm_intx_masked;
>>>> +	bool			platform_pm_engaged;
>>>>  	struct pci_saved_state	*pci_saved_state;
>>>>  	struct pci_saved_state	*pm_save;
>>>>  	int			ioeventfds_nr;
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index fea86061b44e..53ff890dbd27 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -986,6 +986,24 @@ enum vfio_device_mig_state {
>>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>>>  };
>>>>  
>>>> +/*
>>>> + * Use platform-based power management for moving the device into low power
>>>> + * state.  This low power state is device specific.
>>>> + *
>>>> + * For PCI, this low power state is D3cold.  The native PCI power management
>>>> + * does not support the D3cold power state.  For moving the device into D3cold
>>>> + * state, change the PCI state to D3hot with standard configuration registers
>>>> + * and then call this IOCTL to set the D3cold state.  Similarly, if the
>>>> + * device is in the D3cold state, then call this IOCTL to exit from D3cold.
>>>> + */
>>>> +struct vfio_device_feature_power_management {
>>>> +#define VFIO_DEVICE_LOW_POWER_STATE_EXIT	0x0
>>>> +#define VFIO_DEVICE_LOW_POWER_STATE_ENTER	0x1
>>>> +	__u64	low_power_state;
>>>> +};
>>>> +
>>>> +#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3  
>>>
>>> __u8 seems more than sufficient here.  Thanks,
>>>
>>> Alex
>>>  
>>
>>  I have used __u64 mainly to get this structure 64 bit aligned.
>>  I was under the impression that the ioctl structure should be 64 bit
>>  aligned, but in this case, since we will just have a __u8 member,
>>  alignment should not be required?
> 
> We can add a directive to enforce an alignment regardless of the field
> size.  I believe the feature ioctl header is already going to be eight
> byte aligned, so it's probably not strictly necessary, but Jason seems
> to be adding more of these directives elsewhere, so probably a good
> idea regardless.  Thanks,
> 
> Alex
> 

So, should I change it like

__u8    low_power_state __attribute__((aligned(8)));

 Or

__aligned_u64 low_power_state

In the existing code, there are very few uses of the
first form.

Thanks,
Abhishek

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-05-10 13:26         ` Abhishek Sahu
@ 2022-05-10 13:30           ` Jason Gunthorpe
  2022-05-12 12:27             ` Abhishek Sahu
  2022-05-30 11:15           ` Abhishek Sahu
  1 sibling, 1 reply; 41+ messages in thread
From: Jason Gunthorpe @ 2022-05-10 13:30 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Alex Williamson, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, May 10, 2022 at 06:56:02PM +0530, Abhishek Sahu wrote:
> > We can add a directive to enforce an alignment regardless of the field
> > size.  I believe the feature ioctl header is already going to be eight
> > byte aligned, so it's probably not strictly necessary, but Jason seems
> > to be adding more of these directives elsewhere, so probably a good
> > idea regardless.  Thanks,

> So, should I change it like
> 
> __u8    low_power_state __attribute__((aligned(8)));
> 
>  Or
> 
> __aligned_u64 low_power_state

You should be explicit about padding, add a reserved to cover the gap.

Jason


* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-05-10 13:30           ` Jason Gunthorpe
@ 2022-05-12 12:27             ` Abhishek Sahu
  2022-05-12 12:47               ` Jason Gunthorpe
  0 siblings, 1 reply; 41+ messages in thread
From: Abhishek Sahu @ 2022-05-12 12:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 5/10/2022 7:00 PM, Jason Gunthorpe wrote:
> On Tue, May 10, 2022 at 06:56:02PM +0530, Abhishek Sahu wrote:
>>> We can add a directive to enforce an alignment regardless of the field
>>> size.  I believe the feature ioctl header is already going to be eight
>>> byte aligned, so it's probably not strictly necessary, but Jason seems
>>> to be adding more of these directives elsewhere, so probably a good
>>> idea regardless.  Thanks,
> 
>> So, should I change it like
>>
>> __u8    low_power_state __attribute__((aligned(8)));
>>
>>  Or
>>
>> __aligned_u64 low_power_state
> 
> You should be explicit about padding, add a reserved to cover the gap.
> 
> Jason


 Thanks Jason.

 So, I need to make it like the following. Correct?

 __u8 low_power_state;
 __u8 reserved[7];

 It seems the aligned attribute should then not be required.
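
 The explicitly padded layout can be sanity-checked at compile time.
 This is a user-space mirror of the proposed uAPI struct (uint8_t in
 place of __u8), just to demonstrate the point:

```c
/* User-space mirror of the proposed layout: one byte of payload plus
 * seven reserved bytes gives a fixed 8-byte structure with no
 * compiler-inserted padding, so no alignment attribute is needed. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct vfio_device_feature_power_management {
	uint8_t low_power_state;
	uint8_t reserved[7];
};

_Static_assert(sizeof(struct vfio_device_feature_power_management) == 8,
	       "layout must be exactly 8 bytes");
_Static_assert(offsetof(struct vfio_device_feature_power_management,
			reserved) == 1,
	       "reserved bytes must directly follow the payload byte");
```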

 Thanks,
 Abhishek


* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-05-12 12:27             ` Abhishek Sahu
@ 2022-05-12 12:47               ` Jason Gunthorpe
  0 siblings, 0 replies; 41+ messages in thread
From: Jason Gunthorpe @ 2022-05-12 12:47 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Alex Williamson, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Thu, May 12, 2022 at 05:57:05PM +0530, Abhishek Sahu wrote:
> On 5/10/2022 7:00 PM, Jason Gunthorpe wrote:
> > On Tue, May 10, 2022 at 06:56:02PM +0530, Abhishek Sahu wrote:
> >>> We can add a directive to enforce an alignment regardless of the field
> >>> size.  I believe the feature ioctl header is already going to be eight
> >>> byte aligned, so it's probably not strictly necessary, but Jason seems
> >>> to be adding more of these directives elsewhere, so probably a good
> >>> idea regardless.  Thanks,
> > 
> >> So, should I change it like
> >>
> >> __u8    low_power_state __attribute__((aligned(8)));
> >>
> >>  Or
> >>
> >> __aligned_u64 low_power_state
> > 
> > You should be explicit about padding, add a reserved to cover the gap.
> > 
> > Jason
> 
> 
>  Thanks Jason.
> 
>  So, I need to make it like following. Correct ?
> 
>  __u8 low_power_state;
>  __u8 reserved[7];
> 
>  It seems, then this aligned attribute should not be required.

Yes

Jason


* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-05-10 13:26         ` Abhishek Sahu
  2022-05-10 13:30           ` Jason Gunthorpe
@ 2022-05-30 11:15           ` Abhishek Sahu
  2022-05-30 12:25             ` Jason Gunthorpe
  1 sibling, 1 reply; 41+ messages in thread
From: Abhishek Sahu @ 2022-05-30 11:15 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 5/10/2022 6:56 PM, Abhishek Sahu wrote:
> On 5/10/2022 3:18 AM, Alex Williamson wrote:
>> On Thu, 5 May 2022 17:46:20 +0530
>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>
>>> On 5/5/2022 1:15 AM, Alex Williamson wrote:
>>>> On Mon, 25 Apr 2022 14:56:15 +0530
>>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>>

<snip>

>>>>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>>>>> index af0ae80ef324..65b1bc9586ab 100644
>>>>> --- a/drivers/vfio/pci/vfio_pci_config.c
>>>>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>>>>> @@ -25,6 +25,7 @@
>>>>>  #include <linux/uaccess.h>
>>>>>  #include <linux/vfio.h>
>>>>>  #include <linux/slab.h>
>>>>> +#include <linux/pm_runtime.h>
>>>>>  
>>>>>  #include <linux/vfio_pci_core.h>
>>>>>  
>>>>> @@ -1936,16 +1937,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>>>>>  ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>>>>  			   size_t count, loff_t *ppos, bool iswrite)
>>>>>  {
>>>>> +	struct device *dev = &vdev->pdev->dev;
>>>>>  	size_t done = 0;
>>>>>  	int ret = 0;
>>>>>  	loff_t pos = *ppos;
>>>>>  
>>>>>  	pos &= VFIO_PCI_OFFSET_MASK;
>>>>>  
>>>>> +	ret = pm_runtime_resume_and_get(dev);
>>>>> +	if (ret < 0)
>>>>> +		return ret;  
>>>>
>>>> Alternatively we could just check platform_pm_engaged here and return
>>>> -EINVAL, right?  Why is waking the device the better option?
>>>>   
>>>
>>>  This is mainly to prevent race condition where config space access
>>>  happens parallelly with IOCTL access. So, lets consider the following case.
>>>
>>>  1. Config space access happens and vfio_pci_config_rw() will be called.
>>>  2. The IOCTL to move into low power state is called.
>>>  3. The IOCTL will move the device into d3cold.
>>>  4. Exit from vfio_pci_config_rw() happened.
>>>
>>>  Now, if we just check platform_pm_engaged, then in the above
>>>  sequence it won’t work. I checked this parallel access by writing
>>>  a small program where I opened the 2 instances and then
>>>  created 2 threads for config space and IOCTL.
>>>  In my case, I got the above sequence.
>>>
>>>  The pm_runtime_resume_and_get() will make sure that device
>>>  usage count keep incremented throughout the config space
>>>  access (or IOCTL access in the previous patch) and the
>>>  runtime PM framework will not move the device into suspended
>>>  state.
>>
>> I think we're inventing problems here.  If we define that config space
>> is not accessible while the device is in low power and the only way to
>> get the device out of low power is via ioctl, then we should be denying
>> access to the device while in low power.  If the user races exiting the
>> device from low power and a config space access, that's their problem.
>>
> 
>  But what about malicious user who intentionally tries to create
>  this sequence. If the platform_pm_engaged check passed and
>  then user put the device into low power state. In that case,
>  there may be chances where config read happens while the device
>  is in low power state.
> 

 Hi Alex,

 I need help in concluding the part below so that I can proceed
 further with my implementation.
 
>  Can we prevent this concurrent access somehow or make sure
>  that nothing else is running when the low power ioctl runs?
> 

 If I add the 'platform_pm_engaged' check alone and return early:
 
 vfio_pci_config_rw()
 {
 ...
     down_read(&vdev->memory_lock);
     if (vdev->platform_pm_engaged) {
         up_read(&vdev->memory_lock);
         return -EIO;
     }
 ...
 }
 
 Then, on the user side, if two threads are running, there is a chance
 that 'platform_pm_engaged' is false at the time of the check but
 becomes true before this function returns. If the runtime PM framework
 then puts the device into the D3cold state, a config read/write may
 happen while the device is internally in D3cold. I added prints
 locally at the entry and exit of this function: with 2 threads created
 from user space, 'platform_pm_engaged' is false at entry but true at
 exit. This is similar to the memory access issue on disabled memory.
 
 So, we need to make sure that the VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
 ioctl request is exclusive and that no other config or ioctl request
 runs in parallel.
 
 Could you or someone else please suggest a way to handle this case?
 
 From my side, I have the following solution to handle this, but I am
 not sure whether it will be acceptable and work for all cases.
 
 1. In a real use case, a config access or any other ioctl should not
    come in parallel with the VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
    ioctl request.
 
 2. Maintain some 'access_count' which will be incremented when we
    do any config space access or ioctl.
 
 3. At the beginning of config space access or ioctl, we can do
    something like this

         down_read(&vdev->memory_lock);
         atomic_inc(&vdev->access_count);
         if (vdev->platform_pm_engaged) {
                 atomic_dec(&vdev->access_count);
                 up_read(&vdev->memory_lock);
                 return -EIO;
         }
         up_read(&vdev->memory_lock);
 
     And before returning, we can decrement the 'access_count'.
 
         down_read(&vdev->memory_lock);
         atomic_dec(&vdev->access_count);
         up_read(&vdev->memory_lock);

     The atomic_dec() is put under 'memory_lock' to maintain the
     lock ordering rules for architectures where atomic_t is
     internally implemented using locks.
 
 4. Inside vfio_pci_core_feature_pm(), we can do something like this
         down_write(&vdev->memory_lock);
         if (atomic_read(&vdev->access_count) != 1) {
                 up_write(&vdev->memory_lock);
                 return -EBUSY;
         }
         vdev->platform_pm_engaged = true;
         up_write(&vdev->memory_lock);
 
 
 5. The idea here is to check the 'access_count' in
    vfio_pci_core_feature_pm(). If 'access_count' is greater than 1,
    that means some other ioctl or config space access is in progress,
    and we return early. Otherwise, we can set 'platform_pm_engaged' and
    release the lock.
 
 6. In case of a race condition, if vfio_pci_core_feature_pm() takes
    the lock and finds 'access_count' equal to 1, it sets
    'platform_pm_engaged'. The config space access or ioctl will then
    see 'platform_pm_engaged' as true and return early.
 
    If the config space access or ioctl happens first, then
    'platform_pm_engaged' will be false and the request will be
    successful, but 'access_count' will remain incremented until the
    end. vfio_pci_core_feature_pm() will then see an 'access_count'
    of 2 and return -EBUSY.
 
 7. For ioctl access, I need to add two callback functions (one for
    start and one for end) in struct vfio_device_ops and call them at
    the start and end of the ioctl from vfio_device_fops_unl_ioctl().
 
 Another option was to add one more lock like 'memory_lock' and hold it
 throughout the config and ioctl access, but maintaining two locks
 won't be easy since 'memory_lock' is already being used inside the
 config and ioctl paths.

 Thanks,
 Abhishek


* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-05-30 11:15           ` Abhishek Sahu
@ 2022-05-30 12:25             ` Jason Gunthorpe
  2022-05-31 12:14               ` Abhishek Sahu
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Gunthorpe @ 2022-05-30 12:25 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Alex Williamson, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:

>  1. In real use case, config or any other ioctl should not come along
>     with VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
>  
>  2. Maintain some 'access_count' which will be incremented when we
>     do any config space access or ioctl.

Please don't open code locks - if you need a lock then write a proper
lock. You can use the 'try' variants to bail out in cases where that
is appropriate.

Jason


* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-05-30 12:25             ` Jason Gunthorpe
@ 2022-05-31 12:14               ` Abhishek Sahu
  2022-05-31 19:43                 ` Jason Gunthorpe
  0 siblings, 1 reply; 41+ messages in thread
From: Abhishek Sahu @ 2022-05-31 12:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Alex Williamson, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:
> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
> 
>>  1. In real use case, config or any other ioctl should not come along
>>     with VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
>>  
>>  2. Maintain some 'access_count' which will be incremented when we
>>     do any config space access or ioctl.
> 
> Please don't open code locks - if you need a lock then write a proper
> lock. You can use the 'try' variants to bail out in cases where that
> is appropriate.
> 
> Jason

 Thanks Jason for providing your inputs.

 In that case, should I introduce a new rw_semaphore (for example,
 'power_lock') and move 'platform_pm_engaged' under 'power_lock'?
 
 I was mainly concerned about the locking rules w.r.t. the existing
 'memory_lock' and the code in
 vfio_pci_zap_and_down_write_memory_lock(), which internally takes
 'mmap_lock' and 'vma_lock'. But from the initial analysis, it seems
 this should not cause any issue, since we should not need 'power_lock'
 in the mmap fault handler or any read/write functions. We can
 maintain the following locking order:
 
   power_lock => memory_lock
 
 1. At the beginning of config space access or ioctl, we can take the
    lock
 
     down_read(&vdev->power_lock);
     if (vdev->platform_pm_engaged) {
         up_read(&vdev->power_lock);
         return -EIO;
     }
 
    And before returning from config or ioctl, we can release the lock.
 
 2.  Now 'platform_pm_engaged' is not protected with 'memory_lock', and
     we need to support the case where VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
     can be called without putting the device into D3hot explicitly.
     So, I need to introduce a second variable which tracks the memory
     disablement (like power_state_d3 in this patch) and will be
     protected with 'memory_lock'. It will be set in both cases: when
     the user changes the power state to D3hot by a config write, and
     when the user makes this ioctl. Inside vfio_pci_core_feature_pm(),
     the code will now become:
    
         down_write(&vdev->power_lock);
         ...
         switch (vfio_pm.low_power_state) {
         case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
                 ...
                         vfio_pci_zap_and_down_write_memory_lock(vdev);
                         vdev->power_state_d3 = true;
                         up_write(&vdev->memory_lock);

         ...
         up_write(&vdev->power_lock);
 
 3.  Inside __vfio_pci_memory_enabled(), we can check
     vdev->power_state_d3 instead of current_state.
 
 4.  For ioctl access, as mentioned previously, I need to add two
     callback functions (one for start and one for end) in struct
     vfio_device_ops and call them at the start and end of the ioctl
     from vfio_device_fops_unl_ioctl().

 Thanks,
 Abhishek


* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-05-31 12:14               ` Abhishek Sahu
@ 2022-05-31 19:43                 ` Jason Gunthorpe
  2022-05-31 22:52                   ` Alex Williamson
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Gunthorpe @ 2022-05-31 19:43 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Alex Williamson, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:
> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:
> > On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
> > 
> >>  1. In real use case, config or any other ioctl should not come along
> >>     with VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
> >>  
> >>  2. Maintain some 'access_count' which will be incremented when we
> >>     do any config space access or ioctl.
> > 
> > Please don't open code locks - if you need a lock then write a proper
> > lock. You can use the 'try' variants to bail out in cases where that
> > is appropriate.
> > 
> > Jason
> 
>  Thanks Jason for providing your inputs.
> 
>  In that case, should I introduce new rw_semaphore (For example
>  power_lock) and move ‘platform_pm_engaged’ under ‘power_lock’ ?

Possibly, this is better than an atomic at least

>  1. At the beginning of config space access or ioctl, we can take the
>     lock
>  
>      down_read(&vdev->power_lock);

You can also do down_read_trylock() here and bail out as you were
suggesting with the atomic.

trylock doesn't have lock ordering rules because it can't sleep, so it
gives a bit more flexibility when designing the lock ordering.

Though userspace has to be able to tolerate the failure, or never make
the request.

>          down_write(&vdev->power_lock);
>          ...
>          switch (vfio_pm.low_power_state) {
>          case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
>                  ...
>                          vfio_pci_zap_and_down_write_memory_lock(vdev);
>                          vdev->power_state_d3 = true;
>                          up_write(&vdev->memory_lock);
> 
>          ...
>          up_write(&vdev->power_lock);

And something checks the power lock before allowing the memory to be
re-enabled?

>  4.  For ioctl access, as mentioned previously I need to add two
>      callback functions (one for start and one for end) in the struct
>      vfio_device_ops and call the same at start and end of ioctl from
>      vfio_device_fops_unl_ioctl().

Not sure I followed this..

Jason


* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-05-31 19:43                 ` Jason Gunthorpe
@ 2022-05-31 22:52                   ` Alex Williamson
  2022-06-01  9:49                     ` Abhishek Sahu
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2022-05-31 22:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Abhishek Sahu, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, 31 May 2022 16:43:04 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:
> > On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:  
> > > On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
> > >   
> > >>  1. In real use case, config or any other ioctl should not come along
> > >>     with VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
> > >>  
> > >>  2. Maintain some 'access_count' which will be incremented when we
> > >>     do any config space access or ioctl.  
> > > 
> > > Please don't open code locks - if you need a lock then write a proper
> > > lock. You can use the 'try' variants to bail out in cases where that
> > > is appropriate.
> > > 
> > > Jason  
> > 
> >  Thanks Jason for providing your inputs.
> > 
> >  In that case, should I introduce new rw_semaphore (For example
> >  power_lock) and move ‘platform_pm_engaged’ under ‘power_lock’ ?  
> 
> Possibly, this is better than an atomic at least
> 
> >  1. At the beginning of config space access or ioctl, we can take the
> >     lock
> >  
> >      down_read(&vdev->power_lock);  
> 
> You can also do down_read_trylock() here and bail out as you were
> suggesting with the atomic.
> 
> trylock doesn't have lock ordering rules because it can't sleep, so it
> gives a bit more flexibility when designing the lock ordering.
> 
> Though userspace has to be able to tolerate the failure, or never make
> the request.
> 
> >          down_write(&vdev->power_lock);
> >          ...
> >          switch (vfio_pm.low_power_state) {
> >          case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
> >                  ...
> >                          vfio_pci_zap_and_down_write_memory_lock(vdev);
> >                          vdev->power_state_d3 = true;
> >                          up_write(&vdev->memory_lock);
> > 
> >          ...
> >          up_write(&vdev->power_lock);  
> 
> > And something checks the power lock before allowing the memory to be
> > re-enabled?
> 
> >  4.  For ioctl access, as mentioned previously I need to add two
> >      callback functions (one for start and one for end) in the struct
> >      vfio_device_ops and call the same at start and end of ioctl from
> >      vfio_device_fops_unl_ioctl().  
> 
> Not sure I followed this..

I'm kinda lost here too.  A couple replies back there was some concern
about race scenarios with multiple user threads accessing the device.
The ones concerning non-deterministic behavior if a user is
concurrently changing power state and performing other accesses are a
non-issue, imo.  I think our goal is only to expand the current
memory_lock to block accesses, including config space, while the device
is in low power, or some approximation bounded by the entry/exit ioctl.

I think the remaining issue is how to do that relative to the fact
that config space access can change the memory enable state and would
therefore need to upgrade the memory_lock read-lock to a write-lock.
For that I think we can simply drop the read-lock, acquire the
write-lock, and re-test the low power state.  If it has changed, that
suggests the user has again raced changing power state with another
access and we can simply drop the lock and return -EIO.

If I'm still misunderstanding, please let me know.  Thanks,

Alex



* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-05-31 22:52                   ` Alex Williamson
@ 2022-06-01  9:49                     ` Abhishek Sahu
  2022-06-01 16:21                       ` Alex Williamson
  0 siblings, 1 reply; 41+ messages in thread
From: Abhishek Sahu @ 2022-06-01  9:49 UTC (permalink / raw)
  To: Alex Williamson, Jason Gunthorpe
  Cc: Cornelia Huck, Yishai Hadas, Shameer Kolothum, Kevin Tian,
	Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas, linux-kernel,
	kvm, linux-pm, linux-pci

On 6/1/2022 4:22 AM, Alex Williamson wrote:
> On Tue, 31 May 2022 16:43:04 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
>> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:
>>> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:  
>>>> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
>>>>   
>>>>>  1. In real use case, config or any other ioctl should not come along
>>>>>     with VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
>>>>>  
>>>>>  2. Maintain some 'access_count' which will be incremented when we
>>>>>     do any config space access or ioctl.  
>>>>
>>>> Please don't open code locks - if you need a lock then write a proper
>>>> lock. You can use the 'try' variants to bail out in cases where that
>>>> is appropriate.
>>>>
>>>> Jason  
>>>
>>>  Thanks Jason for providing your inputs.
>>>
>>>  In that case, should I introduce new rw_semaphore (For example
>>>  power_lock) and move ‘platform_pm_engaged’ under ‘power_lock’ ?  
>>
>> Possibly, this is better than an atomic at least
>>
>>>  1. At the beginning of config space access or ioctl, we can take the
>>>     lock
>>>  
>>>      down_read(&vdev->power_lock);  
>>
>> You can also do down_read_trylock() here and bail out as you were
>> suggesting with the atomic.
>>
>> trylock doesn't have lock ordering rules because it can't sleep, so it
>> gives a bit more flexibility when designing the lock ordering.
>>
>> Though userspace has to be able to tolerate the failure, or never make
>> the request.
>>

 Thanks Alex and Jason for providing your inputs.

 Using down_read_trylock() along with Alex's suggestion seems fine.
 In a real use case, config space access should not happen while the
 device is in a low power state, so returning an error should not
 cause any issue in this case.

>>>          down_write(&vdev->power_lock);
>>>          ...
>>>          switch (vfio_pm.low_power_state) {
>>>          case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
>>>                  ...
>>>                          vfio_pci_zap_and_down_write_memory_lock(vdev);
>>>                          vdev->power_state_d3 = true;
>>>                          up_write(&vdev->memory_lock);
>>>
>>>          ...
>>>          up_write(&vdev->power_lock);  
>>
>> And something checks the power lock before allowing the memory to be
>> re-enabled?
>>
>>>  4.  For ioctl access, as mentioned previously I need to add two
>>>      callback functions (one for start and one for end) in the struct
>>>      vfio_device_ops and call the same at start and end of ioctl from
>>>      vfio_device_fops_unl_ioctl().  
>>
>> Not sure I followed this..
> 
> I'm kinda lost here too.


 I have summarized the things below

 1. In the current patch (v3 8/8), if config space access or ioctl was
    being made by the user when the device is already in low power state,
    then it was waking the device. This wake up was happening with
    pm_runtime_resume_and_get() API in vfio_pci_config_rw() and
    vfio_device_fops_unl_ioctl() (with patch v3 7/8 in this patch series).

 2. Now, it has been decided to return an error instead of waking the
    device if the device is already in a low power state.

 3. Initially I thought to add following code in config space path
    (and similar in ioctl)

        vfio_pci_config_rw() {
            ...
            down_read(&vdev->memory_lock);
            if (vdev->platform_pm_engaged)
            {
                up_read(&vdev->memory_lock);
                return -EIO;
            }
            ...
        }

     And then there was a possibility that a physical config space access
     happens while the device is in D3cold in case of a race condition.

 4.  So, I wanted to add some mechanism so that the low power entry
     ioctl will be serialized with other ioctls or config space access.
     With this, if low power entry gets scheduled first then config/other
     ioctls will fail; otherwise, low power entry will wait.

 5.  For serializing this access, I need to ensure that the lock is held
     throughout the operation. For config space I can add the code in
     vfio_pci_config_rw(). But for ioctls, I was not sure what is the best
     way, since a few ioctls (VFIO_DEVICE_FEATURE_MIGRATION,
     VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE, etc.) are being handled in the
     vfio core layer itself.

 The memory_lock and the variables to track low power are specific to
 vfio-pci, so I need some mechanism by which I can add a low power check
 for each ioctl. For serialization, I need to call a function implemented
 in vfio-pci before the vfio core layer makes the actual ioctl, to grab
 the locks. Similarly, I need to release the lock once the vfio core
 layer finishes the actual ioctl. I have mentioned this problem in the
 above point (point 4 in my earlier mail).

> A couple replies back there was some concern
> about race scenarios with multiple user threads accessing the device.
> The ones concerning non-deterministic behavior if a user is
> concurrently changing power state and performing other accesses are a
> non-issue, imo.  

 What does non-deterministic behavior mean here?
 Is it on the user side, that the user will see different results
 (failure or success) during a race condition, or on the kernel side
 (as explained in point 3 above, where a physical config access
 happens while the device is in D3cold)? My concern here is with the
 latter, where config space access in D3cold can cause a fatal error
 on the system side, as we have seen for memory disablement.

> I think our goal is only to expand the current
> memory_lock to block accesses, including config space, while the device
> is in low power, or some approximation bounded by the entry/exit ioctl.
> 
> I think the remaining issues is how to do that relative to the fact
> that config space access can change the memory enable state and would
> therefore need to upgrade the memory_lock read-lock to a write-lock.
> For that I think we can simply drop the read-lock, acquire the
> write-lock, and re-test the low power state.  If it has changed, that
> suggests the user has again raced changing power state with another
> access and we can simply drop the lock and return -EIO.
> 

 Yes. This looks like a better option. So, just to confirm: I can take
 the memory_lock read-lock at the start of vfio_pci_config_rw() and
 release it just before returning from vfio_pci_config_rw(), and for
 memory-related config access, we will release this lock and re-acquire
 the write version of it. Once the memory write happens, can we then
 downgrade this write lock to a read lock?

 Also, what about ioctls? How can I take and release memory_lock for
 an ioctl? Is it okay to go with patch 7, where we call
 pm_runtime_resume_and_get() before each ioctl, or do we need to do the
 same low power check for ioctls also?
 In the latter case, I am not sure how I should do the implementation so
 that all other ioctls are covered from the vfio core layer itself.

 Thanks,
 Abhishek

> If I'm still misunderstanding, please let me know.  Thanks,
> 
> Alex
> 



* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-06-01  9:49                     ` Abhishek Sahu
@ 2022-06-01 16:21                       ` Alex Williamson
  2022-06-01 17:30                         ` Jason Gunthorpe
  2022-06-02 11:52                         ` Abhishek Sahu
  0 siblings, 2 replies; 41+ messages in thread
From: Alex Williamson @ 2022-06-01 16:21 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Jason Gunthorpe, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Wed, 1 Jun 2022 15:19:07 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 6/1/2022 4:22 AM, Alex Williamson wrote:
> > On Tue, 31 May 2022 16:43:04 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >   
> >> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:  
> >>> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:    
> >>>> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
> >>>>     
> >>>>>  1. In real use case, config or any other ioctl should not come along
> >>>>>     with VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
> >>>>>  
> >>>>>  2. Maintain some 'access_count' which will be incremented when we
> >>>>>     do any config space access or ioctl.    
> >>>>
> >>>> Please don't open code locks - if you need a lock then write a proper
> >>>> lock. You can use the 'try' variants to bail out in cases where that
> >>>> is appropriate.
> >>>>
> >>>> Jason    
> >>>
> >>>  Thanks Jason for providing your inputs.
> >>>
> >>>  In that case, should I introduce new rw_semaphore (For example
> >>>  power_lock) and move ‘platform_pm_engaged’ under ‘power_lock’ ?    
> >>
> >> Possibly, this is better than an atomic at least
> >>  
> >>>  1. At the beginning of config space access or ioctl, we can take the
> >>>     lock
> >>>  
> >>>      down_read(&vdev->power_lock);    
> >>
> >> You can also do down_read_trylock() here and bail out as you were
> >> suggesting with the atomic.
> >>
> >> trylock doesn't have lock ordering rules because it can't sleep, so it
> >> gives a bit more flexibility when designing the lock ordering.
> >>
> >> Though userspace has to be able to tolerate the failure, or never make
> >> the request.
> >>  
> 
>  Thanks Alex and Jason for providing your inputs.
> 
>  Using down_read_trylock() along with Alex's suggestion seems fine.
>  In a real use case, config space access should not happen while the
>  device is in a low power state, so returning an error should not
>  cause any issue in this case.
> 
> >>>          down_write(&vdev->power_lock);
> >>>          ...
> >>>          switch (vfio_pm.low_power_state) {
> >>>          case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
> >>>                  ...
> >>>                          vfio_pci_zap_and_down_write_memory_lock(vdev);
> >>>                          vdev->power_state_d3 = true;
> >>>                          up_write(&vdev->memory_lock);
> >>>
> >>>          ...
> >>>          up_write(&vdev->power_lock);    
> >>
> >> And something checks the power lock before allowing the memory to be
> >> re-enabled?
> >>  
> >>>  4.  For ioctl access, as mentioned previously I need to add two
> >>>      callback functions (one for start and one for end) in the struct
> >>>      vfio_device_ops and call the same at start and end of ioctl from
> >>>      vfio_device_fops_unl_ioctl().    
> >>
> >> Not sure I followed this..  
> > 
> > I'm kinda lost here too.  
> 
> 
>  I have summarized the things below
> 
>  1. In the current patch (v3 8/8), if config space access or ioctl was
>     being made by the user when the device is already in low power state,
>     then it was waking the device. This wake up was happening with
>     pm_runtime_resume_and_get() API in vfio_pci_config_rw() and
>     vfio_device_fops_unl_ioctl() (with patch v3 7/8 in this patch series).
> 
>  2. Now, it has been decided to return error instead of waking the
>     device if the device is already in low power state.
> 
>  3. Initially I thought to add following code in config space path
>     (and similar in ioctl)
> 
>         vfio_pci_config_rw() {
>             ...
>             down_read(&vdev->memory_lock);
>             if (vdev->platform_pm_engaged)
>             {
>                 up_read(&vdev->memory_lock);
>                 return -EIO;
>             }
>             ...
>         }
> 
>      And then there was a possibility that a physical config space access
>      happens while the device is in D3cold in case of a race condition.
> 
>  4.  So, I wanted to add some mechanism so that the low power entry
>      ioctl will be serialized with other ioctls or config space access.
>      With this, if low power entry gets scheduled first then config/other
>      ioctls will fail; otherwise, low power entry will wait.
> 
>  5.  For serializing this access, I need to ensure that lock is held
>      throughout the operation. For config space I can add the code in
>      vfio_pci_config_rw(). But for ioctls, I was not sure what is the best
>      way since few ioctls (VFIO_DEVICE_FEATURE_MIGRATION,
>      VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE etc.) are being handled in the
>      vfio core layer itself.
> 
>  The memory_lock and the variables to track low power are specific to
>  vfio-pci, so I need some mechanism by which I can add a low power check
>  for each ioctl. For serialization, I need to call a function implemented
>  in vfio-pci before the vfio core layer makes the actual ioctl, to grab
>  the locks. Similarly, I need to release the lock once the vfio core
>  layer finishes the actual ioctl. I have mentioned this problem in the
>  above point (point 4 in my earlier mail).
> 
> > A couple replies back there was some concern
> > about race scenarios with multiple user threads accessing the device.
> > The ones concerning non-deterministic behavior if a user is
> > concurrently changing power state and performing other accesses are a
> > non-issue, imo.    
> 
>  What does non-deterministic behavior mean here?
>  Is it on the user side, that the user will see different results
>  (failure or success) during a race condition, or on the kernel side
>  (as explained in point 3 above, where a physical config access
>  happens while the device is in D3cold)? My concern here is with the
>  latter, where config space access in D3cold can cause a fatal error
>  on the system side, as we have seen for memory disablement.

Yes, our only concern should be to prevent such an access.  The user
seeing non-deterministic behavior during concurrent power control and
config space access, where any combination of success/failure is
possible, is par for the course once we decide to block accesses
across the life of the low power state.
 
> > I think our goal is only to expand the current
> > memory_lock to block accesses, including config space, while the device
> > is in low power, or some approximation bounded by the entry/exit ioctl.
> > 
> > I think the remaining issue is how to do that relative to the fact
> > that config space access can change the memory enable state and would
> > therefore need to upgrade the memory_lock read-lock to a write-lock.
> > For that I think we can simply drop the read-lock, acquire the
> > write-lock, and re-test the low power state.  If it has changed, that
> > suggests the user has again raced changing power state with another
> > access and we can simply drop the lock and return -EIO.
> >   
> 
>  Yes. This looks like a better option. So, just to confirm: I can take
>  the memory_lock read-lock at the start of vfio_pci_config_rw() and
>  release it just before returning from vfio_pci_config_rw(), and for
>  memory-related config access, we will release this lock and re-acquire
>  the write version of it. Once the memory write happens, can we then
>  downgrade this write lock to a read lock?

We only need to lock for the device access, so if you've finished that
access after acquiring the write-lock, there'd be no point to then
downgrade that to a read-lock.  The access should be finished by that
point.
 
>  Also, what about ioctls? How can I take and release memory_lock for
>  an ioctl? Is it okay to go with patch 7, where we call
>  pm_runtime_resume_and_get() before each ioctl, or do we need to do the
>  same low power check for ioctls also?
>  In the latter case, I am not sure how I should do the implementation so
>  that all other ioctls are covered from the vfio core layer itself.

Some ioctls clearly cannot occur while the device is in low power, such
as resets and interrupt control, but even less obvious things like
getting region info require device access.  Migration also provides a
channel to device access.  Do we want to manage a list of ioctls that
are allowed in low power, or do we only want to allow the ioctl to exit
low power?

I'm also still curious how we're going to handle devices that cannot
return to low power such as the self-refresh mode on the GPU.  We can
potentially prevent any wake-ups from the vfio device interface, but
that doesn't preclude a wake-up via an external lspci.  I think we need
to understand how we're going to handle such devices before we can
really complete the design.  AIUI, we cannot disable the self-refresh
sleep mode without imposing unreasonable latency and memory
requirements on the guest and we cannot retrigger the self-refresh
low-power mode without non-trivial device specific code.  Thanks,

Alex



* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-06-01 16:21                       ` Alex Williamson
@ 2022-06-01 17:30                         ` Jason Gunthorpe
  2022-06-01 18:15                           ` Alex Williamson
  2022-06-02 11:52                         ` Abhishek Sahu
  1 sibling, 1 reply; 41+ messages in thread
From: Jason Gunthorpe @ 2022-06-01 17:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Abhishek Sahu, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Wed, Jun 01, 2022 at 10:21:51AM -0600, Alex Williamson wrote:

> Some ioctls clearly cannot occur while the device is in low power, such
> as resets and interrupt control, but even less obvious things like
> getting region info require device access.  Migration also provides a
> channel to device access.  

I wonder what power management means in a case like that.

For the migration drivers they all rely on a PF driver that is not
VFIO, so it should be impossible for power management to cause the PF
to stop working.

I would expect any sane design of power management for a VF to not
cause any harm to the migration driver..

> I'm also still curious how we're going to handle devices that cannot
> return to low power such as the self-refresh mode on the GPU.  We can
> potentially prevent any wake-ups from the vfio device interface, but
> that doesn't preclude a wake-up via an external lspci.  I think we need
> to understand how we're going to handle such devices before we can
> really complete the design.  AIUI, we cannot disable the self-refresh
> sleep mode without imposing unreasonable latency and memory
> requirements on the guest and we cannot retrigger the self-refresh
> low-power mode without non-trivial device specific code.

It begs the question if power management should be something that only
a device-specific driver should allow?

Jason


* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-06-01 17:30                         ` Jason Gunthorpe
@ 2022-06-01 18:15                           ` Alex Williamson
  2022-06-01 23:17                             ` Jason Gunthorpe
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2022-06-01 18:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Abhishek Sahu, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Wed, 1 Jun 2022 14:30:54 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Jun 01, 2022 at 10:21:51AM -0600, Alex Williamson wrote:
> 
> > Some ioctls clearly cannot occur while the device is in low power, such
> > as resets and interrupt control, but even less obvious things like
> > getting region info require device access.  Migration also provides a
> > channel to device access.    
> 
> I wonder what power management means in a case like that.
> 
> For the migration drivers they all rely on a PF driver that is not
> VFIO, so it should be impossible for power management to cause the PF
> to stop working.
> 
> I would expect any sane design of power management for a VF to not
> cause any harm to the migration driver..

Is there even a significant benefit or use case for power management
for VFs?  The existing D3hot support should be ok, but I imagine to
support D3cold, all the VFs and the PF would need to move to low power.
It might be safe to simply exclude VFs from providing this feature for
now.

> > I'm also still curious how we're going to handle devices that cannot
> > return to low power such as the self-refresh mode on the GPU.  We can
> > potentially prevent any wake-ups from the vfio device interface, but
> > that doesn't preclude a wake-up via an external lspci.  I think we need
> > to understand how we're going to handle such devices before we can
> > really complete the design.  AIUI, we cannot disable the self-refresh
> > sleep mode without imposing unreasonable latency and memory
> > requirements on the guest and we cannot retrigger the self-refresh
> > low-power mode without non-trivial device specific code.  
> 
> It begs the question if power management should be something that only
> a device-specific driver should allow?

Yes, but that's also penalizing devices that require no special
support for the sake of the few that do.  I'm not opposed to some sort of
vfio-pci-nvidia-gpu variant driver to provide that device specific
support, but I'd think the device table for such a driver might just be
added to the exclusion list for power management support in vfio-pci.
vfio-pci-core would need some way for drivers to opt-out/in for power
management.  Thanks,

Alex



* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-06-01 18:15                           ` Alex Williamson
@ 2022-06-01 23:17                             ` Jason Gunthorpe
  0 siblings, 0 replies; 41+ messages in thread
From: Jason Gunthorpe @ 2022-06-01 23:17 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Abhishek Sahu, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Wed, Jun 01, 2022 at 12:15:47PM -0600, Alex Williamson wrote:
> On Wed, 1 Jun 2022 14:30:54 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Jun 01, 2022 at 10:21:51AM -0600, Alex Williamson wrote:
> > 
> > > Some ioctls clearly cannot occur while the device is in low power, such
> > > as resets and interrupt control, but even less obvious things like
> > > getting region info require device access.  Migration also provides a
> > > channel to device access.    
> > 
> > I wonder what power management means in a case like that.
> > 
> > For the migration drivers they all rely on a PF driver that is not
> > VFIO, so it should be impossible for power management to cause the PF
> > to stop working.
> > 
> > I would expect any sane design of power management for a VF to not
> > cause any harm to the migration driver..
> 
> Is there even a significant benefit or use case for power management
> for VFs?  The existing D3hot support should be ok, but I imagine to
> support D3cold, all the VFs and the PF would need to move to low power.
> It might be safe to simply exclude VFs from providing this feature for
> now.

I know of no use case, I think it would be a good idea to exclude VFs.

> Yes, but that's also penalizing devices that require no special
> support, for the few that do.  I'm not opposed to some sort of
> vfio-pci-nvidia-gpu variant driver to provide that device specific
> support, but I'd think the device table for such a driver might just be
> added to the exclusion list for power management support in vfio-pci.
> vfio-pci-core would need some way for drivers to opt-out/in for power
> management. 

If you think it can be done generically with a small exclusion list
then that probably makes sense.

Jason


* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-06-01 16:21                       ` Alex Williamson
  2022-06-01 17:30                         ` Jason Gunthorpe
@ 2022-06-02 11:52                         ` Abhishek Sahu
  2022-06-02 17:44                           ` Alex Williamson
  1 sibling, 1 reply; 41+ messages in thread
From: Abhishek Sahu @ 2022-06-02 11:52 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 6/1/2022 9:51 PM, Alex Williamson wrote:
> On Wed, 1 Jun 2022 15:19:07 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 6/1/2022 4:22 AM, Alex Williamson wrote:
>>> On Tue, 31 May 2022 16:43:04 -0300
>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>   
>>>> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:  
>>>>> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:    
>>>>>> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
>>>>>>     
>>>>>>>  1. In real use case, config or any other ioctl should not come along
>>>>>>>     with VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
>>>>>>>  
>>>>>>>  2. Maintain some 'access_count' which will be incremented when we
>>>>>>>     do any config space access or ioctl.    
>>>>>>
>>>>>> Please don't open code locks - if you need a lock then write a proper
>>>>>> lock. You can use the 'try' variants to bail out in cases where that
>>>>>> is appropriate.
>>>>>>
>>>>>> Jason    
>>>>>
>>>>>  Thanks Jason for providing your inputs.
>>>>>
>>>>>  In that case, should I introduce new rw_semaphore (For example
>>>>>  power_lock) and move ‘platform_pm_engaged’ under ‘power_lock’ ?    
>>>>
>>>> Possibly, this is better than an atomic at least
>>>>  
>>>>>  1. At the beginning of config space access or ioctl, we can take the
>>>>>     lock
>>>>>  
>>>>>      down_read(&vdev->power_lock);    
>>>>
>>>> You can also do down_read_trylock() here and bail out as you were
>>>> suggesting with the atomic.
>>>>
>>>> trylock doesn't have lock ordering rules because it can't sleep, so it
>>>> gives a bit more flexibility when designing the lock ordering.
>>>>
>>>> Though userspace has to be able to tolerate the failure, or never make
>>>> the request.
>>>>  
>>
>>  Thanks Alex and Jason for providing your inputs.
>>
>>  Using down_read_trylock() along with Alex suggestion seems fine.
>>  In real use case, config space access should not happen when the
>>  device is in low power state so returning error should not
>>  cause any issue in this case.
>>
>>>>>          down_write(&vdev->power_lock);
>>>>>          ...
>>>>>          switch (vfio_pm.low_power_state) {
>>>>>          case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
>>>>>                  ...
>>>>>                          vfio_pci_zap_and_down_write_memory_lock(vdev);
>>>>>                          vdev->power_state_d3 = true;
>>>>>                          up_write(&vdev->memory_lock);
>>>>>
>>>>>          ...
>>>>>          up_write(&vdev->power_lock);    
>>>>
>>>> And something checks the power lock before allowing the memory to be
>>>> re-enabled?
>>>>  
>>>>>  4.  For ioctl access, as mentioned previously I need to add two
>>>>>      callbacks functions (one for start and one for end) in the struct
>>>>>      vfio_device_ops and call the same at start and end of ioctl from
>>>>>      vfio_device_fops_unl_ioctl().    
>>>>
>>>> Not sure I followed this..  
>>>
>>> I'm kinda lost here too.  
>>
>>
>>  I have summarized the things below
>>
>>  1. In the current patch (v3 8/8), if config space access or ioctl was
>>     being made by the user when the device is already in low power state,
>>     then it was waking the device. This wake up was happening with
>>     pm_runtime_resume_and_get() API in vfio_pci_config_rw() and
>>     vfio_device_fops_unl_ioctl() (with patch v3 7/8 in this patch series).
>>
>>  2. Now, it has been decided to return error instead of waking the
>>     device if the device is already in low power state.
>>
>>  3. Initially I thought to add following code in config space path
>>     (and similar in ioctl)
>>
>>         vfio_pci_config_rw() {
>>             ...
>>             down_read(&vdev->memory_lock);
>>             if (vdev->platform_pm_engaged)
>>             {
>>                 up_read(&vdev->memory_lock);
>>                 return -EIO;
>>             }
>>             ...
>>         }
>>
>>      And then there was a possibility that the physical config happens
>>      when the device in D3cold in case of race condition.
>>
>>  4.  So, I wanted to add some mechanism so that the low power entry
>>      ioctl will be serialized with other ioctl or config space. With this
>>      if low power entry gets scheduled first then config/other ioctls will
>>      get failure, otherwise low power entry will wait.
>>
>>  5.  For serializing this access, I need to ensure that lock is held
>>      throughout the operation. For config space I can add the code in
>>      vfio_pci_config_rw(). But for ioctls, I was not sure what is the best
>>      way since few ioctls (VFIO_DEVICE_FEATURE_MIGRATION,
>>      VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE etc.) are being handled in the
>>      vfio core layer itself.
>>
>>  The memory_lock and the variables to track low power are specific to
>>  vfio-pci, so I need some mechanism by which I add a low power check for
>>  each ioctl. For serialization, I need to call function implemented in
>>  vfio-pci before vfio core layer makes the actual ioctl to grab the
>>  locks. Similarly, I need to release the lock once vfio core layer
>>  finished the actual ioctl. I have mentioned about this problem in the
>>  above point (point 4 in my earlier mail).
>>
>>> A couple replies back there was some concern
>>> about race scenarios with multiple user threads accessing the device.
>>> The ones concerning non-deterministic behavior if a user is
>>> concurrently changing power state and performing other accesses are a
>>> non-issue, imo.    
>>
>>  What does non-deterministic behavior mean here?
>>  Does it mean that the user will see different results
>>  (failure or success) during a race condition, or does it refer
>>  to the kernel side (as explained in point 3 above, where a
>>  physical config access happens while the device is in D3cold)?
>>  My concern is the latter case, where config space access in
>>  D3cold can cause a fatal error on the system side, as we have
>>  seen for memory disablement.
> 
> Yes, our only concern should be to prevent such an access.  The user
> seeing non-deterministic behavior (such as during concurrent power
> control and config space access, where all combinations of
> success/failure are possible) is par for the course when we decide to
> block accesses across the life of the low power state.
>  
>>> I think our goal is only to expand the current
>>> memory_lock to block accesses, including config space, while the device
>>> is in low power, or some approximation bounded by the entry/exit ioctl.
>>>
>>> I think the remaining issue is how to do that relative to the fact
>>> that config space access can change the memory enable state and would
>>> therefore need to upgrade the memory_lock read-lock to a write-lock.
>>> For that I think we can simply drop the read-lock, acquire the
>>> write-lock, and re-test the low power state.  If it has changed, that
>>> suggests the user has again raced changing power state with another
>>> access and we can simply drop the lock and return -EIO.
>>>   
>>
>>  Yes. This looks like a better option. So, just to confirm, I can take the
>>  memory_lock read-lock at the start of vfio_pci_config_rw() and
>>  release it just before returning from vfio_pci_config_rw(), and
>>  for memory-related config access, we will release this lock and
>>  re-acquire the write version of it. Once the memory write happens,
>>  can we then downgrade this write lock to a read lock?
> 
> We only need to lock for the device access, so if you've finished that
> access after acquiring the write-lock, there'd be no point to then
> downgrade that to a read-lock.  The access should be finished by that
> point.
>

 I was planning to take the memory_lock read-lock at the beginning of
 vfio_pci_config_rw() and release it just before returning from
 this function. If I don't downgrade back to the read-lock, then the
 release at the end would be called on a lock that was not taken.
 Also, the user can specify a count of any number of bytes, in which
 case vfio_config_do_rw() will be invoked multiple times, and the
 second call would then run without the lock.
  
>>  Also, what about IOCTLs. How can I take and release memory_lock for
>>  ioctl. is it okay to go with Patch 7 where we call
>>  pm_runtime_resume_and_get() before each ioctl or we need to do the
>>  same low power check for ioctl also ?
>>  In the latter case, I am not sure how I should do the implementation so
>>  that all other ioctl are covered from vfio core layer itself.
> 
> Some ioctls clearly cannot occur while the device is in low power, such
> as resets and interrupt control, but even less obvious things like
> getting region info require device access.  Migration also provides a
> channel to device access.  Do we want to manage a list of ioctls that
> are allowed in low power, or do we only want to allow the ioctl to exit
> low power?
> 

 In the previous version of this patch, you mentioned that a safe
 ioctl list would be tough to maintain. So, currently, we want to
 allow the ioctl only for low power exit.

> I'm also still curious how we're going to handle devices that cannot
> return to low power such as the self-refresh mode on the GPU.  We can
> potentially prevent any wake-ups from the vfio device interface, but
> that doesn't preclude a wake-up via an external lspci.  I think we need
> to understand how we're going to handle such devices before we can
> really complete the design.  AIUI, we cannot disable the self-refresh
> sleep mode without imposing unreasonable latency and memory
> requirements on the guest and we cannot retrigger the self-refresh
> low-power mode without non-trivial device specific code.  Thanks,
> 
> Alex
> 

 I am working on adding support to notify the guest through a virtual
 PME whenever a wake-up is triggered by the host while the guest has
 already put the device into the runtime-suspended state. This virtual
 PME will be similar to a physical PME. Normally, if a PCI device
 needs a power management transition, it sends a PME event, which is
 ultimately handled by the host OS. In the virtual PME case, if the
 host needs a power management transition, it sends an event to the
 guest, and the guest OS handles these virtual PME events. The
 following is a summary:

 1. Add support for one more event, similar to VFIO_PCI_ERR_IRQ_INDEX,
    named VFIO_PCI_PME_IRQ_INDEX, and add the required code for this
    virtual PME event.

 2. From the guest side, when the PME IRQ is enabled, we will
    set an event_fd for PME.

 3. In the vfio driver, the PME support bits are already
    virtualized and currently set to 0. We can set PME capability
    support for D3cold so that, in the guest, it looks like:

     Capabilities: [60] Power Management version 3
     Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
            PME(D0-,D1-,D2-,D3hot-,D3cold+)

 4. From the guest side, the guest can enable PME (the PME_En bit in
    the Power Management Control/Status Register), which will again be
    virtualized.

 5. When the host gets a request to resume the device from anywhere
    other than the low power exit ioctl, the device PM usage count
    will be incremented, the PME status (the PME_Status bit in the
    Power Management Control/Status Register) will be set, and then we
    can signal the event_fd.

 6. In PCIe, PME events are handled by the root port. To use the
    low power D3cold feature, a virtual root port must be created on
    the hypervisor side; when the hypervisor receives this PME event,
    it can send a virtual interrupt to that root port.

 7. Taking the Linux kernel as an example, pcie_pme_irq() will
    handle this, perform the runtime resume on the guest side, and
    clear the PME status bit. The guest can then put the device back
    into the suspended state.

 8. I prototyped the above logic in QEMU and observed a wake-up in
    the guest whenever I ran lspci on the host side.

 9. Since currently only the NVIDIA GPU has the limitation of
    requiring driver interaction each time before going into D3cold,
    we can allow re-entry for other devices. We can match on the
    NVIDIA vendor ID (along with the VGA/3D controller class code).
    In the future, if any other device has a similar requirement, we
    can update this list. For other devices, the host can put the
    device back into D3cold after any wake-up.

 10. In the vfio driver, we can enforce all these restrictions for
     enabling PME and return an error if the user tries to make the
     low power entry ioctl without enabling the PME-related support.

 11. The virtual PME can also help in handling physical PME for all
     devices. The PME logic does not depend on the NVIDIA GPU
     restriction. If the virtual PME is enabled by the hypervisor,
     then when a physical PME wakes the device, it will resume on the
     guest side as well.

 Thanks,
 Abhishek

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-06-02 11:52                         ` Abhishek Sahu
@ 2022-06-02 17:44                           ` Alex Williamson
  2022-06-03 10:19                             ` Abhishek Sahu
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2022-06-02 17:44 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Jason Gunthorpe, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Thu, 2 Jun 2022 17:22:03 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 6/1/2022 9:51 PM, Alex Williamson wrote:
> > On Wed, 1 Jun 2022 15:19:07 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> On 6/1/2022 4:22 AM, Alex Williamson wrote:  
> >>> On Tue, 31 May 2022 16:43:04 -0300
> >>> Jason Gunthorpe <jgg@nvidia.com> wrote:
> >>>     
> >>>> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:    
> >>>>> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:      
> >>>>>> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
> >>>>>>       
> >>>>>>>  1. In real use case, config or any other ioctl should not come along
> >>>>>>>     with VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
> >>>>>>>  
> >>>>>>>  2. Maintain some 'access_count' which will be incremented when we
> >>>>>>>     do any config space access or ioctl.      
> >>>>>>
> >>>>>> Please don't open code locks - if you need a lock then write a proper
> >>>>>> lock. You can use the 'try' variants to bail out in cases where that
> >>>>>> is appropriate.
> >>>>>>
> >>>>>> Jason      
> >>>>>
> >>>>>  Thanks Jason for providing your inputs.
> >>>>>
> >>>>>  In that case, should I introduce new rw_semaphore (For example
> >>>>>  power_lock) and move ‘platform_pm_engaged’ under ‘power_lock’ ?      
> >>>>
> >>>> Possibly, this is better than an atomic at least
> >>>>    
> >>>>>  1. At the beginning of config space access or ioctl, we can take the
> >>>>>     lock
> >>>>>  
> >>>>>      down_read(&vdev->power_lock);      
> >>>>
> >>>> You can also do down_read_trylock() here and bail out as you were
> >>>> suggesting with the atomic.
> >>>>
> >>>> trylock doesn't have lock ordering rules because it can't sleep so it
> >>>> gives a bit more flexibility when designing the lock ordering.
> >>>>
> >>>> Though userspace has to be able to tolerate the failure, or never make
> >>>> the request.
> >>>>    
> >>
> >>  Thanks Alex and Jason for providing your inputs.
> >>
> >>  Using down_read_trylock() along with Alex suggestion seems fine.
> >>  In real use case, config space access should not happen when the
> >>  device is in low power state so returning error should not
> >>  cause any issue in this case.
> >>  
> >>>>>          down_write(&vdev->power_lock);
> >>>>>          ...
> >>>>>          switch (vfio_pm.low_power_state) {
> >>>>>          case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
> >>>>>                  ...
> >>>>>                          vfio_pci_zap_and_down_write_memory_lock(vdev);
> >>>>>                          vdev->power_state_d3 = true;
> >>>>>                          up_write(&vdev->memory_lock);
> >>>>>
> >>>>>          ...
> >>>>>          up_write(&vdev->power_lock);      
> >>>>
> >>>> And something checks the power lock before allowing the memory to be
> >>>> re-enabled?
> >>>>    
> >>>>>  4.  For ioctl access, as mentioned previously I need to add two
> >>>>>      callbacks functions (one for start and one for end) in the struct
> >>>>>      vfio_device_ops and call the same at start and end of ioctl from
> >>>>>      vfio_device_fops_unl_ioctl().      
> >>>>
> >>>> Not sure I followed this..    
> >>>
> >>> I'm kinda lost here too.    
> >>
> >>
> >>  I have summarized the things below
> >>
> >>  1. In the current patch (v3 8/8), if config space access or ioctl was
> >>     being made by the user when the device is already in low power state,
> >>     then it was waking the device. This wake up was happening with
> >>     pm_runtime_resume_and_get() API in vfio_pci_config_rw() and
> >>     vfio_device_fops_unl_ioctl() (with patch v3 7/8 in this patch series).
> >>
> >>  2. Now, it has been decided to return error instead of waking the
> >>     device if the device is already in low power state.
> >>
> >>  3. Initially I thought to add following code in config space path
> >>     (and similar in ioctl)
> >>
> >>         vfio_pci_config_rw() {
> >>             ...
> >>             down_read(&vdev->memory_lock);
> >>             if (vdev->platform_pm_engaged)
> >>             {
> >>                 up_read(&vdev->memory_lock);
> >>                 return -EIO;
> >>             }
> >>             ...
> >>         }
> >>
> >>      And then there was a possibility that the physical config happens
> >>      when the device in D3cold in case of race condition.
> >>
> >>  4.  So, I wanted to add some mechanism so that the low power entry
> >>      ioctl will be serialized with other ioctl or config space. With this
> >>      if low power entry gets scheduled first then config/other ioctls will
> >>      get failure, otherwise low power entry will wait.
> >>
> >>  5.  For serializing this access, I need to ensure that lock is held
> >>      throughout the operation. For config space I can add the code in
> >>      vfio_pci_config_rw(). But for ioctls, I was not sure what is the best
> >>      way since few ioctls (VFIO_DEVICE_FEATURE_MIGRATION,
> >>      VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE etc.) are being handled in the
> >>      vfio core layer itself.
> >>
> >>  The memory_lock and the variables to track low power in specific to
> >>  vfio-pci so I need some mechanism by which I add low power check for
> >>  each ioctl. For serialization, I need to call function implemented in
> >>  vfio-pci before vfio core layer makes the actual ioctl to grab the
> >>  locks. Similarly, I need to release the lock once vfio core layer
> >>  finished the actual ioctl. I have mentioned about this problem in the
> >>  above point (point 4 in my earlier mail).
> >>  
> >>> A couple replies back there was some concern
> >>> about race scenarios with multiple user threads accessing the device.
> >>> The ones concerning non-deterministic behavior if a user is
> >>> concurrently changing power state and performing other accesses are a
> >>> non-issue, imo.      
> >>
> >>  What does non-deterministic behavior mean here?
> >>  Is it for user side that user will see different result
> >>  (failure or success) during race condition or in the kernel side
> >>  (as explained in point 3 above where physical config access
> >>  happens when the device in D3cold) ? My concern here is for later
> >>  part where this config space access in D3cold can cause fatal error
> >>  on the system side as we have seen for memory disablement.  
> > 
> > Yes, our only concern should be to prevent such an access.  The user
> > seeing non-deterministic behavior, such as during concurrent power
> > control and config space access, all combinations of success/failure
> > are possible, is par for the course when we decide to block accesses
> > across the life of the low power state.
> >    
> >>> I think our goal is only to expand the current
> >>> memory_lock to block accesses, including config space, while the device
> >>> is in low power, or some approximation bounded by the entry/exit ioctl.
> >>>
> >>> I think the remaining issue is how to do that relative to the fact
> >>> that config space access can change the memory enable state and would
> >>> therefore need to upgrade the memory_lock read-lock to a write-lock.
> >>> For that I think we can simply drop the read-lock, acquire the
> >>> write-lock, and re-test the low power state.  If it has changed, that
> >>> suggests the user has again raced changing power state with another
> >>> access and we can simply drop the lock and return -EIO.
> >>>     
> >>
> >>  Yes. This looks better option. So, just to confirm, I can take the
> >>  memory_lock read-lock at the starting of vfio_pci_config_rw() and
> >>  release it just before returning from vfio_pci_config_rw() and
> >>  for memory related config access, we will release this lock and
> >>  re-acquiring the write version of it. Once the memory write happens,
> >>  then we can downgrade this write lock to read lock ?  
> > 
> > We only need to lock for the device access, so if you've finished that
> > access after acquiring the write-lock, there'd be no point to then
> > downgrade that to a read-lock.  The access should be finished by that
> > point.
> >  
> 
>  I was planning to take memory_lock read-lock at the beginning of
>  vfio_pci_config_rw() and release the same just before returning from
>  this function. If I don't downgrade it back to read-lock, then the
>  release in the end will be called for the lock which has not taken.
>  Also, user can specify count to any number of bytes and then the
>  vfio_config_do_rw() will be invoked multiple times and then in
>  the second call, it will be without lock.

Ok, yes, I can imagine how it might result in a cleaner exit path to do
a downgrade_write().

> >>  Also, what about IOCTLs. How can I take and release memory_lock for
> >>  ioctl. is it okay to go with Patch 7 where we call
> >>  pm_runtime_resume_and_get() before each ioctl or we need to do the
> >>  same low power check for ioctl also ?
> >>  In Later case, I am not sure how should I do the implementation so
> >>  that all other ioctl are covered from vfio core layer itself.  
> > 
> > Some ioctls clearly cannot occur while the device is in low power, such
> > as resets and interrupt control, but even less obvious things like
> > getting region info require device access.  Migration also provides a
> > channel to device access.  Do we want to manage a list of ioctls that
> > are allowed in low power, or do we only want to allow the ioctl to exit
> > low power?
> >   
> 
>  In previous version of this patch, you mentioned that maintaining the
>  safe ioctl list will be tough to maintain. So, currently we wanted to
>  allow the ioctl for low power exit.

Yes, I'm still conflicted about how that would work.
 
> > I'm also still curious how we're going to handle devices that cannot
> > return to low power such as the self-refresh mode on the GPU.  We can
> > potentially prevent any wake-ups from the vfio device interface, but
> > that doesn't preclude a wake-up via an external lspci.  I think we need
> > to understand how we're going to handle such devices before we can
> > really complete the design.  AIUI, we cannot disable the self-refresh
> > sleep mode without imposing unreasonable latency and memory
> > requirements on the guest and we cannot retrigger the self-refresh
> > low-power mode without non-trivial device specific code.  Thanks,
> > 
> > Alex
> >   
> 
>  I am working on adding support to notify guest through virtual PME
>  whenever there is any wake-up triggered by the host and the guest has
>  already put the device into runtime suspended state. This virtual PME
>  will be similar to physical PME. Normally, if PCI device need power
>  management transition, then it sends PME event which will be
>  ultimately handled by host OS. In virtual PME case, if host need power
>  management transition, then it sends event to guest and then guest OS
>  handles these virtual PME events. Following is summary:
> 
>  1. Add the support for one more event like VFIO_PCI_ERR_IRQ_INDEX
>     named VFIO_PCI_PME_IRQ_INDEX and add the required code for this
>     virtual PME event.
> 
>  2. From the guest side, when the PME_IRQ is enabled then we will
>     set event_fd for PME.
> 
>  3. In the vfio driver, the PME support bits are already
>     virtualized and currently set to 0. We can set PME capability support
>     for D3cold so that in guest, it looks like
> 
>      Capabilities: [60] Power Management version 3
>      Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
>             PME(D0-,D1-,D2-,D3hot-,D3cold+)
> 
>  4. From the guest side, it can do PME enable (PME_En bit in Power
>     Management Control/Status Register) which will be again virtualized.
> 
>  5. When host gets request for resuming the device other than from
>     low power ioctl, then device pm usage count will be incremented, the
>     PME status (PME_Status bit in Power Management Control/Status Register)
>     will be set and then we can do the event_fd signal.
> 
>  6. In the PCIe, the PME events will be handled by root port. For
>     using low power D3cold feature, it is required to create virtual root
>     port in hypervisor side and when hypervisor receives this PME event,
>     then it can send virtual interrupt to root port.
> 
>  7. If we take example of Linux kernel, then pcie_pme_irq() will
>     handle this and then do the runtime resume on the guest side. Also, it
>     will clear the PME status bit here. Then guest can put the device
>     again into suspended state.
> 
>  8. I did prototype changes in QEMU for above logic and was getting wake-up
>     in the guest whenever I do lspci on the host side.
> 
>  9. Since currently only nvidia GPU has this limitation to require
>     driver interaction each time before going into D3cold so we can allow
>     the reentry for other device. We can have nvidia vendor (along with
>     VGA/3D controller class code). In future, if any other device also has
>     similar requirement then we can update this list. For other device
>     host can put the device into D3cold in case of any wake-up.
> 
>  10. In the vfio driver, we can put all these restriction for
>      enabling PME and return error if user tries to make low power entry
>      ioctl without enabling the PME related things.
> 
>  11. The virtual PME can help in handling physical PME also for all
>      the devices. The PME logic is not dependent upon nvidia GPU
>      restriction. If virtual PME is enabled by hypervisor, then when
>      physical PME wakes the device, then it will resume on the guest side
>      also.

So if host accesses through things like lspci are going to wake the
device and we can't prevent that, and the solution to that is to notify
the guest to put the device back to low power, then it seems a lot less
important to try to prevent the user from waking the device through
random accesses.  In that context, maybe we do simply wrap all accesses
with pm_runtime_get/put() calls, which eliminates the problem of
maintaining a list of safe ioctls in low power.

I'd probably argue that whether to allow the kernel to put the device
back to low power directly is a policy decision and should therefore be
directed by userspace.  For example the low power entry ioctl would
have a flag to indicate the desired behavior and QEMU might have an
on/off/[auto] vfio-pci device option which allows configuration of that
behavior.  The default auto policy might direct for automatic low-power
re-entry except for NVIDIA VGA/3D class codes and other devices we
discover that need it.  This lets us have an immediate workaround for
devices requiring guest support without a new kernel.

This PME notification to the guest is really something that needs to be
part of the base specification for user managed low power access due to
these sorts of design decisions.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-06-02 17:44                           ` Alex Williamson
@ 2022-06-03 10:19                             ` Abhishek Sahu
  2022-06-07 21:50                               ` Alex Williamson
  0 siblings, 1 reply; 41+ messages in thread
From: Abhishek Sahu @ 2022-06-03 10:19 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 6/2/2022 11:14 PM, Alex Williamson wrote:
> On Thu, 2 Jun 2022 17:22:03 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 6/1/2022 9:51 PM, Alex Williamson wrote:
>>> On Wed, 1 Jun 2022 15:19:07 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> On 6/1/2022 4:22 AM, Alex Williamson wrote:  
>>>>> On Tue, 31 May 2022 16:43:04 -0300
>>>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>>>     
>>>>>> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:    
>>>>>>> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:      
>>>>>>>> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
>>>>>>>>       
>>>>>>>>>  1. In real use case, config or any other ioctl should not come along
>>>>>>>>>     with VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
>>>>>>>>>  
>>>>>>>>>  2. Maintain some 'access_count' which will be incremented when we
>>>>>>>>>     do any config space access or ioctl.      
>>>>>>>>
>>>>>>>> Please don't open code locks - if you need a lock then write a proper
>>>>>>>> lock. You can use the 'try' variants to bail out in cases where that
>>>>>>>> is appropriate.
>>>>>>>>
>>>>>>>> Jason      
>>>>>>>
>>>>>>>  Thanks Jason for providing your inputs.
>>>>>>>
>>>>>>>  In that case, should I introduce new rw_semaphore (For example
>>>>>>>  power_lock) and move ‘platform_pm_engaged’ under ‘power_lock’ ?      
>>>>>>
>>>>>> Possibly, this is better than an atomic at least
>>>>>>    
>>>>>>>  1. At the beginning of config space access or ioctl, we can take the
>>>>>>>     lock
>>>>>>>  
>>>>>>>      down_read(&vdev->power_lock);      
>>>>>>
>>>>>> You can also do down_read_trylock() here and bail out as you were
>>>>>> suggesting with the atomic.
>>>>>>
> >>>>>> trylock doesn't have lock ordering rules because it can't sleep so it
> >>>>>> gives a bit more flexibility when designing the lock ordering.
>>>>>>
>>>>>> Though userspace has to be able to tolerate the failure, or never make
>>>>>> the request.
>>>>>>    
>>>>
>>>>  Thanks Alex and Jason for providing your inputs.
>>>>
>>>>  Using down_read_trylock() along with Alex suggestion seems fine.
>>>>  In real use case, config space access should not happen when the
>>>>  device is in low power state so returning error should not
>>>>  cause any issue in this case.
>>>>  
>>>>>>>          down_write(&vdev->power_lock);
>>>>>>>          ...
>>>>>>>          switch (vfio_pm.low_power_state) {
>>>>>>>          case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
>>>>>>>                  ...
>>>>>>>                          vfio_pci_zap_and_down_write_memory_lock(vdev);
>>>>>>>                          vdev->power_state_d3 = true;
>>>>>>>                          up_write(&vdev->memory_lock);
>>>>>>>
>>>>>>>          ...
>>>>>>>          up_write(&vdev->power_lock);      
>>>>>>
> >>>>>> And something checks the power lock before allowing the memory to be
>>>>>> re-enabled?
>>>>>>    
>>>>>>>  4.  For ioctl access, as mentioned previously I need to add two
>>>>>>>      callbacks functions (one for start and one for end) in the struct
>>>>>>>      vfio_device_ops and call the same at start and end of ioctl from
>>>>>>>      vfio_device_fops_unl_ioctl().      
>>>>>>
>>>>>> Not sure I followed this..    
>>>>>
>>>>> I'm kinda lost here too.    
>>>>
>>>>
>>>>  I have summarized the things below
>>>>
>>>>  1. In the current patch (v3 8/8), if config space access or ioctl was
>>>>     being made by the user when the device is already in low power state,
>>>>     then it was waking the device. This wake up was happening with
>>>>     pm_runtime_resume_and_get() API in vfio_pci_config_rw() and
>>>>     vfio_device_fops_unl_ioctl() (with patch v3 7/8 in this patch series).
>>>>
>>>>  2. Now, it has been decided to return error instead of waking the
>>>>     device if the device is already in low power state.
>>>>
>>>>  3. Initially I thought to add following code in config space path
>>>>     (and similar in ioctl)
>>>>
>>>>         vfio_pci_config_rw() {
>>>>             ...
>>>>             down_read(&vdev->memory_lock);
>>>>             if (vdev->platform_pm_engaged)
>>>>             {
>>>>                 up_read(&vdev->memory_lock);
>>>>                 return -EIO;
>>>>             }
>>>>             ...
>>>>         }
>>>>
>>>>      And then there was a possibility that the physical config happens
>>>>      when the device in D3cold in case of race condition.
>>>>
>>>>  4.  So, I wanted to add some mechanism so that the low power entry
>>>>      ioctl will be serialized with other ioctl or config space. With this
>>>>      if low power entry gets scheduled first then config/other ioctls will
>>>>      get failure, otherwise low power entry will wait.
>>>>
>>>>  5.  For serializing this access, I need to ensure that the lock is held
>>>>      throughout the operation. For config space I can add the code in
>>>>      vfio_pci_config_rw(). But for ioctls, I was not sure what the best
>>>>      way is, since a few ioctls (VFIO_DEVICE_FEATURE_MIGRATION,
>>>>      VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE, etc.) are handled in the
>>>>      vfio core layer itself.
>>>>
>>>>  The memory_lock and the variables to track low power are specific to
>>>>  vfio-pci, so I need some mechanism by which I can add a low power check
>>>>  for each ioctl. For serialization, I need to call a function implemented
>>>>  in vfio-pci before the vfio core layer makes the actual ioctl, to grab
>>>>  the locks. Similarly, I need to release the lock once the vfio core
>>>>  layer has finished the actual ioctl. I mentioned this problem in
>>>>  point 4 of my earlier mail.
>>>>  
>>>>> A couple replies back there was some concern
>>>>> about race scenarios with multiple user threads accessing the device.
>>>>> The ones concerning non-deterministic behavior if a user is
>>>>> concurrently changing power state and performing other accesses are a
>>>>> non-issue, imo.      
>>>>
>>>>  What does non-deterministic behavior mean here?
>>>>  Is it on the user side, where the user will see different results
>>>>  (failure or success) during a race condition, or on the kernel side
>>>>  (as explained in point 3 above, where a physical config access
>>>>  happens while the device is in D3cold)? My concern is the latter
>>>>  case, where a config space access in D3cold can cause a fatal error
>>>>  on the system side, as we have seen for memory disablement.  
>>>
>>> Yes, our only concern should be to prevent such an access.  The user
>>> seeing non-deterministic behavior (during concurrent power
>>> control and config space access, all combinations of success/failure
>>> are possible) is par for the course when we decide to block accesses
>>> across the life of the low power state.
>>>    
>>>>> I think our goal is only to expand the current
>>>>> memory_lock to block accesses, including config space, while the device
>>>>> is in low power, or some approximation bounded by the entry/exit ioctl.
>>>>>
>>>>> I think the remaining issue is how to do that relative to the fact
>>>>> that config space access can change the memory enable state and would
>>>>> therefore need to upgrade the memory_lock read-lock to a write-lock.
>>>>> For that I think we can simply drop the read-lock, acquire the
>>>>> write-lock, and re-test the low power state.  If it has changed, that
>>>>> suggests the user has again raced changing power state with another
>>>>> access and we can simply drop the lock and return -EIO.
>>>>>     
>>>>
>>>>  Yes. This looks like the better option. So, just to confirm, I can take
>>>>  the memory_lock read-lock at the start of vfio_pci_config_rw() and
>>>>  release it just before returning from vfio_pci_config_rw(), and
>>>>  for memory-related config access, we will release this lock and
>>>>  re-acquire the write version of it. Once the memory write happens,
>>>>  can we then downgrade this write lock to a read lock?  
>>>
>>> We only need to lock for the device access, so if you've finished that
>>> access after acquiring the write-lock, there'd be no point to then
>>> downgrade that to a read-lock.  The access should be finished by that
>>> point.
>>>  
>>
>>  I was planning to take the memory_lock read-lock at the beginning of
>>  vfio_pci_config_rw() and release it just before returning from
>>  this function. If I don't downgrade it back to a read-lock, then the
>>  release at the end will be called for a lock which has not been taken.
>>  Also, the user can specify a count of any number of bytes, in which case
>>  vfio_config_do_rw() will be invoked multiple times, and the
>>  second call would then run without the lock.
> 
> Ok, yes, I can imagine how it might result in a cleaner exit path to do
> a downgrade_write().
> 
>>>>  Also, what about ioctls? How can I take and release memory_lock for
>>>>  an ioctl? Is it okay to go with patch 7, where we call
>>>>  pm_runtime_resume_and_get() before each ioctl, or do we need to do the
>>>>  same low power check for ioctls also?
>>>>  In the latter case, I am not sure how I should do the implementation so
>>>>  that all other ioctls are covered from the vfio core layer itself.  
>>>
>>> Some ioctls clearly cannot occur while the device is in low power, such
>>> as resets and interrupt control, but even less obvious things like
>>> getting region info require device access.  Migration also provides a
>>> channel to device access.  Do we want to manage a list of ioctls that
>>> are allowed in low power, or do we only want to allow the ioctl to exit
>>> low power?
>>>   
>>
>>  In the previous version of this patch, you mentioned that a safe
>>  ioctl list would be tough to maintain. So, currently we want to
>>  allow any ioctl to exit low power.
> 
> Yes, I'm still conflicted in how that would work.
>  
>>> I'm also still curious how we're going to handle devices that cannot
>>> return to low power such as the self-refresh mode on the GPU.  We can
>>> potentially prevent any wake-ups from the vfio device interface, but
>>> that doesn't preclude a wake-up via an external lspci.  I think we need
>>> to understand how we're going to handle such devices before we can
>>> really complete the design.  AIUI, we cannot disable the self-refresh
>>> sleep mode without imposing unreasonable latency and memory
>>> requirements on the guest and we cannot retrigger the self-refresh
>>> low-power mode without non-trivial device specific code.  Thanks,
>>>
>>> Alex
>>>   
>>
>>  I am working on adding support to notify the guest through a virtual PME
>>  whenever there is a wake-up triggered by the host and the guest has
>>  already put the device into the runtime-suspended state. This virtual PME
>>  will be similar to a physical PME. Normally, if a PCI device needs a
>>  power management transition, it sends a PME event which is ultimately
>>  handled by the host OS. In the virtual PME case, if the host needs a
>>  power management transition, it sends an event to the guest, and the
>>  guest OS handles these virtual PME events. Following is a summary:
>>
>>  1. Add the support for one more event like VFIO_PCI_ERR_IRQ_INDEX
>>     named VFIO_PCI_PME_IRQ_INDEX and add the required code for this
>>     virtual PME event.
>>
>>  2. From the guest side, when the PME_IRQ is enabled then we will
>>     set event_fd for PME.
>>
>>  3. In the vfio driver, the PME support bits are already
>>     virtualized and currently set to 0. We can set PME capability support
>>     for D3cold so that in guest, it looks like
>>
>>      Capabilities: [60] Power Management version 3
>>      Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
>>             PME(D0-,D1-,D2-,D3hot-,D3cold+)
>>
>>  4. From the guest side, it can do PME enable (PME_En bit in Power
>>     Management Control/Status Register) which will be again virtualized.
>>
>>  5. When host gets request for resuming the device other than from
>>     low power ioctl, then device pm usage count will be incremented, the
>>     PME status (PME_Status bit in Power Management Control/Status Register)
>>     will be set and then we can do the event_fd signal.
>>
>>  6. In PCIe, PME events are handled by the root port. To use the
>>     low power D3cold feature, it is required to create a virtual root
>>     port on the hypervisor side, and when the hypervisor receives this
>>     PME event, it can send a virtual interrupt to the root port.
>>
>>  7. If we take example of Linux kernel, then pcie_pme_irq() will
>>     handle this and then do the runtime resume on the guest side. Also, it
>>     will clear the PME status bit here. Then guest can put the device
>>     again into suspended state.
>>
>>  8. I did prototype changes in QEMU for above logic and was getting wake-up
>>     in the guest whenever I do lspci on the host side.
>>
>>  9. Since currently only the nvidia GPU has this limitation of requiring
>>     driver interaction each time before going into D3cold, we can allow
>>     the re-entry for other devices. We can match on the nvidia vendor ID
>>     (along with the VGA/3D controller class code). In future, if any other
>>     device also has a similar requirement, then we can update this list.
>>     For other devices, the host can put the device into D3cold after any
>>     wake-up.
>>
>>  10. In the vfio driver, we can put all these restriction for
>>      enabling PME and return error if user tries to make low power entry
>>      ioctl without enabling the PME related things.
>>
>>  11. The virtual PME can help in handling physical PME also for all
>>      the devices. The PME logic is not dependent upon nvidia GPU
>>      restriction. If virtual PME is enabled by hypervisor, then when
>>      physical PME wakes the device, then it will resume on the guest side
>>      also.
> 
> So if host accesses through things like lspci are going to wake the
> device and we can't prevent that, and the solution to that is to notify
> the guest to put the device back to low power, then it seems a lot less
> important to try to prevent the user from waking the device through
> random accesses.  In that context, maybe we do simply wrap all accesses
> with pm_runtime_get/put() calls, which eliminates the problem of
> maintaining a list of safe ioctls in low power.
> 

 So wrapping all accesses with pm_runtime_get()/put() will only be
 applicable to ioctls, correct?
 For config space, can we go with the approach discussed earlier, in which
 we return an error?
 
> I'd probably argue that whether to allow the kernel to put the device
> back to low power directly is a policy decision and should therefore be
> directed by userspace.  For example the low power entry ioctl would
> have a flag to indicate the desired behavior and QEMU might have an
> on/off/[auto] vfio-pci device option which allows configuration of that
> behavior.  The default auto policy might direct for automatic low-power
> re-entry except for NVIDIA VGA/3D class codes and other devices we
> discover that need it.  This lets us have an immediate workaround for
> devices requiring guest support without a new kernel.
> 

 Yes. That is better option.
 I will do the changes.
 
> This PME notification to the guest is really something that needs to be
> part of the base specification for user managed low power access due to
> these sorts of design decisions.  Thanks,
> 
> Alex
> 

 Yes. I will include this in my next patch series.

 Regards,
 Abhishek

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-06-03 10:19                             ` Abhishek Sahu
@ 2022-06-07 21:50                               ` Alex Williamson
  2022-06-08 10:12                                 ` Abhishek Sahu
  0 siblings, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2022-06-07 21:50 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Jason Gunthorpe, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 3 Jun 2022 15:49:27 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 6/2/2022 11:14 PM, Alex Williamson wrote:
> > On Thu, 2 Jun 2022 17:22:03 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> On 6/1/2022 9:51 PM, Alex Williamson wrote:  
> >>> On Wed, 1 Jun 2022 15:19:07 +0530
> >>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >>>     
> >>>> On 6/1/2022 4:22 AM, Alex Williamson wrote:    
> >>>>> On Tue, 31 May 2022 16:43:04 -0300
> >>>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
> >>>>>       
> >>>>>> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:      
> >>>>>>> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:        
> >>>>>>>> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
> >>>>>>>>         
> >>>>>>>>>  1. In real use case, config or any other ioctl should not come along
> >>>>>>>>>     with VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
> >>>>>>>>>  
> >>>>>>>>>  2. Maintain some 'access_count' which will be incremented when we
> >>>>>>>>>     do any config space access or ioctl.        
> >>>>>>>>
> >>>>>>>> Please don't open code locks - if you need a lock then write a proper
> >>>>>>>> lock. You can use the 'try' variants to bail out in cases where that
> >>>>>>>> is appropriate.
> >>>>>>>>
> >>>>>>>> Jason        
> >>>>>>>
> >>>>>>>  Thanks Jason for providing your inputs.
> >>>>>>>
> >>>>>>>  In that case, should I introduce new rw_semaphore (For example
> >>>>>>>  power_lock) and move ‘platform_pm_engaged’ under ‘power_lock’ ?        
> >>>>>>
> >>>>>> Possibly, this is better than an atomic at least
> >>>>>>      
> >>>>>>>  1. At the beginning of config space access or ioctl, we can take the
> >>>>>>>     lock
> >>>>>>>  
> >>>>>>>      down_read(&vdev->power_lock);        
> >>>>>>
> >>>>>> You can also do down_read_trylock() here and bail out as you were
> >>>>>> suggesting with the atomic.
> >>>>>>
>>>>>> trylock doesn't have lock ordering rules because it can't sleep so it
>>>>>> gives a bit more flexibility when designing the lock ordering.
> >>>>>>
> >>>>>> Though userspace has to be able to tolerate the failure, or never make
> >>>>>> the request.
> >>>>>>      
> >>>>
> >>>>  Thanks Alex and Jason for providing your inputs.
> >>>>
> >>>>  Using down_read_trylock() along with Alex suggestion seems fine.
> >>>>  In real use case, config space access should not happen when the
> >>>>  device is in low power state so returning error should not
> >>>>  cause any issue in this case.
> >>>>    
> >>>>>>>          down_write(&vdev->power_lock);
> >>>>>>>          ...
> >>>>>>>          switch (vfio_pm.low_power_state) {
> >>>>>>>          case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
> >>>>>>>                  ...
> >>>>>>>                          vfio_pci_zap_and_down_write_memory_lock(vdev);
> >>>>>>>                          vdev->power_state_d3 = true;
> >>>>>>>                          up_write(&vdev->memory_lock);
> >>>>>>>
> >>>>>>>          ...
> >>>>>>>          up_write(&vdev->power_lock);        
> >>>>>>
> >>>>>> And something checks the power lock before allowing the memory to be
> >>>>>> re-enabled?
> >>>>>>      
> >>>>>>>  4.  For ioctl access, as mentioned previously I need to add two
> >>>>>>>      callback functions (one for start and one for end) in the struct
> >>>>>>>      vfio_device_ops and call the same at start and end of ioctl from
> >>>>>>>      vfio_device_fops_unl_ioctl().        
> >>>>>>
> >>>>>> Not sure I followed this..      
> >>>>>
> >>>>> I'm kinda lost here too.      
> >>>>
> >>>>
> >>>>  I have summarized the things below
> >>>>
> >>>>  1. In the current patch (v3 8/8), if config space access or ioctl was
> >>>>     being made by the user when the device is already in low power state,
> >>>>     then it was waking the device. This wake up was happening with
> >>>>     pm_runtime_resume_and_get() API in vfio_pci_config_rw() and
> >>>>     vfio_device_fops_unl_ioctl() (with patch v3 7/8 in this patch series).
> >>>>
> >>>>  2. Now, it has been decided to return error instead of waking the
> >>>>     device if the device is already in low power state.
> >>>>
> >>>>  3. Initially I thought to add following code in config space path
> >>>>     (and similar in ioctl)
> >>>>
> >>>>         vfio_pci_config_rw() {
> >>>>             ...
> >>>>             down_read(&vdev->memory_lock);
> >>>>             if (vdev->platform_pm_engaged)
> >>>>             {
> >>>>                 up_read(&vdev->memory_lock);
> >>>>                 return -EIO;
> >>>>             }
> >>>>             ...
> >>>>         }
> >>>>
> >>>>      And then there was a possibility that the physical config access
> >>>>      happens while the device is in D3cold, in case of a race condition.
> >>>>
> >>>>  4.  So, I wanted to add some mechanism so that the low power entry
> >>>>      ioctl will be serialized with other ioctls or config space access.
> >>>>      With this, if low power entry gets scheduled first then config/other
> >>>>      ioctls will fail; otherwise low power entry will wait.
> >>>>
> >>>>  5.  For serializing this access, I need to ensure that the lock is held
> >>>>      throughout the operation. For config space I can add the code in
> >>>>      vfio_pci_config_rw(). But for ioctls, I was not sure what the best
> >>>>      way is, since a few ioctls (VFIO_DEVICE_FEATURE_MIGRATION,
> >>>>      VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE, etc.) are handled in the
> >>>>      vfio core layer itself.
> >>>>
> >>>>  The memory_lock and the variables to track low power are specific to
> >>>>  vfio-pci, so I need some mechanism by which I can add a low power check
> >>>>  for each ioctl. For serialization, I need to call a function implemented
> >>>>  in vfio-pci before the vfio core layer makes the actual ioctl, to grab
> >>>>  the locks. Similarly, I need to release the lock once the vfio core
> >>>>  layer has finished the actual ioctl. I mentioned this problem in
> >>>>  point 4 of my earlier mail.
> >>>>    
> >>>>> A couple replies back there was some concern
> >>>>> about race scenarios with multiple user threads accessing the device.
> >>>>> The ones concerning non-deterministic behavior if a user is
> >>>>> concurrently changing power state and performing other accesses are a
> >>>>> non-issue, imo.        
> >>>>
> >>>>  What does non-deterministic behavior mean here?
> >>>>  Is it on the user side, where the user will see different results
> >>>>  (failure or success) during a race condition, or on the kernel side
> >>>>  (as explained in point 3 above, where a physical config access
> >>>>  happens while the device is in D3cold)? My concern is the latter
> >>>>  case, where a config space access in D3cold can cause a fatal error
> >>>>  on the system side, as we have seen for memory disablement.    
> >>>
> >>> Yes, our only concern should be to prevent such an access.  The user
> >>> seeing non-deterministic behavior (during concurrent power
> >>> control and config space access, all combinations of success/failure
> >>> are possible) is par for the course when we decide to block accesses
> >>> across the life of the low power state.
> >>>      
> >>>>> I think our goal is only to expand the current
> >>>>> memory_lock to block accesses, including config space, while the device
> >>>>> is in low power, or some approximation bounded by the entry/exit ioctl.
> >>>>>
> >>>>> I think the remaining issue is how to do that relative to the fact
> >>>>> that config space access can change the memory enable state and would
> >>>>> therefore need to upgrade the memory_lock read-lock to a write-lock.
> >>>>> For that I think we can simply drop the read-lock, acquire the
> >>>>> write-lock, and re-test the low power state.  If it has changed, that
> >>>>> suggests the user has again raced changing power state with another
> >>>>> access and we can simply drop the lock and return -EIO.
> >>>>>       
> >>>>
> >>>>  Yes. This looks like the better option. So, just to confirm, I can take
> >>>>  the memory_lock read-lock at the start of vfio_pci_config_rw() and
> >>>>  release it just before returning from vfio_pci_config_rw(), and
> >>>>  for memory-related config access, we will release this lock and
> >>>>  re-acquire the write version of it. Once the memory write happens,
> >>>>  can we then downgrade this write lock to a read lock?    
> >>>
> >>> We only need to lock for the device access, so if you've finished that
> >>> access after acquiring the write-lock, there'd be no point to then
> >>> downgrade that to a read-lock.  The access should be finished by that
> >>> point.
> >>>    
> >>
> >>  I was planning to take the memory_lock read-lock at the beginning of
> >>  vfio_pci_config_rw() and release it just before returning from
> >>  this function. If I don't downgrade it back to a read-lock, then the
> >>  release at the end will be called for a lock which has not been taken.
> >>  Also, the user can specify a count of any number of bytes, in which case
> >>  vfio_config_do_rw() will be invoked multiple times, and the
> >>  second call would then run without the lock.  
> > 
> > Ok, yes, I can imagine how it might result in a cleaner exit path to do
> > a downgrade_write().
> >   
> >>>>  Also, what about ioctls? How can I take and release memory_lock for
> >>>>  an ioctl? Is it okay to go with patch 7, where we call
> >>>>  pm_runtime_resume_and_get() before each ioctl, or do we need to do the
> >>>>  same low power check for ioctls also?
> >>>>  In the latter case, I am not sure how I should do the implementation so
> >>>>  that all other ioctls are covered from the vfio core layer itself.    
> >>>
> >>> Some ioctls clearly cannot occur while the device is in low power, such
> >>> as resets and interrupt control, but even less obvious things like
> >>> getting region info require device access.  Migration also provides a
> >>> channel to device access.  Do we want to manage a list of ioctls that
> >>> are allowed in low power, or do we only want to allow the ioctl to exit
> >>> low power?
> >>>     
> >>
> >>  In the previous version of this patch, you mentioned that a safe
> >>  ioctl list would be tough to maintain. So, currently we want to
> >>  allow any ioctl to exit low power.  
> > 
> > Yes, I'm still conflicted in how that would work.
> >    
> >>> I'm also still curious how we're going to handle devices that cannot
> >>> return to low power such as the self-refresh mode on the GPU.  We can
> >>> potentially prevent any wake-ups from the vfio device interface, but
> >>> that doesn't preclude a wake-up via an external lspci.  I think we need
> >>> to understand how we're going to handle such devices before we can
> >>> really complete the design.  AIUI, we cannot disable the self-refresh
> >>> sleep mode without imposing unreasonable latency and memory
> >>> requirements on the guest and we cannot retrigger the self-refresh
> >>> low-power mode without non-trivial device specific code.  Thanks,
> >>>
> >>> Alex
> >>>     
> >>
> >>  I am working on adding support to notify the guest through a virtual PME
> >>  whenever there is a wake-up triggered by the host and the guest has
> >>  already put the device into the runtime-suspended state. This virtual PME
> >>  will be similar to a physical PME. Normally, if a PCI device needs a
> >>  power management transition, it sends a PME event which is ultimately
> >>  handled by the host OS. In the virtual PME case, if the host needs a
> >>  power management transition, it sends an event to the guest, and the
> >>  guest OS handles these virtual PME events. Following is a summary:
> >>
> >>  1. Add the support for one more event like VFIO_PCI_ERR_IRQ_INDEX
> >>     named VFIO_PCI_PME_IRQ_INDEX and add the required code for this
> >>     virtual PME event.
> >>
> >>  2. From the guest side, when the PME_IRQ is enabled then we will
> >>     set event_fd for PME.
> >>
> >>  3. In the vfio driver, the PME support bits are already
> >>     virtualized and currently set to 0. We can set PME capability support
> >>     for D3cold so that in guest, it looks like
> >>
> >>      Capabilities: [60] Power Management version 3
> >>      Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
> >>             PME(D0-,D1-,D2-,D3hot-,D3cold+)
> >>
> >>  4. From the guest side, it can do PME enable (PME_En bit in Power
> >>     Management Control/Status Register) which will be again virtualized.
> >>
> >>  5. When host gets request for resuming the device other than from
> >>     low power ioctl, then device pm usage count will be incremented, the
> >>     PME status (PME_Status bit in Power Management Control/Status Register)
> >>     will be set and then we can do the event_fd signal.
> >>
> >>  6. In PCIe, PME events are handled by the root port. To use the
> >>     low power D3cold feature, it is required to create a virtual root
> >>     port on the hypervisor side, and when the hypervisor receives this
> >>     PME event, it can send a virtual interrupt to the root port.
> >>
> >>  7. If we take example of Linux kernel, then pcie_pme_irq() will
> >>     handle this and then do the runtime resume on the guest side. Also, it
> >>     will clear the PME status bit here. Then guest can put the device
> >>     again into suspended state.
> >>
> >>  8. I did prototype changes in QEMU for above logic and was getting wake-up
> >>     in the guest whenever I do lspci on the host side.
> >>
> >>  9. Since currently only the nvidia GPU has this limitation of requiring
> >>     driver interaction each time before going into D3cold, we can allow
> >>     the re-entry for other devices. We can match on the nvidia vendor ID
> >>     (along with the VGA/3D controller class code). In future, if any other
> >>     device also has a similar requirement, then we can update this list.
> >>     For other devices, the host can put the device into D3cold after any
> >>     wake-up.
> >>
> >>  10. In the vfio driver, we can put all these restriction for
> >>      enabling PME and return error if user tries to make low power entry
> >>      ioctl without enabling the PME related things.
> >>
> >>  11. The virtual PME can help in handling physical PME also for all
> >>      the devices. The PME logic is not dependent upon nvidia GPU
> >>      restriction. If virtual PME is enabled by hypervisor, then when
> >>      physical PME wakes the device, then it will resume on the guest side
> >>      also.  
> > 
> > So if host accesses through things like lspci are going to wake the
> > device and we can't prevent that, and the solution to that is to notify
> > the guest to put the device back to low power, then it seems a lot less
> > important to try to prevent the user from waking the device through
> > random accesses.  In that context, maybe we do simply wrap all accesses
> > with pm_runtime_get/put() calls, which eliminates the problem of
> > maintaining a list of safe ioctls in low power.
> >   
> 
>  So wrapping all accesses with pm_runtime_get()/put() will only be
>  applicable to ioctls, correct?
>  For config space, can we go with the approach discussed earlier, in which
>  we return an error?

If we need to handle arbitrarily induced wakes from the host, it
doesn't make much sense to restrict those same sort of accesses by the
user through the vfio-device.  It also seems a lot easier to simply do
a pm_get/put() around not only ioctls, but all region accesses to avoid
the sorts of races you previously identified.  Access through mmap
should still arguably fault given that there is no discrete end to such
an access like we have for read/write operations.  Thanks,

Alex
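The pm_get/put() wrapping described above could look roughly like the following kernel-context sketch (not compiled here). The wrapper name and call chain are illustrative assumptions based on this discussion; only pm_runtime_resume_and_get()/pm_runtime_put() are the real runtime PM APIs:

```c
/* Hedged sketch: wrap the region read/write path with runtime-PM
 * get/put so a user access resumes the device for the duration of
 * the access, instead of failing while the device is in low power. */
static ssize_t vfio_pci_rw_wrapped(struct vfio_pci_core_device *vdev,
				   char __user *buf, size_t count,
				   loff_t *ppos, bool iswrite)
{
	ssize_t ret;

	/* Resume the device (or fail) before touching it ... */
	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
	if (ret < 0)
		return ret;

	ret = vfio_pci_rw(vdev, buf, count, ppos, iswrite);

	/* ... and allow it to suspend again once the access completes. */
	pm_runtime_put(&vdev->pdev->dev);
	return ret;
}
```

mmap paths get no such bracket, which matches the point above that mmap access has no discrete end and should instead fault.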



* Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
  2022-06-07 21:50                               ` Alex Williamson
@ 2022-06-08 10:12                                 ` Abhishek Sahu
  0 siblings, 0 replies; 41+ messages in thread
From: Abhishek Sahu @ 2022-06-08 10:12 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Jason Gunthorpe, Cornelia Huck, Yishai Hadas, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 6/8/2022 3:20 AM, Alex Williamson wrote:
> On Fri, 3 Jun 2022 15:49:27 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 6/2/2022 11:14 PM, Alex Williamson wrote:
>>> On Thu, 2 Jun 2022 17:22:03 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> On 6/1/2022 9:51 PM, Alex Williamson wrote:  
>>>>> On Wed, 1 Jun 2022 15:19:07 +0530
>>>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>>>     
>>>>>> On 6/1/2022 4:22 AM, Alex Williamson wrote:    
>>>>>>> On Tue, 31 May 2022 16:43:04 -0300
>>>>>>> Jason Gunthorpe <jgg@nvidia.com> wrote:
>>>>>>>       
>>>>>>>> On Tue, May 31, 2022 at 05:44:11PM +0530, Abhishek Sahu wrote:      
>>>>>>>>> On 5/30/2022 5:55 PM, Jason Gunthorpe wrote:        
>>>>>>>>>> On Mon, May 30, 2022 at 04:45:59PM +0530, Abhishek Sahu wrote:
>>>>>>>>>>         
>>>>>>>>>>>  1. In real use case, config or any other ioctl should not come along
>>>>>>>>>>>     with VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl request.
>>>>>>>>>>>  
>>>>>>>>>>>  2. Maintain some 'access_count' which will be incremented when we
>>>>>>>>>>>     do any config space access or ioctl.        
>>>>>>>>>>
>>>>>>>>>> Please don't open code locks - if you need a lock then write a proper
>>>>>>>>>> lock. You can use the 'try' variants to bail out in cases where that
>>>>>>>>>> is appropriate.
>>>>>>>>>>
>>>>>>>>>> Jason        
>>>>>>>>>
>>>>>>>>>  Thanks Jason for providing your inputs.
>>>>>>>>>
>>>>>>>>>  In that case, should I introduce new rw_semaphore (For example
>>>>>>>>>  power_lock) and move ‘platform_pm_engaged’ under ‘power_lock’ ?        
>>>>>>>>
>>>>>>>> Possibly, this is better than an atomic at least
>>>>>>>>      
>>>>>>>>>  1. At the beginning of config space access or ioctl, we can take the
>>>>>>>>>     lock
>>>>>>>>>  
>>>>>>>>>      down_read(&vdev->power_lock);        
>>>>>>>>
>>>>>>>> You can also do down_read_trylock() here and bail out as you were
>>>>>>>> suggesting with the atomic.
>>>>>>>>
> >>>>>>>> trylock doesn't have lock ordering rules because it can't sleep so it
> >>>>>>>> gives a bit more flexibility when designing the lock ordering.
>>>>>>>>
>>>>>>>> Though userspace has to be able to tolerate the failure, or never make
>>>>>>>> the request.
>>>>>>>>      
>>>>>>
>>>>>>  Thanks Alex and Jason for providing your inputs.
>>>>>>
>>>>>>  Using down_read_trylock() along with Alex suggestion seems fine.
>>>>>>  In real use case, config space access should not happen when the
>>>>>>  device is in low power state so returning error should not
>>>>>>  cause any issue in this case.
>>>>>>    
>>>>>>>>>          down_write(&vdev->power_lock);
>>>>>>>>>          ...
>>>>>>>>>          switch (vfio_pm.low_power_state) {
>>>>>>>>>          case VFIO_DEVICE_LOW_POWER_STATE_ENTER:
>>>>>>>>>                  ...
>>>>>>>>>                          vfio_pci_zap_and_down_write_memory_lock(vdev);
>>>>>>>>>                          vdev->power_state_d3 = true;
>>>>>>>>>                          up_write(&vdev->memory_lock);
>>>>>>>>>
>>>>>>>>>          ...
>>>>>>>>>          up_write(&vdev->power_lock);        
>>>>>>>>
> >>>>>>>> And something checks the power lock before allowing the memory to be
>>>>>>>> re-enabled?
>>>>>>>>      
>>>>>>>>>  4.  For ioctl access, as mentioned previously I need to add two
> >>>>>>>>>      callback functions (one for start and one for end) in the struct
>>>>>>>>>      vfio_device_ops and call the same at start and end of ioctl from
>>>>>>>>>      vfio_device_fops_unl_ioctl().        
>>>>>>>>
>>>>>>>> Not sure I followed this..      
>>>>>>>
>>>>>>> I'm kinda lost here too.      
>>>>>>
>>>>>>
>>>>>>  I have summarized the things below
>>>>>>
>>>>>>  1. In the current patch (v3 8/8), if config space access or ioctl was
>>>>>>     being made by the user when the device is already in low power state,
>>>>>>     then it was waking the device. This wake up was happening with
>>>>>>     pm_runtime_resume_and_get() API in vfio_pci_config_rw() and
>>>>>>     vfio_device_fops_unl_ioctl() (with patch v3 7/8 in this patch series).
>>>>>>
>>>>>>  2. Now, it has been decided to return error instead of waking the
>>>>>>     device if the device is already in low power state.
>>>>>>
>>>>>>  3. Initially I thought to add following code in config space path
>>>>>>     (and similar in ioctl)
>>>>>>
>>>>>>         vfio_pci_config_rw() {
>>>>>>             ...
>>>>>>             down_read(&vdev->memory_lock);
>>>>>>             if (vdev->platform_pm_engaged)
>>>>>>             {
>>>>>>                 up_read(&vdev->memory_lock);
>>>>>>                 return -EIO;
>>>>>>             }
>>>>>>             ...
>>>>>>         }
>>>>>>
> >>>>>>      And then there was a possibility that the physical config access
> >>>>>>      happens while the device is in D3cold, in case of a race condition.
>>>>>>
>>>>>>  4.  So, I wanted to add some mechanism so that the low power entry
>>>>>>      ioctl will be serialized with other ioctl or config space. With this
>>>>>>      if low power entry gets scheduled first then config/other ioctls will
>>>>>>      get failure, otherwise low power entry will wait.
>>>>>>
>>>>>>  5.  For serializing this access, I need to ensure that the lock is held
>>>>>>      throughout the operation. For config space, I can add the code in
>>>>>>      vfio_pci_config_rw(). But for ioctls, I was not sure what the best
>>>>>>      way is, since a few ioctls (VFIO_DEVICE_FEATURE_MIGRATION,
>>>>>>      VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE, etc.) are being handled in
>>>>>>      the vfio core layer itself.
>>>>>>
>>>>>>  The memory_lock and the variables to track low power are specific to
>>>>>>  vfio-pci, so I need some mechanism by which I can add a low power check
>>>>>>  for each ioctl. For serialization, I need to call a function implemented
>>>>>>  in vfio-pci before the vfio core layer makes the actual ioctl, to grab
>>>>>>  the locks. Similarly, I need to release the lock once the vfio core
>>>>>>  layer finishes the actual ioctl. I mentioned this problem above
>>>>>>  (point 4 in my earlier mail).
>>>>>>    
>>>>>>> A couple replies back there was some concern
>>>>>>> about race scenarios with multiple user threads accessing the device.
>>>>>>> The ones concerning non-deterministic behavior if a user is
>>>>>>> concurrently changing power state and performing other accesses are a
>>>>>>> non-issue, imo.        
>>>>>>
>>>>>>  What does non-deterministic behavior mean here?
>>>>>>  Is it on the user side, that the user will see a different result
>>>>>>  (failure or success) during a race condition, or on the kernel side
>>>>>>  (as explained in point 3 above, where a physical config access
>>>>>>  happens while the device is in D3cold)? My concern here is for the
>>>>>>  latter part, where this config space access in D3cold can cause a
>>>>>>  fatal error on the system side, as we have seen for memory disablement.
>>>>>
>>>>> Yes, our only concern should be to prevent such an access.  The user
>>>>> seeing non-deterministic behavior during concurrent power control and
>>>>> config space access (where all combinations of success/failure are
>>>>> possible) is par for the course when we decide to block accesses
>>>>> across the life of the low power state.
>>>>>      
>>>>>>> I think our goal is only to expand the current
>>>>>>> memory_lock to block accesses, including config space, while the device
>>>>>>> is in low power, or some approximation bounded by the entry/exit ioctl.
>>>>>>>
>>>>>>> I think the remaining issues is how to do that relative to the fact
>>>>>>> that config space access can change the memory enable state and would
>>>>>>> therefore need to upgrade the memory_lock read-lock to a write-lock.
>>>>>>> For that I think we can simply drop the read-lock, acquire the
>>>>>>> write-lock, and re-test the low power state.  If it has changed, that
>>>>>>> suggests the user has again raced changing power state with another
>>>>>>> access and we can simply drop the lock and return -EIO.
>>>>>>>       
>>>>>>
>>>>>>  Yes, this looks like a better option. So, just to confirm: I can take
>>>>>>  the memory_lock read-lock at the start of vfio_pci_config_rw() and
>>>>>>  release it just before returning from vfio_pci_config_rw(), and
>>>>>>  for memory related config access, we will release this lock and
>>>>>>  re-acquire the write version of it. Once the memory write happens,
>>>>>>  can we then downgrade this write lock to a read lock?
>>>>>
>>>>> We only need to lock for the device access, so if you've finished that
>>>>> access after acquiring the write-lock, there'd be no point to then
>>>>> downgrade that to a read-lock.  The access should be finished by that
>>>>> point.
>>>>>    
>>>>
>>>>  I was planning to take the memory_lock read-lock at the beginning of
>>>>  vfio_pci_config_rw() and release the same just before returning from
>>>>  this function. If I don't downgrade it back to a read-lock, then the
>>>>  release at the end will be called for a lock which was not taken.
>>>>  Also, the user can specify count as any number of bytes, and then
>>>>  vfio_config_do_rw() will be invoked multiple times; the second call
>>>>  would then run without the lock.
>>>
>>> Ok, yes, I can imagine how it might result in a cleaner exit path to do
>>> a downgrade_write().
>>>   
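[The upgrade/re-test/downgrade flow discussed above can be sketched as a
self-contained userspace model with stubbed rw_semaphore primitives.  The
names vdev, platform_pm_engaged, and config_mem_enable() follow the
snippets in this thread; everything else is illustrative, not the actual
vfio-pci implementation.]

```c
#include <errno.h>
#include <stdbool.h>

/* Stubbed single-threaded model of a kernel rw_semaphore. */
struct rwsem { int readers; bool writer; };
static void down_read(struct rwsem *s)       { s->readers++; }
static void up_read(struct rwsem *s)         { s->readers--; }
static void down_write(struct rwsem *s)      { s->writer = true; }
static void up_write(struct rwsem *s)        { s->writer = false; }
static void downgrade_write(struct rwsem *s) { s->writer = false; s->readers++; }

struct vdev {
    struct rwsem memory_lock;
    bool platform_pm_engaged;   /* hypothetical low power flag */
};

/*
 * Model of the memory-enable path inside a config write.  The caller
 * holds the read-lock on entry and expects to hold it on exit, so
 * both the success and error paths leave a read-lock held.
 */
static int config_mem_enable(struct vdev *vdev)
{
    /* Upgrade: drop the read-lock, then take the write-lock. */
    up_read(&vdev->memory_lock);
    down_write(&vdev->memory_lock);

    /*
     * Re-test after re-acquiring: the user may have raced a low
     * power entry between the unlock and the write-lock.
     */
    if (vdev->platform_pm_engaged) {
        up_write(&vdev->memory_lock);
        down_read(&vdev->memory_lock);
        return -EIO;
    }

    /* ... perform the memory-enable register write here ... */

    /* Downgrade so the common exit path releases a read-lock. */
    downgrade_write(&vdev->memory_lock);
    return 0;
}
```

[In this model the error path deliberately re-takes the read-lock so that
the single unlock at the end of the (modeled) vfio_pci_config_rw() is
always balanced.]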
>>>>>>  Also, what about ioctls? How can I take and release memory_lock for
>>>>>>  an ioctl? Is it okay to go with patch 7, where we call
>>>>>>  pm_runtime_resume_and_get() before each ioctl, or do we need to do
>>>>>>  the same low power check for ioctls also?
>>>>>>  In the latter case, I am not sure how I should do the implementation
>>>>>>  so that all other ioctls are covered from the vfio core layer itself.
>>>>>
>>>>> Some ioctls clearly cannot occur while the device is in low power, such
>>>>> as resets and interrupt control, but even less obvious things like
>>>>> getting region info require device access.  Migration also provides a
>>>>> channel to device access.  Do we want to manage a list of ioctls that
>>>>> are allowed in low power, or do we only want to allow the ioctl to exit
>>>>> low power?
>>>>>     
>>>>
>>>>  In the previous version of this patch, you mentioned that a safe ioctl
>>>>  list would be tough to maintain. So, currently we want to allow only
>>>>  the ioctl for low power exit.
>>>
>>> Yes, I'm still conflicted in how that would work.
>>>    
>>>>> I'm also still curious how we're going to handle devices that cannot
>>>>> return to low power such as the self-refresh mode on the GPU.  We can
>>>>> potentially prevent any wake-ups from the vfio device interface, but
>>>>> that doesn't preclude a wake-up via an external lspci.  I think we need
>>>>> to understand how we're going to handle such devices before we can
>>>>> really complete the design.  AIUI, we cannot disable the self-refresh
>>>>> sleep mode without imposing unreasonable latency and memory
>>>>> requirements on the guest and we cannot retrigger the self-refresh
>>>>> low-power mode without non-trivial device specific code.  Thanks,
>>>>>
>>>>> Alex
>>>>>     
>>>>
>>>>  I am working on adding support to notify the guest through a virtual
>>>>  PME whenever there is any wake-up triggered by the host and the guest
>>>>  has already put the device into the runtime suspended state. This
>>>>  virtual PME will be similar to a physical PME. Normally, if a PCI
>>>>  device needs a power management transition, then it sends a PME event,
>>>>  which will ultimately be handled by the host OS. In the virtual PME
>>>>  case, if the host needs a power management transition, then it sends an
>>>>  event to the guest, and the guest OS handles these virtual PME events.
>>>>  Following is a summary:
>>>>
>>>>  1. Add the support for one more event like VFIO_PCI_ERR_IRQ_INDEX
>>>>     named VFIO_PCI_PME_IRQ_INDEX and add the required code for this
>>>>     virtual PME event.
>>>>
>>>>  2. From the guest side, when the PME_IRQ is enabled, we will
>>>>     set an event_fd for PME.
>>>>
>>>>  3. In the vfio driver, the PME support bits are already
>>>>     virtualized and currently set to 0. We can set PME capability support
>>>>     for D3cold so that in guest, it looks like
>>>>
>>>>      Capabilities: [60] Power Management version 3
>>>>      Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
>>>>             PME(D0-,D1-,D2-,D3hot-,D3cold+)
>>>>
>>>>  4. From the guest side, it can do PME enable (PME_En bit in Power
>>>>     Management Control/Status Register) which will be again virtualized.
>>>>
>>>>  5. When the host gets a request for resuming the device other than from
>>>>     the low power ioctl, then the device pm usage count will be
>>>>     incremented, the PME status (PME_Status bit in the Power Management
>>>>     Control/Status Register) will be set, and then we can signal the
>>>>     event_fd.
>>>>
>>>>  6. In PCIe, PME events are handled by the root port. For using the low
>>>>     power D3cold feature, it is required to create a virtual root port
>>>>     on the hypervisor side, and when the hypervisor receives this PME
>>>>     event, it can send a virtual interrupt to the root port.
>>>>
>>>>  7. If we take the example of the Linux kernel, then pcie_pme_irq() will
>>>>     handle this and then do the runtime resume on the guest side. It
>>>>     will also clear the PME status bit here. Then the guest can put the
>>>>     device back into the suspended state.
>>>>
>>>>  8. I did prototype changes in QEMU for the above logic and was getting
>>>>     a wake-up in the guest whenever I did lspci on the host side.
>>>>
>>>>  9. Since currently only the NVIDIA GPU has the limitation of requiring
>>>>     driver interaction each time before going into D3cold, we can allow
>>>>     the re-entry for other devices. We can match on the NVIDIA vendor ID
>>>>     (along with the VGA/3D controller class code). In future, if any
>>>>     other device also has a similar requirement, then we can update this
>>>>     list. For other devices, the host can put the device back into
>>>>     D3cold in case of any wake-up.
>>>>
>>>>  10. In the vfio driver, we can put all these restrictions on
>>>>      enabling PME and return an error if the user tries the low power
>>>>      entry ioctl without enabling the PME related things.
>>>>
>>>>  11. The virtual PME can help in handling physical PME also for all
>>>>      the devices. The PME logic is not dependent upon the NVIDIA GPU
>>>>      restriction. If virtual PME is enabled by the hypervisor, then when
>>>>      a physical PME wakes the device, it will resume on the guest side
>>>>      also.
>>>
>>> So if host accesses through things like lspci are going to wake the
>>> device and we can't prevent that, and the solution to that is to notify
>>> the guest to put the device back to low power, then it seems a lot less
>>> important to try to prevent the user from waking the device through
>>> random accesses.  In that context, maybe we do simply wrap all accesses
>>> with pm_runtime_get()/put() calls, which eliminates the problem of
>>> maintaining a list of safe ioctls in low power.
>>>   
>>
>>  So wrapping all accesses with pm_runtime_get()/put() will only be
>>  applicable for ioctls, correct?
>>  For config space, can we go with the approach discussed earlier, in
>>  which we return an error?
> 
> If we need to handle arbitrarily induced wakes from the host, it
> doesn't make much sense to restrict those same sort of accesses by the
> user through the vfio-device.  It also seems a lot easier to simply do
> a pm_get/put() around not only ioctls, but all region accesses to avoid
> the sorts of races you previously identified.  Access through mmap
> should still arguably fault given that there is no discrete end to such
> an access like we have for read/write operations.  Thanks,
> 
> Alex
> 

 Thanks Alex for confirming.
 I will do the same.
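
[The agreed approach, bracketing every read/write region access with a
pm_runtime_resume_and_get()/pm_runtime_put() pair, can be modeled in
self-contained userspace C.  The stubbed pm_runtime_* helpers and
vfio_pci_rw_model() below are illustrative only, not the real kernel
APIs' behavior.]

```c
#include <errno.h>
#include <sys/types.h>

/* Stubbed runtime-PM state: a usage counter and a low power flag. */
struct pdev { int usage_count; int in_low_power; };

static int pm_runtime_resume_and_get(struct pdev *pdev)
{
    pdev->usage_count++;
    pdev->in_low_power = 0;     /* resume brings the device to D0 */
    return 0;
}

static void pm_runtime_put(struct pdev *pdev)
{
    if (--pdev->usage_count == 0)
        pdev->in_low_power = 1; /* may re-enter low power when idle */
}

/*
 * Every region read/write is bracketed by get/put, so the device is
 * guaranteed awake for the duration of the access and no explicit
 * low power check is needed in the access path itself.
 */
static ssize_t vfio_pci_rw_model(struct pdev *pdev, char *buf, size_t count)
{
    ssize_t done;

    (void)buf;  /* the real access would fill/drain this buffer */

    if (pm_runtime_resume_and_get(pdev))
        return -EIO;

    /* ... actual BAR/config access happens here, device awake ... */
    done = (ssize_t)count;

    pm_runtime_put(pdev);
    return done;
}
```

[mmap faults would still be blocked separately, since a mapped access has
no discrete end where the put could happen.]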

 Regards,
 Abhishek


Thread overview: 41+ messages
2022-04-25  9:26 [PATCH v3 0/8] vfio/pci: power management changes Abhishek Sahu
2022-04-25  9:26 ` [PATCH v3 1/8] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
2022-04-26  1:42   ` kernel test robot
2022-04-26 14:14     ` Bjorn Helgaas
2022-04-25  9:26 ` [PATCH v3 2/8] vfio/pci: Change the PF power state to D0 before enabling VFs Abhishek Sahu
2022-04-25  9:26 ` [PATCH v3 3/8] vfio/pci: Virtualize PME related registers bits and initialize to zero Abhishek Sahu
2022-04-25  9:26 ` [PATCH v3 4/8] vfio/pci: Add support for setting driver data inside core layer Abhishek Sahu
2022-05-03 17:11   ` Alex Williamson
2022-05-04  0:20     ` Jason Gunthorpe
2022-05-04 10:32       ` Abhishek Sahu
2022-04-25  9:26 ` [PATCH v3 5/8] vfio/pci: Enable runtime PM for vfio_pci_core based drivers Abhishek Sahu
2022-05-04 19:42   ` Alex Williamson
2022-05-05  9:07     ` Abhishek Sahu
2022-04-25  9:26 ` [PATCH v3 6/8] vfio: Invoke runtime PM API for IOCTL request Abhishek Sahu
2022-05-04 19:42   ` Alex Williamson
2022-05-05  9:40     ` Abhishek Sahu
2022-05-09 22:30       ` Alex Williamson
2022-04-25  9:26 ` [PATCH v3 7/8] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
2022-04-25  9:26 ` [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
2022-05-04 19:45   ` Alex Williamson
2022-05-05 12:16     ` Abhishek Sahu
2022-05-09 21:48       ` Alex Williamson
2022-05-10 13:26         ` Abhishek Sahu
2022-05-10 13:30           ` Jason Gunthorpe
2022-05-12 12:27             ` Abhishek Sahu
2022-05-12 12:47               ` Jason Gunthorpe
2022-05-30 11:15           ` Abhishek Sahu
2022-05-30 12:25             ` Jason Gunthorpe
2022-05-31 12:14               ` Abhishek Sahu
2022-05-31 19:43                 ` Jason Gunthorpe
2022-05-31 22:52                   ` Alex Williamson
2022-06-01  9:49                     ` Abhishek Sahu
2022-06-01 16:21                       ` Alex Williamson
2022-06-01 17:30                         ` Jason Gunthorpe
2022-06-01 18:15                           ` Alex Williamson
2022-06-01 23:17                             ` Jason Gunthorpe
2022-06-02 11:52                         ` Abhishek Sahu
2022-06-02 17:44                           ` Alex Williamson
2022-06-03 10:19                             ` Abhishek Sahu
2022-06-07 21:50                               ` Alex Williamson
2022-06-08 10:12                                 ` Abhishek Sahu
