linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/4] vfio/pci: power management changes
@ 2022-05-17 10:02 Abhishek Sahu
  2022-05-17 10:02 ` [PATCH v4 1/4] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Abhishek Sahu @ 2022-05-17 10:02 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

Currently, there is very limited power management support available
in the upstream vfio-pci driver. If there is no user of vfio-pci device,
then it will be moved into D3Hot state. Similarly, if we enable the
runtime power management for vfio-pci device in the guest OS, then the
device is being runtime suspended (for linux guest OS) and the PCI
device will be put into D3hot state (in function
vfio_pm_config_write()). If the D3cold state can be used instead of
D3hot, then it will help in saving maximum power. The D3cold state can't
be possible with native PCI PM. It requires interaction with platform
firmware which is system-specific. To go into low power states
(including D3cold), the runtime PM framework can be used which
internally interacts with PCI and platform firmware and puts the device
into the lowest possible D-States.

This patch series registers the vfio-pci driver with runtime
PM framework and uses the same for moving the physical PCI
device to go into the low power state for unused idle devices.
There will be separate patch series that will add the support
for using runtime PM framework for used idle devices.

The current PM support was added with commit 6eb7018705de ("vfio-pci:
Move idle devices to D3hot power state") where the following point was
mentioned regarding D3cold state.

 "It's tempting to try to use D3cold, but we have no reason to inhibit
  hotplug of idle devices and we might get into a loop of having the
  device disappear before we have a chance to try to use it."

With the runtime PM, if the user want to prevent going into D3cold then
/sys/bus/pci/devices/.../d3cold_allowed can be set to 0 for the
devices where the above functionality is required instead of
disallowing the D3cold state for all the cases.

The BAR access needs to be disabled if device is in D3hot state.
Also, there should not be any config access if device is in D3cold
state. For SR-IOV, the PF power state should be higher than VF's power
state.

* Changes in v4

- Rebased over https://github.com/awilliam/linux-vfio/tree/next.
- Split the patch series into 2 parts. This part contains the patches
  for using runtime PM for unused idle device.
- Used the 'pdev->current_state' for checking if the device in D3 state.
- Adds the check in __vfio_pci_memory_enabled() function itself instead
  of adding power state check at each caller.
- Make vfio_pci_lock_and_set_power_state() global since it is needed
  in different files.
- Used vfio_pci_lock_and_set_power_state() instead of
  vfio_pci_set_power_state() before pci_enable_sriov().
- Inside vfio_pci_core_sriov_configure(), handled both the cases
  (the device is in low power state with and without user).
- Used list_for_each_entry_continue_reverse() in
  vfio_pci_dev_set_pm_runtime_get().

* Changes in v3
  (https://lore.kernel.org/lkml/20220425092615.10133-1-abhsahu@nvidia.com)

- Rebased patches on v5.18-rc3.
- Marked this series as PATCH instead of RFC.
- Addressed the review comments given in v2.
- Removed the limitation to keep device in D0 state if there is any
  access from host side. This is specific to NVIDIA use case and
  will be handled separately.
- Used the existing DEVICE_FEATURE IOCTL itself instead of adding new
  IOCTL for power management.
- Removed all custom code related with power management in runtime
  suspend/resume callbacks and IOCTL handling. Now, the callbacks
  contain code related with INTx handling and few other stuffs and
  all the PCI state and platform PM handling will be done by PCI core
  functions itself.
- Add the support of wake-up in main vfio layer itself since now we have
  more vfio/pci based drivers.
- Instead of assigning the 'struct dev_pm_ops' in individual parent
  driver, now the vfio_pci_core tself assigns the 'struct dev_pm_ops'. 
- Added handling of power management around SR-IOV handling.
- Moved the setting of drvdata in a separate patch.
- Masked INTx before during runtime suspended state.
- Changed the order of patches so that Fix related things are at beginning
  of this patch series.
- Removed storing the power state locally and used one new boolean to
  track the d3 (D3cold and D3hot) power state 
- Removed check for IO access in D3 power state.
- Used another helper function vfio_lock_and_set_power_state() instead
  of touching vfio_pci_set_power_state().
- Considered the fixes made in
  https://lore.kernel.org/lkml/20220217122107.22434-1-abhsahu@nvidia.com
  and updated the patches accordingly.

* Changes in v2
  (https://lore.kernel.org/lkml/20220124181726.19174-1-abhsahu@nvidia.com)

- Rebased patches on v5.17-rc1.
- Included the patch to handle BAR access in D3cold.
- Included the patch to fix memory leak.
- Made a separate IOCTL that can be used to change the power state from
  D3hot to D3cold and D3cold to D0.
- Addressed the review comments given in v1.

* v1
  https://lore.kernel.org/lkml/20211115133640.2231-1-abhsahu@nvidia.com/

Abhishek Sahu (4):
  vfio/pci: Invalidate mmaps and block the access in D3hot power state
  vfio/pci: Change the PF power state to D0 before enabling VFs
  vfio/pci: Virtualize PME related registers bits and initialize to zero
  vfio/pci: Move the unused device into low power state with runtime PM

 drivers/vfio/pci/vfio_pci_config.c |  40 +++++-
 drivers/vfio/pci/vfio_pci_core.c   | 195 +++++++++++++++++++++--------
 include/linux/vfio_pci_core.h      |   2 +
 3 files changed, 181 insertions(+), 56 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v4 1/4] vfio/pci: Invalidate mmaps and block the access in D3hot power state
  2022-05-17 10:02 [PATCH v4 0/4] vfio/pci: power management changes Abhishek Sahu
@ 2022-05-17 10:02 ` Abhishek Sahu
  2022-05-17 23:30   ` kernel test robot
  2022-05-17 10:02 ` [PATCH v4 2/4] vfio/pci: Change the PF power state to D0 before enabling VFs Abhishek Sahu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Abhishek Sahu @ 2022-05-17 10:02 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

According to [PCIe v5 5.3.1.4.1] for D3hot state

 "Configuration and Message requests are the only TLPs accepted by a
  Function in the D3Hot state. All other received Requests must be
  handled as Unsupported Requests, and all received Completions may
  optionally be handled as Unexpected Completions."

Currently, if the vfio PCI device has been put into D3hot state and if
user makes non-config related read/write request in D3hot state, these
requests will be forwarded to the host and this access may cause
issues on a few systems.

This patch leverages the memory-disable support added in commit
'abafbc551fdd ("vfio-pci: Invalidate mmaps and block MMIO access on
disabled memory")' to generate page fault on mmap access and
return error for the direct read/write. If the device is D3hot state,
then the error will be returned for MMIO access. The IO access generally
does not make the system unresponsive so the IO access can still happen
in D3hot state. The default value should be returned in this case
without bringing down the complete system.

Also, the power related structure fields need to be protected so
we can use the same 'memory_lock' to protect these fields also.
This protection is mainly needed when user changes the PCI
power state by writing into PCI_PM_CTRL register.
vfio_pci_lock_and_set_power_state() wrapper function will take the
required locks and then it will invoke the vfio_pci_set_power_state().

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c |  7 +++++--
 drivers/vfio/pci/vfio_pci_core.c   | 16 ++++++++++++++++
 include/linux/vfio_pci_core.h      |  2 ++
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 6e58b4bf7a60..d9077627117f 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -402,11 +402,14 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
 	u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
 
 	/*
+	 * Memory region cannot be accessed if device power state is D3.
+	 *
 	 * SR-IOV VF memory enable is handled by the MSE bit in the
 	 * PF SR-IOV capability, there's therefore no need to trigger
 	 * faults based on the virtual value.
 	 */
-	return pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY);
+	return pdev->current_state < PCI_D3hot &&
+	       (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY));
 }
 
 /*
@@ -718,7 +721,7 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
 			break;
 		}
 
-		vfio_pci_set_power_state(vdev, state);
+		vfio_pci_lock_and_set_power_state(vdev, state);
 	}
 
 	return count;
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 05a3aa95ba52..b9f222ca48cf 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -255,6 +255,22 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	return ret;
 }
 
+/*
+ * It takes all the required locks to protect the access of power related
+ * variables and then invokes vfio_pci_set_power_state().
+ */
+void vfio_pci_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
+				       pci_power_t state)
+{
+	if (state >= PCI_D3hot)
+		vfio_pci_zap_and_down_write_memory_lock(vdev);
+	else
+		down_write(&vdev->memory_lock);
+
+	vfio_pci_set_power_state(vdev, state);
+	up_write(&vdev->memory_lock);
+}
+
 int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 23c176d4b073..8f20056e0b8d 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -189,6 +189,8 @@ extern int vfio_pci_register_dev_region(struct vfio_pci_core_device *vdev,
 
 extern int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev,
 				    pci_power_t state);
+extern void vfio_pci_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
+					      pci_power_t state);
 
 extern bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev);
 extern void vfio_pci_zap_and_down_write_memory_lock(struct vfio_pci_core_device
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 2/4] vfio/pci: Change the PF power state to D0 before enabling VFs
  2022-05-17 10:02 [PATCH v4 0/4] vfio/pci: power management changes Abhishek Sahu
  2022-05-17 10:02 ` [PATCH v4 1/4] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
@ 2022-05-17 10:02 ` Abhishek Sahu
  2022-05-17 18:27   ` Alex Williamson
  2022-05-17 10:02 ` [PATCH v4 3/4] vfio/pci: Virtualize PME related registers bits and initialize to zero Abhishek Sahu
  2022-05-17 10:02 ` [PATCH v4 4/4] vfio/pci: Move the unused device into low power state with runtime PM Abhishek Sahu
  3 siblings, 1 reply; 12+ messages in thread
From: Abhishek Sahu @ 2022-05-17 10:02 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

According to [PCIe v5 9.6.2] for PF Device Power Management States

 "The PF's power management state (D-state) has global impact on its
  associated VFs. If a VF does not implement the Power Management
  Capability, then it behaves as if it is in an equivalent
  power state of its associated PF.

  If a VF implements the Power Management Capability, the Device behavior
  is undefined if the PF is placed in a lower power state than the VF.
  Software should avoid this situation by placing all VFs in lower power
  state before lowering their associated PF's power state."

From the vfio driver side, user can enable SR-IOV when the PF is in D3hot
state. If VF does not implement the Power Management Capability, then
the VF will be actually in D3hot state and then the VF BAR access will
fail. If VF implements the Power Management Capability, then VF will
assume that its current power state is D0 when the PF is D3hot and
in this case, the behavior is undefined.

To support PF power management, we need to create power management
dependency between PF and its VF's. The runtime power management support
may help with this where power management dependencies are supported
through device links. But till we have such support in place, we can
disallow the PF to go into low power state, if PF has VF enabled.
There can be a case, where user first enables the VF's and then
disables the VF's. If there is no user of PF, then the PF can put into
D3hot state again. But with this patch, the PF will still be in D0
state after disabling VF's since detecting this case inside
vfio_pci_core_sriov_configure() requires access to
struct vfio_device::open_count along with its locks. But the subsequent
patches related to runtime PM will handle this case since runtime PM
maintains its own usage count.

Also, vfio_pci_core_sriov_configure() can be called at any time
(with and without vfio pci device user), so the power state change
needs to be protected with the required locks.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index b9f222ca48cf..4fe9a4efc751 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -217,6 +217,10 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	bool needs_restore = false, needs_save = false;
 	int ret;
 
+	/* Prevent changing power state for PFs with VFs enabled */
+	if (pci_num_vf(pdev) && state > PCI_D0)
+		return -EBUSY;
+
 	if (vdev->needs_pm_restore) {
 		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
 			pci_save_state(pdev);
@@ -1960,6 +1964,13 @@ int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
 		}
 		list_add_tail(&vdev->sriov_pfs_item, &vfio_pci_sriov_pfs);
 		mutex_unlock(&vfio_pci_sriov_pfs_mutex);
+
+		/*
+		 * The PF power state should always be higher than the VF power
+		 * state. If PF is in the low power state, then change the
+		 * power state to D0 first before enabling SR-IOV.
+		 */
+		vfio_pci_lock_and_set_power_state(vdev, PCI_D0);
 		ret = pci_enable_sriov(pdev, nr_virtfn);
 		if (ret)
 			goto out_del;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 3/4] vfio/pci: Virtualize PME related registers bits and initialize to zero
  2022-05-17 10:02 [PATCH v4 0/4] vfio/pci: power management changes Abhishek Sahu
  2022-05-17 10:02 ` [PATCH v4 1/4] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
  2022-05-17 10:02 ` [PATCH v4 2/4] vfio/pci: Change the PF power state to D0 before enabling VFs Abhishek Sahu
@ 2022-05-17 10:02 ` Abhishek Sahu
  2022-05-17 10:02 ` [PATCH v4 4/4] vfio/pci: Move the unused device into low power state with runtime PM Abhishek Sahu
  3 siblings, 0 replies; 12+ messages in thread
From: Abhishek Sahu @ 2022-05-17 10:02 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

If any PME event will be generated by PCI, then it will be mostly
handled in the host by the root port PME code. For example, in the case
of PCIe, the PME event will be sent to the root port and then the PME
interrupt will be generated. This will be handled in
drivers/pci/pcie/pme.c at the host side. Inside this, the
pci_check_pme_status() will be called where PME_Status and PME_En bits
will be cleared. So, the guest OS which is using vfio-pci device will
not come to know about this PME event.

To handle these PME events inside guests, we need some framework so
that if any PME events will happen, then it needs to be forwarded to
virtual machine monitor. We can virtualize PME related registers bits
and initialize these bits to zero so vfio-pci device user will assume
that it is not capable of asserting the PME# signal from any power state.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 33 +++++++++++++++++++++++++++++-
 1 file changed, 32 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index d9077627117f..188108d28fcd 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -741,12 +741,29 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
 	 */
 	p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
 
+	/*
+	 * The guests can't process PME events. If any PME event will be
+	 * generated, then it will be mostly handled in the host and the
+	 * host will clear the PME_STATUS. So virtualize PME_Support bits.
+	 * The vconfig bits will be cleared during device capability
+	 * initialization.
+	 */
+	p_setw(perm, PCI_PM_PMC, PCI_PM_CAP_PME_MASK, NO_WRITE);
+
 	/*
 	 * Power management is defined *per function*, so we can let
 	 * the user change power state, but we trap and initiate the
 	 * change ourselves, so the state bits are read-only.
+	 *
+	 * The guest can't process PME from D3cold so virtualize PME_Status
+	 * and PME_En bits. The vconfig bits will be cleared during device
+	 * capability initialization.
 	 */
-	p_setd(perm, PCI_PM_CTRL, NO_VIRT, ~PCI_PM_CTRL_STATE_MASK);
+	p_setd(perm, PCI_PM_CTRL,
+	       PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
+	       ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
+		 PCI_PM_CTRL_STATE_MASK));
+
 	return 0;
 }
 
@@ -1415,6 +1432,17 @@ static int vfio_ext_cap_len(struct vfio_pci_core_device *vdev, u16 ecap, u16 epo
 	return 0;
 }
 
+static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
+					 int offset)
+{
+	__le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
+	__le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];
+
+	/* Clear vconfig PME_Support, PME_Status, and PME_En bits */
+	*pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
+	*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
+}
+
 static int vfio_fill_vconfig_bytes(struct vfio_pci_core_device *vdev,
 				   int offset, int size)
 {
@@ -1538,6 +1566,9 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
 		if (ret)
 			return ret;
 
+		if (cap == PCI_CAP_ID_PM)
+			vfio_update_pm_vconfig_bytes(vdev, pos);
+
 		prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
 		pos = next;
 		caps++;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v4 4/4] vfio/pci: Move the unused device into low power state with runtime PM
  2022-05-17 10:02 [PATCH v4 0/4] vfio/pci: power management changes Abhishek Sahu
                   ` (2 preceding siblings ...)
  2022-05-17 10:02 ` [PATCH v4 3/4] vfio/pci: Virtualize PME related registers bits and initialize to zero Abhishek Sahu
@ 2022-05-17 10:02 ` Abhishek Sahu
  2022-05-17 20:02   ` Alex Williamson
  2022-05-17 20:42   ` Alex Williamson
  3 siblings, 2 replies; 12+ messages in thread
From: Abhishek Sahu @ 2022-05-17 10:02 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

Currently, there is very limited power management support
available in the upstream vfio_pci_core based drivers. If there
are no users of the device, then the PCI device will be moved into
D3hot state by writing directly into PCI PM registers. This D3hot
state help in saving power but we can achieve zero power consumption
if we go into the D3cold state. The D3cold state cannot be possible
with native PCI PM. It requires interaction with platform firmware
which is system-specific. To go into low power states (including D3cold),
the runtime PM framework can be used which internally interacts with PCI
and platform firmware and puts the device into the lowest possible
D-States.

This patch registers vfio_pci_core based drivers with the
runtime PM framework.

1. The PCI core framework takes care of most of the runtime PM
   related things. For enabling the runtime PM, the PCI driver needs to
   decrement the usage count and needs to provide 'struct dev_pm_ops'
   at least. The runtime suspend/resume callbacks are optional and needed
   only if we need to do any extra handling. Now there are multiple
   vfio_pci_core based drivers. Instead of assigning the
   'struct dev_pm_ops' in individual parent driver, the vfio_pci_core
   itself assigns the 'struct dev_pm_ops'. There are other drivers where
   the 'struct dev_pm_ops' is being assigned inside core layer
   (For example, wlcore_probe() and some sound based driver, etc.).

2. This patch provides the stub implementation of 'struct dev_pm_ops'.
   The subsequent patch will provide the runtime suspend/resume
   callbacks. All the config state saving, and PCI power management
   related things will be done by PCI core framework itself inside its
   runtime suspend/resume callbacks (pci_pm_runtime_suspend() and
   pci_pm_runtime_resume()).

3. Inside pci_reset_bus(), all the devices in dev_set needs to be
   runtime resumed. vfio_pci_dev_set_pm_runtime_get() will take
   care of the runtime resume and its error handling.

4. Inside vfio_pci_core_disable(), the device usage count always needs
   to be decremented which was incremented in vfio_pci_core_enable().

5. Since the runtime PM framework will provide the same functionality,
   so directly writing into PCI PM config register can be replaced with
   the use of runtime PM routines. Also, the use of runtime PM can help
   us in more power saving.

   In the systems which do not support D3cold,

   With the existing implementation:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3hot
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D0

   With runtime PM:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3hot
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D3hot

   So, with runtime PM, the upstream bridge or root port will also go
   into lower power state which is not possible with existing
   implementation.

   In the systems which support D3cold,

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3hot
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D0

   With runtime PM:

   // PCI device
   # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
   D3cold
   // upstream bridge
   # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
   D3cold

   So, with runtime PM, both the PCI device and upstream bridge will
   go into D3cold state.

6. If 'disable_idle_d3' module parameter is set, then also the runtime
   PM will be enabled, but in this case, the usage count should not be
   decremented.

7. vfio_pci_dev_set_try_reset() return value is unused now, so this
   function return type can be changed to void.

8. Use the runtime PM API's in vfio_pci_core_sriov_configure().
   The device can be in low power state either with runtime
   power management (when there is no user) or PCI_PM_CTRL register
   write by the user. In both the cases, the PF should be moved to
   D0 state. For preventing any runtime usage mismatch, pci_num_vf()
   has been called explicitly during disable.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 172 +++++++++++++++++++++----------
 1 file changed, 117 insertions(+), 55 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 4fe9a4efc751..5ea1b3099036 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -156,7 +156,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
 }
 
 struct vfio_pci_group_info;
-static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
+static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
 static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 				      struct vfio_pci_group_info *groups);
 
@@ -275,6 +275,19 @@ void vfio_pci_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
 	up_write(&vdev->memory_lock);
 }
 
+#ifdef CONFIG_PM
+/*
+ * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
+ * so use structure without any callbacks.
+ *
+ * The pci-driver core runtime PM routines always save the device state
+ * before going into suspended state. If the device is going into low power
+ * state with only with runtime PM ops, then no explicit handling is needed
+ * for the devices which have NoSoftRst-.
+ */
+static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
+#endif
+
 int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -282,21 +295,23 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 	u16 cmd;
 	u8 msix_pos;
 
-	vfio_pci_set_power_state(vdev, PCI_D0);
+	if (!disable_idle_d3) {
+		ret = pm_runtime_resume_and_get(&pdev->dev);
+		if (ret < 0)
+			return ret;
+	}
 
 	/* Don't allow our initial saved state to include busmaster */
 	pci_clear_master(pdev);
 
 	ret = pci_enable_device(pdev);
 	if (ret)
-		return ret;
+		goto out_power;
 
 	/* If reset fails because of the device lock, fail this path entirely */
 	ret = pci_try_reset_function(pdev);
-	if (ret == -EAGAIN) {
-		pci_disable_device(pdev);
-		return ret;
-	}
+	if (ret == -EAGAIN)
+		goto out_disable_device;
 
 	vdev->reset_works = !ret;
 	pci_save_state(pdev);
@@ -320,12 +335,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 	}
 
 	ret = vfio_config_init(vdev);
-	if (ret) {
-		kfree(vdev->pci_saved_state);
-		vdev->pci_saved_state = NULL;
-		pci_disable_device(pdev);
-		return ret;
-	}
+	if (ret)
+		goto out_free_state;
 
 	msix_pos = pdev->msix_cap;
 	if (msix_pos) {
@@ -346,6 +357,16 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 
 
 	return 0;
+
+out_free_state:
+	kfree(vdev->pci_saved_state);
+	vdev->pci_saved_state = NULL;
+out_disable_device:
+	pci_disable_device(pdev);
+out_power:
+	if (!disable_idle_d3)
+		pm_runtime_put(&pdev->dev);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_enable);
 
@@ -453,8 +474,11 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 out:
 	pci_disable_device(pdev);
 
-	if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D3hot);
+	vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
+
+	/* Put the pm-runtime usage counter acquired during enable */
+	if (!disable_idle_d3)
+		pm_runtime_put(&pdev->dev);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
 
@@ -1839,10 +1863,11 @@ EXPORT_SYMBOL_GPL(vfio_pci_core_uninit_device);
 int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
+	struct device *dev = &pdev->dev;
 	int ret;
 
 	/* Drivers must set the vfio_pci_core_device to their drvdata */
-	if (WARN_ON(vdev != dev_get_drvdata(&vdev->pdev->dev)))
+	if (WARN_ON(vdev != dev_get_drvdata(dev)))
 		return -EINVAL;
 
 	if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
@@ -1884,19 +1909,24 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 
 	vfio_pci_probe_power_state(vdev);
 
-	if (!disable_idle_d3) {
-		/*
-		 * pci-core sets the device power state to an unknown value at
-		 * bootup and after being removed from a driver.  The only
-		 * transition it allows from this unknown state is to D0, which
-		 * typically happens when a driver calls pci_enable_device().
-		 * We're not ready to enable the device yet, but we do want to
-		 * be able to get to D3.  Therefore first do a D0 transition
-		 * before going to D3.
-		 */
-		vfio_pci_set_power_state(vdev, PCI_D0);
-		vfio_pci_set_power_state(vdev, PCI_D3hot);
-	}
+	/*
+	 * pci-core sets the device power state to an unknown value at
+	 * bootup and after being removed from a driver.  The only
+	 * transition it allows from this unknown state is to D0, which
+	 * typically happens when a driver calls pci_enable_device().
+	 * We're not ready to enable the device yet, but we do want to
+	 * be able to get to D3.  Therefore first do a D0 transition
+	 * before enabling runtime PM.
+	 */
+	vfio_pci_set_power_state(vdev, PCI_D0);
+
+#if defined(CONFIG_PM)
+	dev->driver->pm = &vfio_pci_core_pm_ops,
+#endif
+
+	pm_runtime_allow(dev);
+	if (!disable_idle_d3)
+		pm_runtime_put(dev);
 
 	ret = vfio_register_group_dev(&vdev->vdev);
 	if (ret)
@@ -1905,7 +1935,9 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 
 out_power:
 	if (!disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D0);
+		pm_runtime_get_noresume(dev);
+
+	pm_runtime_forbid(dev);
 out_vf:
 	vfio_pci_vf_uninit(vdev);
 	return ret;
@@ -1922,7 +1954,9 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
 	vfio_pci_vga_uninit(vdev);
 
 	if (!disable_idle_d3)
-		vfio_pci_set_power_state(vdev, PCI_D0);
+		pm_runtime_get_noresume(&vdev->pdev->dev);
+
+	pm_runtime_forbid(&vdev->pdev->dev);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
 
@@ -1967,17 +2001,29 @@ int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
 
 		/*
 		 * The PF power state should always be higher than the VF power
-		 * state. If PF is in the low power state, then change the
-		 * power state to D0 first before enabling SR-IOV.
+		 * state. The PF can be in low power state either with runtime
+		 * power management (when there is no user) or PCI_PM_CTRL
+		 * register write by the user. If PF is in the low power state,
+		 * then change the power state to D0 first before enabling
+		 * SR-IOV.
 		 */
+		ret = pm_runtime_resume_and_get(&pdev->dev);
+		if (ret)
+			goto out_del;
+
 		vfio_pci_lock_and_set_power_state(vdev, PCI_D0);
 		ret = pci_enable_sriov(pdev, nr_virtfn);
-		if (ret)
+		if (ret) {
+			pm_runtime_put(&pdev->dev);
 			goto out_del;
+		}
 		return nr_virtfn;
 	}
 
-	pci_disable_sriov(pdev);
+	if (pci_num_vf(pdev)) {
+		pci_disable_sriov(pdev);
+		pm_runtime_put(&pdev->dev);
+	}
 
 out_del:
 	mutex_lock(&vfio_pci_sriov_pfs_mutex);
@@ -2052,6 +2098,27 @@ vfio_pci_dev_set_resettable(struct vfio_device_set *dev_set)
 	return pdev;
 }
 
+static int vfio_pci_dev_set_pm_runtime_get(struct vfio_device_set *dev_set)
+{
+	struct vfio_pci_core_device *cur;
+	int ret;
+
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
+		ret = pm_runtime_resume_and_get(&cur->pdev->dev);
+		if (ret)
+			goto unwind;
+	}
+
+	return 0;
+
+unwind:
+	list_for_each_entry_continue_reverse(cur, &dev_set->device_list,
+					     vdev.dev_set_list)
+		pm_runtime_put(&cur->pdev->dev);
+
+	return ret;
+}
+
 /*
  * We need to get memory_lock for each device, but devices can share mmap_lock,
  * therefore we need to zap and hold the vma_lock for each device, and only then
@@ -2158,43 +2225,38 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
  *  - At least one of the affected devices is marked dirty via
  *    needs_reset (such as by lack of FLR support)
  * Then attempt to perform that bus or slot reset.
- * Returns true if the dev_set was reset.
  */
-static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
+static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
 {
 	struct vfio_pci_core_device *cur;
 	struct pci_dev *pdev;
-	int ret;
+	bool reset_done = false;
 
 	if (!vfio_pci_dev_set_needs_reset(dev_set))
-		return false;
+		return;
 
 	pdev = vfio_pci_dev_set_resettable(dev_set);
 	if (!pdev)
-		return false;
+		return;
 
 	/*
-	 * The pci_reset_bus() will reset all the devices in the bus.
-	 * The power state can be non-D0 for some of the devices in the bus.
-	 * For these devices, the pci_reset_bus() will internally set
-	 * the power state to D0 without vfio driver involvement.
-	 * For the devices which have NoSoftRst-, the reset function can
-	 * cause the PCI config space reset without restoring the original
-	 * state (saved locally in 'vdev->pm_save').
+	 * Some of the devices in the bus can be in the runtime suspended
+	 * state. Increment the usage count for all the devices in the dev_set
+	 * before reset and decrement the same after reset.
 	 */
-	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
-		vfio_pci_set_power_state(cur, PCI_D0);
+	if (!disable_idle_d3 && vfio_pci_dev_set_pm_runtime_get(dev_set))
+		return;
 
-	ret = pci_reset_bus(pdev);
-	if (ret)
-		return false;
+	if (!pci_reset_bus(pdev))
+		reset_done = true;
 
 	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
-		cur->needs_reset = false;
+		if (reset_done)
+			cur->needs_reset = false;
+
 		if (!disable_idle_d3)
-			vfio_pci_set_power_state(cur, PCI_D3hot);
+			pm_runtime_put(&cur->pdev->dev);
 	}
-	return true;
 }
 
 void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 2/4] vfio/pci: Change the PF power state to D0 before enabling VFs
  2022-05-17 10:02 ` [PATCH v4 2/4] vfio/pci: Change the PF power state to D0 before enabling VFs Abhishek Sahu
@ 2022-05-17 18:27   ` Alex Williamson
  2022-05-18  9:56     ` Abhishek Sahu
  0 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2022-05-17 18:27 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, 17 May 2022 15:32:17 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> According to [PCIe v5 9.6.2] for PF Device Power Management States
> 
>  "The PF's power management state (D-state) has global impact on its
>   associated VFs. If a VF does not implement the Power Management
>   Capability, then it behaves as if it is in an equivalent
>   power state of its associated PF.
> 
>   If a VF implements the Power Management Capability, the Device behavior
>   is undefined if the PF is placed in a lower power state than the VF.
>   Software should avoid this situation by placing all VFs in lower power
>   state before lowering their associated PF's power state."
> 
> From the vfio driver side, user can enable SR-IOV when the PF is in D3hot
> state. If VF does not implement the Power Management Capability, then
> the VF will be actually in D3hot state and then the VF BAR access will
> fail. If VF implements the Power Management Capability, then VF will
> assume that its current power state is D0 when the PF is D3hot and
> in this case, the behavior is undefined.
> 
> To support PF power management, we need to create power management
> dependency between PF and its VF's. The runtime power management support
> may help with this where power management dependencies are supported
> through device links. But till we have such support in place, we can
> disallow the PF to go into low power state, if PF has VF enabled.
> There can be a case, where user first enables the VF's and then
> disables the VF's. If there is no user of PF, then the PF can put into
> D3hot state again. But with this patch, the PF will still be in D0
> state after disabling VF's since detecting this case inside
> vfio_pci_core_sriov_configure() requires access to
> struct vfio_device::open_count along with its locks. But the subsequent
> patches related to runtime PM will handle this case since runtime PM
> maintains its own usage count.
> 
> Also, vfio_pci_core_sriov_configure() can be called at any time
> (with and without vfio pci device user), so the power state change
> needs to be protected with the required locks.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index b9f222ca48cf..4fe9a4efc751 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -217,6 +217,10 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	bool needs_restore = false, needs_save = false;
>  	int ret;
>  
> +	/* Prevent changing power state for PFs with VFs enabled */
> +	if (pci_num_vf(pdev) && state > PCI_D0)
> +		return -EBUSY;
> +
>  	if (vdev->needs_pm_restore) {
>  		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
>  			pci_save_state(pdev);
> @@ -1960,6 +1964,13 @@ int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
>  		}
>  		list_add_tail(&vdev->sriov_pfs_item, &vfio_pci_sriov_pfs);
>  		mutex_unlock(&vfio_pci_sriov_pfs_mutex);
> +
> +		/*
> +		 * The PF power state should always be higher than the VF power
> +		 * state. If PF is in the low power state, then change the
> +		 * power state to D0 first before enabling SR-IOV.
> +		 */
> +		vfio_pci_lock_and_set_power_state(vdev, PCI_D0);

But we need to hold memory_lock across the next function or else
userspace could race a write to the PM register to set D3 before
pci_num_vf() can protect us.  Thanks,

Alex

>  		ret = pci_enable_sriov(pdev, nr_virtfn);
>  		if (ret)
>  			goto out_del;


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 4/4] vfio/pci: Move the unused device into low power state with runtime PM
  2022-05-17 10:02 ` [PATCH v4 4/4] vfio/pci: Move the unused device into low power state with runtime PM Abhishek Sahu
@ 2022-05-17 20:02   ` Alex Williamson
  2022-05-18 10:06     ` Abhishek Sahu
  2022-05-17 20:42   ` Alex Williamson
  1 sibling, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2022-05-17 20:02 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, 17 May 2022 15:32:19 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 4fe9a4efc751..5ea1b3099036 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -156,7 +156,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>  }
>  
>  struct vfio_pci_group_info;
> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  				      struct vfio_pci_group_info *groups);
>  
> @@ -275,6 +275,19 @@ void vfio_pci_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
>  	up_write(&vdev->memory_lock);
>  }
>  
> +#ifdef CONFIG_PM

Neither of the CONFIG_PM checks added are actually needed afaict, both
struct dev_pm_ops and the pm pointer on struct device_driver are defined
regardless.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 4/4] vfio/pci: Move the unused device into low power state with runtime PM
  2022-05-17 10:02 ` [PATCH v4 4/4] vfio/pci: Move the unused device into low power state with runtime PM Abhishek Sahu
  2022-05-17 20:02   ` Alex Williamson
@ 2022-05-17 20:42   ` Alex Williamson
  2022-05-17 20:55     ` Alex Williamson
  1 sibling, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2022-05-17 20:42 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, 17 May 2022 15:32:19 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 5. Since the runtime PM framework will provide the same functionality,
>    so directly writing into PCI PM config register can be replaced with
>    the use of runtime PM routines. Also, the use of runtime PM can help
>    us in more power saving.
> 
>    In the systems which do not support D3cold,
> 
>    With the existing implementation:
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3hot
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D0
> 
>    With runtime PM:
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3hot
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D3hot

I'm not able to reproduce these results.  Output below abridged:

# lspci -t
-[0000:00]-+-00.0
           +-01.0-[01]--+-00.0
           |            \-00.1

# grep . /sys/bus/pci/devices/*/power_state
/sys/bus/pci/devices/0000:00:01.0/power_state:D0
/sys/bus/pci/devices/0000:01:00.0/power_state:D3hot
/sys/bus/pci/devices/0000:01:00.1/power_state:D3hot

# lspci -ks $DEV
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
	Kernel driver in use: pcieport
01:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750] (rev a2)
	Subsystem: eVga.com. Corp. Device 2753
	Kernel driver in use: vfio-pci
01:00.1 Audio device: NVIDIA Corporation GM107 High Definition Audio Controller [GeForce 940MX] (rev a1)
	Subsystem: eVga.com. Corp. Device 2753
	Kernel driver in use: vfio-pci
	Kernel modules: snd_hda_intel

Any debugging suggestions?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 4/4] vfio/pci: Move the unused device into low power state with runtime PM
  2022-05-17 20:42   ` Alex Williamson
@ 2022-05-17 20:55     ` Alex Williamson
  0 siblings, 0 replies; 12+ messages in thread
From: Alex Williamson @ 2022-05-17 20:55 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Tue, 17 May 2022 14:42:56 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 17 May 2022 15:32:19 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> > 5. Since the runtime PM framework will provide the same functionality,
> >    so directly writing into PCI PM config register can be replaced with
> >    the use of runtime PM routines. Also, the use of runtime PM can help
> >    us in more power saving.
> > 
> >    In the systems which do not support D3cold,
> > 
> >    With the existing implementation:
> > 
> >    // PCI device
> >    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
> >    D3hot
> >    // upstream bridge
> >    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
> >    D0
> > 
> >    With runtime PM:
> > 
> >    // PCI device
> >    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
> >    D3hot
> >    // upstream bridge
> >    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
> >    D3hot  
> 
> I'm not able to reproduce these results.  Output below abridged:
> 
> # lspci -t
> -[0000:00]-+-00.0
>            +-01.0-[01]--+-00.0
>            |            \-00.1
> 
> # grep . /sys/bus/pci/devices/*/power_state
> /sys/bus/pci/devices/0000:00:01.0/power_state:D0
> /sys/bus/pci/devices/0000:01:00.0/power_state:D3hot
> /sys/bus/pci/devices/0000:01:00.1/power_state:D3hot
> 
> # lspci -ks $DEV
> 00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
> 	Kernel driver in use: pcieport
> 01:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750] (rev a2)
> 	Subsystem: eVga.com. Corp. Device 2753
> 	Kernel driver in use: vfio-pci
> 01:00.1 Audio device: NVIDIA Corporation GM107 High Definition Audio Controller [GeForce 940MX] (rev a1)
> 	Subsystem: eVga.com. Corp. Device 2753
> 	Kernel driver in use: vfio-pci
> 	Kernel modules: snd_hda_intel
> 
> Any debugging suggestions?  Thanks,

Nevermind, I see a whole bunch of reasons in pci_bridge_d3_possible()
that runtime-pm wouldn't support D3hot on this bridge/system.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 1/4] vfio/pci: Invalidate mmaps and block the access in D3hot power state
  2022-05-17 10:02 ` [PATCH v4 1/4] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
@ 2022-05-17 23:30   ` kernel test robot
  0 siblings, 0 replies; 12+ messages in thread
From: kernel test robot @ 2022-05-17 23:30 UTC (permalink / raw)
  To: Abhishek Sahu, Alex Williamson, Cornelia Huck, Yishai Hadas,
	Jason Gunthorpe, Shameer Kolothum, Kevin Tian,
	Rafael J . Wysocki
  Cc: kbuild-all, Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm,
	linux-pm, linux-pci, Abhishek Sahu

Hi Abhishek,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on awilliam-vfio/next]
[also build test WARNING on v5.18-rc7]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Abhishek-Sahu/vfio-pci-power-management-changes/20220517-180527
base:   https://github.com/awilliam/linux-vfio.git next
config: x86_64-randconfig-s021-20220516 (https://download.01.org/0day-ci/archive/20220518/202205180721.m3Z4ar57-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.2.0-20) 11.2.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/intel-lab-lkp/linux/commit/2c439fb0dc917fb8bcf3cf432c8d73d51cfb16b0
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Abhishek-Sahu/vfio-pci-power-management-changes/20220517-180527
        git checkout 2c439fb0dc917fb8bcf3cf432c8d73d51cfb16b0
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash drivers/vfio/pci/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
>> drivers/vfio/pci/vfio_pci_config.c:411:20: sparse: sparse: restricted pci_power_t degrades to integer
   drivers/vfio/pci/vfio_pci_config.c:411:38: sparse: sparse: restricted pci_power_t degrades to integer

vim +411 drivers/vfio/pci/vfio_pci_config.c

   397	
   398	/* Caller should hold memory_lock semaphore */
   399	bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
   400	{
   401		struct pci_dev *pdev = vdev->pdev;
   402		u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);
   403	
   404		/*
   405		 * Memory region cannot be accessed if device power state is D3.
   406		 *
   407		 * SR-IOV VF memory enable is handled by the MSE bit in the
   408		 * PF SR-IOV capability, there's therefore no need to trigger
   409		 * faults based on the virtual value.
   410		 */
 > 411		return pdev->current_state < PCI_D3hot &&
   412		       (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY));
   413	}
   414	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 2/4] vfio/pci: Change the PF power state to D0 before enabling VFs
  2022-05-17 18:27   ` Alex Williamson
@ 2022-05-18  9:56     ` Abhishek Sahu
  0 siblings, 0 replies; 12+ messages in thread
From: Abhishek Sahu @ 2022-05-18  9:56 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 5/17/2022 11:57 PM, Alex Williamson wrote:
> On Tue, 17 May 2022 15:32:17 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> According to [PCIe v5 9.6.2] for PF Device Power Management States
>>
>>  "The PF's power management state (D-state) has global impact on its
>>   associated VFs. If a VF does not implement the Power Management
>>   Capability, then it behaves as if it is in an equivalent
>>   power state of its associated PF.
>>
>>   If a VF implements the Power Management Capability, the Device behavior
>>   is undefined if the PF is placed in a lower power state than the VF.
>>   Software should avoid this situation by placing all VFs in lower power
>>   state before lowering their associated PF's power state."
>>
>> From the vfio driver side, user can enable SR-IOV when the PF is in D3hot
>> state. If VF does not implement the Power Management Capability, then
>> the VF will be actually in D3hot state and then the VF BAR access will
>> fail. If VF implements the Power Management Capability, then VF will
>> assume that its current power state is D0 when the PF is D3hot and
>> in this case, the behavior is undefined.
>>
>> To support PF power management, we need to create power management
>> dependency between PF and its VF's. The runtime power management support
>> may help with this where power management dependencies are supported
>> through device links. But till we have such support in place, we can
>> disallow the PF to go into low power state, if PF has VF enabled.
>> There can be a case, where user first enables the VF's and then
>> disables the VF's. If there is no user of PF, then the PF can put into
>> D3hot state again. But with this patch, the PF will still be in D0
>> state after disabling VF's since detecting this case inside
>> vfio_pci_core_sriov_configure() requires access to
>> struct vfio_device::open_count along with its locks. But the subsequent
>> patches related to runtime PM will handle this case since runtime PM
>> maintains its own usage count.
>>
>> Also, vfio_pci_core_sriov_configure() can be called at any time
>> (with and without vfio pci device user), so the power state change
>> needs to be protected with the required locks.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_core.c | 11 +++++++++++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index b9f222ca48cf..4fe9a4efc751 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -217,6 +217,10 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>  	bool needs_restore = false, needs_save = false;
>>  	int ret;
>>  
>> +	/* Prevent changing power state for PFs with VFs enabled */
>> +	if (pci_num_vf(pdev) && state > PCI_D0)
>> +		return -EBUSY;
>> +
>>  	if (vdev->needs_pm_restore) {
>>  		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
>>  			pci_save_state(pdev);
>> @@ -1960,6 +1964,13 @@ int vfio_pci_core_sriov_configure(struct vfio_pci_core_device *vdev,
>>  		}
>>  		list_add_tail(&vdev->sriov_pfs_item, &vfio_pci_sriov_pfs);
>>  		mutex_unlock(&vfio_pci_sriov_pfs_mutex);
>> +
>> +		/*
>> +		 * The PF power state should always be higher than the VF power
>> +		 * state. If PF is in the low power state, then change the
>> +		 * power state to D0 first before enabling SR-IOV.
>> +		 */
>> +		vfio_pci_lock_and_set_power_state(vdev, PCI_D0);
> 
> But we need to hold memory_lock across the next function or else
> userspace could race a write to the PM register to set D3 before
> pci_num_vf() can protect us.  Thanks,
> 
> Alex
> 

 Thanks Alex.
 Yes. We need to bring pci_enable_sriov() also to protect this race
 condition. I will update this in my next version.
 
 Regards,
 Abhishek

>>  		ret = pci_enable_sriov(pdev, nr_virtfn);
>>  		if (ret)
>>  			goto out_del;
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v4 4/4] vfio/pci: Move the unused device into low power state with runtime PM
  2022-05-17 20:02   ` Alex Williamson
@ 2022-05-18 10:06     ` Abhishek Sahu
  0 siblings, 0 replies; 12+ messages in thread
From: Abhishek Sahu @ 2022-05-18 10:06 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 5/18/2022 1:32 AM, Alex Williamson wrote:
> On Tue, 17 May 2022 15:32:19 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 4fe9a4efc751..5ea1b3099036 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -156,7 +156,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>>  }
>>  
>>  struct vfio_pci_group_info;
>> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>  				      struct vfio_pci_group_info *groups);
>>  
>> @@ -275,6 +275,19 @@ void vfio_pci_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
>>  	up_write(&vdev->memory_lock);
>>  }
>>  
>> +#ifdef CONFIG_PM
> 
> Neither of the CONFIG_PM checks added are actually needed afaict, both
> struct dev_pm_ops and the pm pointer on struct device_driver are defined
> regardless.  Thanks,
> 
> Alex
> 

 Yes. These are not needed for build.
 I will remove these explicit CONFIG_PM checks.

 Regards,
 Abhishek

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-05-18 10:07 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-17 10:02 [PATCH v4 0/4] vfio/pci: power management changes Abhishek Sahu
2022-05-17 10:02 ` [PATCH v4 1/4] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
2022-05-17 23:30   ` kernel test robot
2022-05-17 10:02 ` [PATCH v4 2/4] vfio/pci: Change the PF power state to D0 before enabling VFs Abhishek Sahu
2022-05-17 18:27   ` Alex Williamson
2022-05-18  9:56     ` Abhishek Sahu
2022-05-17 10:02 ` [PATCH v4 3/4] vfio/pci: Virtualize PME related registers bits and initialize to zero Abhishek Sahu
2022-05-17 10:02 ` [PATCH v4 4/4] vfio/pci: Move the unused device into low power state with runtime PM Abhishek Sahu
2022-05-17 20:02   ` Alex Williamson
2022-05-18 10:06     ` Abhishek Sahu
2022-05-17 20:42   ` Alex Williamson
2022-05-17 20:55     ` Alex Williamson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).