* [PATCH v4 0/6] vfio/pci: power management changes
@ 2022-07-01 11:08 Abhishek Sahu
  2022-07-01 11:08 ` [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
                   ` (5 more replies)
  0 siblings, 6 replies; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-01 11:08 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

This is part 2 of the vfio-pci driver power management support.
Part 1 of this patch series added D3cold support for the case where
there is no user of the VFIO device, and has already been merged into
the mainline kernel. If runtime power management is enabled for a
vfio-pci device in the guest OS, then the device is runtime suspended
(for a Linux guest OS) and the PCI device is put into the D3hot state
(in vfio_pm_config_write()). If the D3cold state can be used instead
of D3hot, it will save the maximum amount of power. The D3cold state
cannot be reached with native PCI PM alone; it requires interaction
with platform firmware, which is system-specific. To go into low power
states (including D3cold), the runtime PM framework can be used, which
internally interacts with the PCI core and platform firmware and puts
the device into the lowest possible D-state.

This patch series adds support for runtime power management initiated
by the user. Since the D3cold state can't be reached by writing the
standard PCI PM config registers, a feature has been added to the
DEVICE_FEATURE IOCTL for power management handling. It includes flags
that can be used to move the device into and out of a low power state.
For a PCI device, this low power state will be D3cold (if the platform
supports it). Hypervisors can implement virtual ACPI methods for
integration with the guest OS. For example, in a guest Linux OS, if
the PCI device's ACPI node has _PR3 and _PR0 power resources with
_ON/_OFF methods, then the guest makes the _OFF call during the D3cold
transition and the _ON call during the D0 transition. The hypervisor
can trap these virtual ACPI calls and then issue the low power related
IOCTL.

Some devices (like NVIDIA VGA or 3D controllers) require driver
involvement each time before going into D3cold. Once the guest has put
the device into D3cold, the user can run commands on the host side
(like lspci). The runtime PM framework will resume the device before
it is accessed and suspend it again afterwards. This second suspend
happens without guest driver involvement, so the vfio-pci driver won't
suspend the device if re-entry to low power is not allowed. This patch
series also adds virtual PME (Power Management Event) support, which
can be used to notify the guest OS of such host accesses. The guest
can then put the device into the suspended state again.

* Changes in v4

- Rebased patches on v5.19-rc4.
- Added virtual PME support.
- Used flags for low power entry and exit instead of an explicit variable.
- Added support to keep NVIDIA display related controllers in the active
  state if there is any activity on the host side.
- Added a flag that can be set by the user to keep the device in the
  active state if there is any activity on the host side.
- Split the D3cold patch into smaller patches.
- Kept the runtime PM usage count incremented for all the IOCTLs
  (except the power management IOCTL) and all PCI region accesses.
- Masked the runtime errors behind -EIO.
- Refactored the logic in the runtime suspend/resume routines and the
  power management device feature IOCTL.
- Added a helper function for pm_runtime_put() in drivers/vfio/vfio.c
  as well, using 'struct vfio_device' as the function parameter.
- Removed the requirement to move the device into D3hot before calling
  low power entry.
- Renamed power management related new members in the structure.
- Used 'pm_runtime_engaged' check in __vfio_pci_memory_enabled().

* Changes in v3
  (https://lore.kernel.org/lkml/20220425092615.10133-1-abhsahu@nvidia.com)

- Rebased patches on v5.18-rc3.
- Marked this series as PATCH instead of RFC.
- Addressed the review comments given in v2.
- Removed the limitation that kept the device in the D0 state if there
  was any access from the host side. This is specific to the NVIDIA use
  case and will be handled separately.
- Used the existing DEVICE_FEATURE IOCTL itself instead of adding a new
  IOCTL for power management.
- Removed all custom power management code from the runtime
  suspend/resume callbacks and IOCTL handling. Now, the callbacks
  contain only the INTx handling and a few other things, and all the
  PCI state and platform PM handling is done by the PCI core
  functions themselves.
- Added wake-up support in the main vfio layer itself since we now have
  more vfio/pci based drivers.
- Instead of assigning the 'struct dev_pm_ops' in each individual parent
  driver, the vfio_pci_core itself now assigns the 'struct dev_pm_ops'.
- Added handling of power management around SR-IOV handling.
- Moved the setting of drvdata in a separate patch.
- Masked INTx before going into the runtime suspended state.
- Changed the order of patches so that fixes are at the beginning
  of this patch series.
- Removed storing the power state locally and used one new boolean to
  track the d3 (D3cold and D3hot) power state.
- Removed check for IO access in D3 power state.
- Used another helper function vfio_lock_and_set_power_state() instead
  of touching vfio_pci_set_power_state().
- Considered the fixes made in
  https://lore.kernel.org/lkml/20220217122107.22434-1-abhsahu@nvidia.com
  and updated the patches accordingly.

* Changes in v2
  (https://lore.kernel.org/lkml/20220124181726.19174-1-abhsahu@nvidia.com)

- Rebased patches on v5.17-rc1.
- Included the patch to handle BAR access in D3cold.
- Included the patch to fix memory leak.
- Made a separate IOCTL that can be used to change the power state from
  D3hot to D3cold and D3cold to D0.
- Addressed the review comments given in v1.

Abhishek Sahu (6):
  vfio/pci: Mask INTx during runtime suspend
  vfio: Add a new device feature for the power management
  vfio: Increment the runtime PM usage count during IOCTL call
  vfio/pci: Add the support for PCI D3cold state
  vfio/pci: Prevent low power re-entry without guest driver
  vfio/pci: Add support for virtual PME

 drivers/vfio/pci/vfio_pci_config.c |  41 +++-
 drivers/vfio/pci/vfio_pci_core.c   | 312 +++++++++++++++++++++++++++--
 drivers/vfio/pci/vfio_pci_intrs.c  |  24 ++-
 drivers/vfio/vfio.c                |  82 +++++++-
 include/linux/vfio_pci_core.h      |   8 +-
 include/uapi/linux/vfio.h          |  56 ++++++
 6 files changed, 492 insertions(+), 31 deletions(-)

-- 
2.17.1



* [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend
  2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
@ 2022-07-01 11:08 ` Abhishek Sahu
  2022-07-06 15:39   ` Alex Williamson
  2022-07-01 11:08 ` [PATCH v4 2/6] vfio: Add a new device feature for the power management Abhishek Sahu
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-01 11:08 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

This patch adds INTx handling during runtime suspend/resume.
All the suspend/resume related code for the user to put the device
into the low power state will be added in subsequent patches.

INTx can be shared among devices. Whenever an INTx interrupt arrives
for a VFIO device, vfio_intx_handler() is called for each device
sharing the line. Inside vfio_intx_handler(), pci_check_and_mask_intx()
is called to check whether the interrupt was generated by the current
device. Now, if the device is already in the D3cold state, the config
space cannot be read; attempting to read it in D3cold can make some
systems unresponsive. To prevent this, mask INTx in the runtime suspend
callback and unmask it in the runtime resume callback. If INTx has
already been masked, then no handling is needed in the runtime
suspend/resume callbacks. 'pm_intx_masked' tracks this, and
vfio_pci_intx_mask() has been updated to return true if INTx has been
masked inside this function.

For a runtime suspend triggered when there is no user of the VFIO
device, is_intx() will return false and these callbacks won't do
anything.

MSI/MSI-X interrupts are not shared, so similar handling is not needed
for them. vfio_msihandler() triggers eventfd_signal() without doing any
device-specific config access, and when the user performs any config
access or IOCTL after receiving the eventfd notification, the device
will be moved to the D0 state first before servicing any request.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c  | 37 +++++++++++++++++++++++++++----
 drivers/vfio/pci/vfio_pci_intrs.c |  6 ++++-
 include/linux/vfio_pci_core.h     |  3 ++-
 3 files changed, 40 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index a0d69ddaf90d..5948d930449b 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
 	return ret;
 }
 
+#ifdef CONFIG_PM
+static int vfio_pci_core_runtime_suspend(struct device *dev)
+{
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
+
+	/*
+	 * If INTx is enabled, then mask INTx before going into the runtime
+	 * suspended state and unmask the same in the runtime resume.
+	 * If INTx has already been masked by the user, then
+	 * vfio_pci_intx_mask() will return false and in that case, INTx
+	 * should not be unmasked in the runtime resume.
+	 */
+	vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev));
+
+	return 0;
+}
+
+static int vfio_pci_core_runtime_resume(struct device *dev)
+{
+	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
+
+	if (vdev->pm_intx_masked)
+		vfio_pci_intx_unmask(vdev);
+
+	return 0;
+}
+#endif /* CONFIG_PM */
+
 /*
- * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
- * so use structure without any callbacks.
- *
  * The pci-driver core runtime PM routines always save the device state
  * before going into suspended state. If the device is going into low power
  * state with only with runtime PM ops, then no explicit handling is needed
  * for the devices which have NoSoftRst-.
  */
-static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
+static const struct dev_pm_ops vfio_pci_core_pm_ops = {
+	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
+			   vfio_pci_core_runtime_resume,
+			   NULL)
+};
 
 int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 {
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 6069a11fb51a..1a37db99df48 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
 		eventfd_signal(vdev->ctx[0].trigger, 1);
 }
 
-void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
+/* Returns true if INTx has been masked by this function. */
+bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
 	unsigned long flags;
+	bool intx_masked = false;
 
 	spin_lock_irqsave(&vdev->irqlock, flags);
 
@@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
 			disable_irq_nosync(pdev->irq);
 
 		vdev->ctx[0].masked = true;
+		intx_masked = true;
 	}
 
 	spin_unlock_irqrestore(&vdev->irqlock, flags);
+	return intx_masked;
 }
 
 /*
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 23c176d4b073..cdfd328ba6b1 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -124,6 +124,7 @@ struct vfio_pci_core_device {
 	bool			needs_reset;
 	bool			nointx;
 	bool			needs_pm_restore;
+	bool			pm_intx_masked;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
@@ -147,7 +148,7 @@ struct vfio_pci_core_device {
 #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
 #define irq_is(vdev, type) (vdev->irq_type == type)
 
-extern void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
+extern bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
 extern void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
 
 extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev,
-- 
2.17.1



* [PATCH v4 2/6] vfio: Add a new device feature for the power management
  2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
  2022-07-01 11:08 ` [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
@ 2022-07-01 11:08 ` Abhishek Sahu
  2022-07-06 15:39   ` Alex Williamson
  2022-07-01 11:08 ` [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-01 11:08 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
for power management in the header file. The implementation for the
same will be added in subsequent patches.

Not all power states can be reached with the standard PCI PM registers.
Platform-based power management needs to be involved to go into the
lowest power state, and this device feature can be used for all
platform-based power management.

This device feature uses flags to specify the different operations. In
the future, if more power management functionality is needed, then a
new flag can be added to it. It supports both GET and SET operations.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 733a1cddde30..7e00de5c21ea 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -986,6 +986,61 @@ enum vfio_device_mig_state {
 	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
 };
 
+/*
+ * Perform power management-related operations for the VFIO device.
+ *
+ * The low power feature uses platform-based power management to move the
+ * device into the low power state.  This low power state is device-specific.
+ *
+ * This device feature uses flags to specify the different operations.
+ * It supports both the GET and SET operations.
+ *
+ * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
+ *   state with platform-based power management.  This low power state will be
+ *   internal to the VFIO driver and the user will not come to know which power
+ *   state is chosen.  Once the user has moved the VFIO device into the low
+ *   power state, then the user should not do any device access without moving
+ *   the device out of the low power state.
+ *
+ * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
+ *    state.  This flag should only be set if the user has previously put the
+ *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.
+ *
+ * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
+ *
+ * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
+ *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
+ *   the host side, then the device will be moved out of the low power state
+ *   without the user's guest driver involvement.  Some devices require the
+ *   user's guest driver involvement for each low-power entry.  If this flag is
+ *   set, then the re-entry to the low power state will be disabled, and the
+ *   host kernel will not move the device again into the low power state.
+ *   The VFIO driver internally maintains a list of devices for which low
+ *   power re-entry is disabled by default and for those devices, the
+ *   re-entry will be disabled even if the user has not set this flag
+ *   explicitly.
+ *
+ * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
+ *
+ * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
+ *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
+ *
+ * - If the device is in a normal power state currently, then
+ *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
+ *   power re-entry is disabled by default.  If the device is in the low power
+ *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
+ *   according to the current transition.
+ */
+struct vfio_device_feature_power_management {
+	__u32	flags;
+#define VFIO_PM_LOW_POWER_ENTER			(1 << 0)
+#define VFIO_PM_LOW_POWER_EXIT			(1 << 1)
+#define VFIO_PM_LOW_POWER_REENTERY_DISABLE	(1 << 2)
+	__u32	reserved;
+};
+
+#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**
-- 
2.17.1



* [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call
  2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
  2022-07-01 11:08 ` [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
  2022-07-01 11:08 ` [PATCH v4 2/6] vfio: Add a new device feature for the power management Abhishek Sahu
@ 2022-07-01 11:08 ` Abhishek Sahu
  2022-07-06 15:40   ` Alex Williamson
  2022-07-01 11:08 ` [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-01 11:08 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

The vfio-pci based drivers will have runtime power management support,
where the user can put the device into the low power state and PCI
devices can then go into the D3cold state. If the device is in the low
power state and the user issues any IOCTL, then the device should be
moved out of the low power state first. Once the IOCTL has been
serviced, it can go into the low power state again. The runtime PM
framework manages this with the help of a usage count.

One option was to add the runtime PM related API calls inside the
vfio-pci driver, but some IOCTLs (like VFIO_DEVICE_FEATURE) can follow
a different path and more IOCTLs may be added in the future. Also,
runtime PM is currently being added for the vfio-pci based driver
variants, but other VFIO based drivers can use the same support in the
future. So, this patch adds the runtime PM related API calls in the
top-level IOCTL function itself.

For the VFIO drivers which do not currently have runtime power
management support, the runtime PM APIs won't be invoked. Only for
vfio-pci based drivers will the runtime PM APIs be invoked to increment
and decrement the usage count.

Keeping this usage count incremented while servicing an IOCTL makes
sure that the user can't put the device into the low power state while
any other IOCTL is being serviced in parallel. Let's consider the
following scenario:

 1. Some other IOCTL is called.
 2. The user has opened another device instance and called the power
    management IOCTL for the low power entry.
 3. The power management IOCTL moves the device into the low power state.
 4. The other IOCTL finishes.

If we don't keep the usage count incremented, then device access will
happen between steps 3 and 4 while the device has already gone into
the low power state.

The runtime PM APIs should not be invoked for
VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
the runtime power management entry and exit for the VFIO device.

The pm_runtime_resume_and_get() will be the first call, so its errors
should not be propagated to user space directly. For example,
pm_runtime_resume_and_get() can return -EINVAL even for cases where the
user has passed correct arguments. So the pm_runtime_resume_and_get()
errors have been masked behind -EIO.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/vfio.c | 82 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 74 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 61e71c1154be..61a8d9f7629a 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -32,6 +32,7 @@
 #include <linux/vfio.h>
 #include <linux/wait.h>
 #include <linux/sched/signal.h>
+#include <linux/pm_runtime.h>
 #include "vfio.h"
 
 #define DRIVER_VERSION	"0.3"
@@ -1333,6 +1334,39 @@ static const struct file_operations vfio_group_fops = {
 	.release	= vfio_group_fops_release,
 };
 
+/*
+ * Wrapper around pm_runtime_resume_and_get().
+ * Return error code on failure or 0 on success.
+ */
+static inline int vfio_device_pm_runtime_get(struct vfio_device *device)
+{
+	struct device *dev = device->dev;
+
+	if (dev->driver && dev->driver->pm) {
+		int ret;
+
+		ret = pm_runtime_resume_and_get(dev);
+		if (ret < 0) {
+			dev_info_ratelimited(dev,
+				"vfio: runtime resume failed %d\n", ret);
+			return -EIO;
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Wrapper around pm_runtime_put().
+ */
+static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
+{
+	struct device *dev = device->dev;
+
+	if (dev->driver && dev->driver->pm)
+		pm_runtime_put(dev);
+}
+
 /*
  * VFIO Device fd
  */
@@ -1607,6 +1641,8 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
 {
 	size_t minsz = offsetofend(struct vfio_device_feature, flags);
 	struct vfio_device_feature feature;
+	int ret = 0;
+	u16 feature_cmd;
 
 	if (copy_from_user(&feature, arg, minsz))
 		return -EFAULT;
@@ -1626,28 +1662,51 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
 	    (feature.flags & VFIO_DEVICE_FEATURE_GET))
 		return -EINVAL;
 
-	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
+	feature_cmd = feature.flags & VFIO_DEVICE_FEATURE_MASK;
+
+	/*
+	 * The VFIO_DEVICE_FEATURE_POWER_MANAGEMENT itself performs the runtime
+	 * power management entry and exit for the VFIO device, so the runtime
+	 * PM API's should not be called for this feature.
+	 */
+	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT) {
+		ret = vfio_device_pm_runtime_get(device);
+		if (ret)
+			return ret;
+	}
+
+	switch (feature_cmd) {
 	case VFIO_DEVICE_FEATURE_MIGRATION:
-		return vfio_ioctl_device_feature_migration(
+		ret = vfio_ioctl_device_feature_migration(
 			device, feature.flags, arg->data,
 			feature.argsz - minsz);
+		break;
 	case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE:
-		return vfio_ioctl_device_feature_mig_device_state(
+		ret = vfio_ioctl_device_feature_mig_device_state(
 			device, feature.flags, arg->data,
 			feature.argsz - minsz);
+		break;
 	default:
 		if (unlikely(!device->ops->device_feature))
-			return -EINVAL;
-		return device->ops->device_feature(device, feature.flags,
-						   arg->data,
-						   feature.argsz - minsz);
+			ret = -EINVAL;
+		else
+			ret = device->ops->device_feature(
+				device, feature.flags, arg->data,
+				feature.argsz - minsz);
+		break;
 	}
+
+	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT)
+		vfio_device_pm_runtime_put(device);
+
+	return ret;
 }
 
 static long vfio_device_fops_unl_ioctl(struct file *filep,
 				       unsigned int cmd, unsigned long arg)
 {
 	struct vfio_device *device = filep->private_data;
+	int ret;
 
 	switch (cmd) {
 	case VFIO_DEVICE_FEATURE:
@@ -1655,7 +1714,14 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
 	default:
 		if (unlikely(!device->ops->ioctl))
 			return -EINVAL;
-		return device->ops->ioctl(device, cmd, arg);
+
+		ret = vfio_device_pm_runtime_get(device);
+		if (ret)
+			return ret;
+
+		ret = device->ops->ioctl(device, cmd, arg);
+		vfio_device_pm_runtime_put(device);
+		return ret;
 	}
 }
 
-- 
2.17.1



* [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state
  2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
                   ` (2 preceding siblings ...)
  2022-07-01 11:08 ` [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
@ 2022-07-01 11:08 ` Abhishek Sahu
  2022-07-06 15:40   ` Alex Williamson
  2022-07-01 11:08 ` [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver Abhishek Sahu
  2022-07-01 11:08 ` [PATCH v4 6/6] vfio/pci: Add support for virtual PME Abhishek Sahu
  5 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-01 11:08 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

Currently, if runtime power management is enabled for a vfio-pci based
device in the guest OS, then the guest OS will write the PCI_PM_CTRL
register. This write request is handled in vfio_pm_config_write(),
where the actual register write of PCI_PM_CTRL is done. With this, at
most the D3hot state can be reached for low power. If we can use the
runtime PM framework, then we can reach the D3cold state (on supported
systems), which will help in saving the maximum amount of power.

1. The D3cold state can't be reached by writing the PCI standard
   PM config registers. This patch implements the newly added
   'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' device feature, which
   can be used for putting the device into the D3cold state.

2. The hypervisors can implement virtual ACPI methods. For example,
   in a guest Linux OS, if the PCI device's ACPI node has _PR3 and _PR0
   power resources with _ON/_OFF methods, then the guest invokes the
   _OFF method during the D3cold transition and then _ON during the D0
   transition. The hypervisor can trap these virtual ACPI calls and
   then call 'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' with the
   respective flags.

3. The vfio-pci driver uses the runtime PM framework to reach the
   D3cold state. For the D3cold transition, decrement the usage count,
   and for the D0 transition, increment the usage count.

4. If the D3cold state is not supported, then the device will still go
   into the D3hot state. But with runtime PM, the root port can now
   also go into the suspended state.

5. The 'pm_runtime_engaged' flag tracks the entry to and exit from
   runtime PM. This flag is protected by the 'memory_lock' semaphore.

6. At exit time, the flag clearing and usage count increment are
   protected by 'memory_lock'. The actual wake-up happens outside
   'memory_lock' since 'memory_lock' will also be needed inside the
   runtime_resume callback in subsequent patches.

7. In D3cold, all kinds of device-related access (BAR read/write,
   config read/write, etc.) need to be disabled. For BAR-related
   access, the existing D3hot memory disable support can be used.
   During the low power entry, invalidate the mmap mappings and add a
   check for the 'pm_runtime_engaged' flag.

8. For config space, ideally we need to return an error whenever there
   is any config access from the user side once the user has moved the
   device into the low power state. But a check for the
   'pm_runtime_engaged' flag alone won't be sufficient due to the
   following possible scenario on the user side, where a config space
   access happens in parallel with the low power entry IOCTL:

   a. A config space access happens and vfio_pci_config_rw() is
      called.
   b. The IOCTL to move into the low power state is called.
   c. The IOCTL moves the device into D3cold.
   d. vfio_pci_config_rw() returns.

   Now, if we just check 'pm_runtime_engaged', then in the above
   sequence the config space access will happen while the device is
   already in the low power state. To prevent this situation, we
   increment the usage count before any config space access and
   decrement it after completing the access. Also, to prevent any
   similar cases for other types of access, the usage count will be
   incremented for all kinds of access.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c |   2 +-
 drivers/vfio/pci/vfio_pci_core.c   | 169 +++++++++++++++++++++++++++--
 include/linux/vfio_pci_core.h      |   1 +
 3 files changed, 164 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 9343f597182d..21a4743d011f 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -408,7 +408,7 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
 	 * PF SR-IOV capability, there's therefore no need to trigger
 	 * faults based on the virtual value.
 	 */
-	return pdev->current_state < PCI_D3hot &&
+	return !vdev->pm_runtime_engaged && pdev->current_state < PCI_D3hot &&
 	       (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY));
 }
 
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 5948d930449b..8c17ca41d156 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -264,6 +264,18 @@ static int vfio_pci_core_runtime_suspend(struct device *dev)
 {
 	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
 
+	down_write(&vdev->memory_lock);
+	/*
+	 * The user can move the device into D3hot state before invoking
+	 * power management IOCTL. Move the device into D0 state here and then
+	 * the pci-driver core runtime PM suspend function will move the device
+	 * into the low power state. Also, for the devices which have
+	 * NoSoftRst-, it will help in restoring the original state
+	 * (saved locally in 'vdev->pm_save').
+	 */
+	vfio_pci_set_power_state(vdev, PCI_D0);
+	up_write(&vdev->memory_lock);
+
 	/*
 	 * If INTx is enabled, then mask INTx before going into the runtime
 	 * suspended state and unmask the same in the runtime resume.
@@ -386,6 +398,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	struct pci_dev *pdev = vdev->pdev;
 	struct vfio_pci_dummy_resource *dummy_res, *tmp;
 	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
+	bool do_resume = false;
 	int i, bar;
 
 	/* For needs_reset */
@@ -393,6 +406,25 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 
 	/*
 	 * This function can be invoked while the power state is non-D0.
+	 * This non-D0 power state can be with or without runtime PM.
+	 * Increment the usage count corresponding to pm_runtime_put()
+	 * called during setting of 'pm_runtime_engaged'. The device will
+	 * wake up if it has already gone into the suspended state.
+	 * Otherwise, the next vfio_pci_set_power_state() will change the
+	 * device power state to D0.
+	 */
+	down_write(&vdev->memory_lock);
+	if (vdev->pm_runtime_engaged) {
+		vdev->pm_runtime_engaged = false;
+		pm_runtime_get_noresume(&pdev->dev);
+		do_resume = true;
+	}
+	up_write(&vdev->memory_lock);
+
+	if (do_resume)
+		pm_runtime_resume(&pdev->dev);
+
+	/*
 	 * This function calls __pci_reset_function_locked() which internally
 	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
 	 * fail if the power state is non-D0. Also, for the devices which
@@ -1190,6 +1222,99 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
 
+static int vfio_pci_pm_validate_flags(u32 flags)
+{
+	if (!flags)
+		return -EINVAL;
+	/* Only valid flags should be set */
+	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
+		return -EINVAL;
+	/* Both enter and exit should not be set */
+	if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
+	    (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
+		return -EINVAL;
+
+	return 0;
+}
+
+static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
+				    void __user *arg, size_t argsz)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	struct pci_dev *pdev = vdev->pdev;
+	struct vfio_device_feature_power_management vfio_pm = { 0 };
+	int ret = 0;
+
+	ret = vfio_check_feature(flags, argsz,
+				 VFIO_DEVICE_FEATURE_SET |
+				 VFIO_DEVICE_FEATURE_GET,
+				 sizeof(vfio_pm));
+	if (ret != 1)
+		return ret;
+
+	if (flags & VFIO_DEVICE_FEATURE_GET) {
+		down_read(&vdev->memory_lock);
+		if (vdev->pm_runtime_engaged)
+			vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
+		else
+			vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
+		up_read(&vdev->memory_lock);
+
+		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
+			return -EFAULT;
+
+		return 0;
+	}
+
+	if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm)))
+		return -EFAULT;
+
+	ret = vfio_pci_pm_validate_flags(vfio_pm.flags);
+	if (ret)
+		return ret;
+
+	/*
+	 * The vdev power related flags are protected with 'memory_lock'
+	 * semaphore.
+	 */
+	if (vfio_pm.flags & VFIO_PM_LOW_POWER_ENTER) {
+		vfio_pci_zap_and_down_write_memory_lock(vdev);
+		if (vdev->pm_runtime_engaged) {
+			up_write(&vdev->memory_lock);
+			return -EINVAL;
+		}
+
+		vdev->pm_runtime_engaged = true;
+		up_write(&vdev->memory_lock);
+		pm_runtime_put(&pdev->dev);
+	} else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
+		down_write(&vdev->memory_lock);
+		if (!vdev->pm_runtime_engaged) {
+			up_write(&vdev->memory_lock);
+			return -EINVAL;
+		}
+
+		vdev->pm_runtime_engaged = false;
+		pm_runtime_get_noresume(&pdev->dev);
+		up_write(&vdev->memory_lock);
+		ret = pm_runtime_resume(&pdev->dev);
+		if (ret < 0) {
+			down_write(&vdev->memory_lock);
+			if (!vdev->pm_runtime_engaged) {
+				vdev->pm_runtime_engaged = true;
+				pm_runtime_put_noidle(&pdev->dev);
+			}
+			up_write(&vdev->memory_lock);
+			return ret;
+		}
+	} else {
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
 				       void __user *arg, size_t argsz)
 {
@@ -1224,6 +1349,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
 	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
 	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
 		return vfio_pci_core_feature_token(device, flags, arg, argsz);
+	case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT:
+		return vfio_pci_core_feature_pm(device, flags, arg, argsz);
 	default:
 		return -ENOTTY;
 	}
@@ -1234,31 +1361,47 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 			   size_t count, loff_t *ppos, bool iswrite)
 {
 	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+	int ret;
 
 	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
 		return -EINVAL;
 
+	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
+	if (ret < 0) {
+		pci_info_ratelimited(vdev->pdev, "runtime resume failed %d\n",
+				     ret);
+		return -EIO;
+	}
+
 	switch (index) {
 	case VFIO_PCI_CONFIG_REGION_INDEX:
-		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
+		ret = vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
+		break;
 
 	case VFIO_PCI_ROM_REGION_INDEX:
 		if (iswrite)
-			return -EINVAL;
-		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
+			ret = -EINVAL;
+		else
+			ret = vfio_pci_bar_rw(vdev, buf, count, ppos, false);
+		break;
 
 	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
-		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
+		ret = vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
+		break;
 
 	case VFIO_PCI_VGA_REGION_INDEX:
-		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
+		ret = vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
+		break;
+
 	default:
 		index -= VFIO_PCI_NUM_REGIONS;
-		return vdev->region[index].ops->rw(vdev, buf,
+		ret = vdev->region[index].ops->rw(vdev, buf,
 						   count, ppos, iswrite);
+		break;
 	}
 
-	return -EINVAL;
+	pm_runtime_put(&vdev->pdev->dev);
+	return ret;
 }
 
 ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
@@ -2157,6 +2300,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		goto err_unlock;
 	}
 
+	/*
+	 * Some of the devices in the dev_set can be in the runtime suspended
+	 * state. Increment the usage count for all the devices in the dev_set
+	 * before reset and decrement the same after reset.
+	 */
+	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
+	if (ret)
+		goto err_unlock;
+
 	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
 		/*
 		 * Test whether all the affected devices are contained by the
@@ -2212,6 +2364,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
 		else
 			mutex_unlock(&cur->vma_lock);
 	}
+
+	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
+		pm_runtime_put(&cur->pdev->dev);
 err_unlock:
 	mutex_unlock(&dev_set->lock);
 	return ret;
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index cdfd328ba6b1..bf4823b008f9 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -125,6 +125,7 @@ struct vfio_pci_core_device {
 	bool			nointx;
 	bool			needs_pm_restore;
 	bool			pm_intx_masked;
+	bool			pm_runtime_engaged;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver
  2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
                   ` (3 preceding siblings ...)
  2022-07-01 11:08 ` [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
@ 2022-07-01 11:08 ` Abhishek Sahu
  2022-07-06 15:40   ` Alex Williamson
  2022-07-01 11:08 ` [PATCH v4 6/6] vfio/pci: Add support for virtual PME Abhishek Sahu
  5 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-01 11:08 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

Some devices (like NVIDIA VGA or 3D controllers) require driver
involvement each time before going into D3cold. In the regular flow,
the guest driver does all the required steps inside the guest OS and
then the hypervisor invokes the IOCTL for D3cold entry. Now, if there
is any activity on the host side (for example, the user runs lspci or
dumps the config space through sysfs), then the runtime PM framework
will resume the device first, perform the operation, and then suspend
the device again. This second suspend happens without guest driver
involvement. This patch adds support for preventing this second
runtime suspend after such a wake-up. The prevention is based either
on a predefined vendor/class ID list or on a flag
(VFIO_PM_LOW_POWER_REENTERY_DISABLE) that the user can specify at
entry time.

The 'pm_runtime_reentry_allowed' flag tracks whether this re-entry is
allowed. It will be set at entry time.

The 'pm_runtime_resumed' flag tracks whether a wake-up happened before
the guest itself performed one. If re-entry is not allowed, then during
runtime resume, the runtime PM usage count will be incremented and this
flag will be set. The flag is checked during guest D3cold exit, and the
runtime PM-related handling is skipped if it is set.

During guest low power exit, all vdev power-related flags are accessed
under 'memory_lock' and the usage count will be incremented. The
resume will be triggered after releasing the lock, since the runtime
resume callback itself requires the lock. pm_runtime_get_noresume()/
pm_runtime_resume() have been used instead of
pm_runtime_resume_and_get() to handle the following race condition:

 a. The guest triggers the low power exit.
 b. The guest thread takes the lock, clears the vdev-related flags,
    and releases the lock.
 c. Before pm_runtime_resume_and_get() runs, the host lspci thread
    gets scheduled and triggers the runtime resume.
 d. Since all the vdev-related flags are now cleared, there is no
    extra handling inside the runtime resume.
 e. The runtime PM framework puts the device back into the suspended
    state.
 f. The guest's pm_runtime_resume_and_get() finally gets called.

So, at step (e), the suspend happens without guest driver involvement.
By calling pm_runtime_get_noresume() before releasing 'memory_lock',
the runtime PM framework can't suspend the device, due to the
incremented usage count.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 87 ++++++++++++++++++++++++++++++--
 include/linux/vfio_pci_core.h    |  2 +
 2 files changed, 84 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 8c17ca41d156..1ddaaa6ccef5 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -191,6 +191,20 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
 	return false;
 }
 
+static bool vfio_pci_low_power_reentry_allowed(struct pci_dev *pdev)
+{
+	/*
+	 * The NVIDIA display class requires driver involvement for every
+	 * D3cold entry. The audio and other classes can go into D3cold
+	 * without driver involvement.
+	 */
+	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
+	    ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY))
+		return false;
+
+	return true;
+}
+
 static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -295,6 +309,27 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
 	if (vdev->pm_intx_masked)
 		vfio_pci_intx_unmask(vdev);
 
+	down_write(&vdev->memory_lock);
+
+	/*
+	 * The runtime resume callback will be called for one of the following
+	 * two cases:
+	 *
+	 * - If the user has called IOCTL explicitly to move the device out of
+	 *   the low power state or closed the device.
+	 * - If there is device access on the host side.
+	 *
+	 * For the second case, check if re-entry to the low power state is
+	 * allowed. If not, then increment the usage count so that runtime PM
+	 * framework won't suspend the device and set the 'pm_runtime_resumed'
+	 * flag.
+	 */
+	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
+		pm_runtime_get_noresume(dev);
+		vdev->pm_runtime_resumed = true;
+	}
+	up_write(&vdev->memory_lock);
+
 	return 0;
 }
 #endif /* CONFIG_PM */
@@ -415,9 +450,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	 */
 	down_write(&vdev->memory_lock);
 	if (vdev->pm_runtime_engaged) {
+		if (!vdev->pm_runtime_resumed) {
+			pm_runtime_get_noresume(&pdev->dev);
+			do_resume = true;
+		}
+		vdev->pm_runtime_resumed = false;
 		vdev->pm_runtime_engaged = false;
-		pm_runtime_get_noresume(&pdev->dev);
-		do_resume = true;
 	}
 	up_write(&vdev->memory_lock);
 
@@ -1227,12 +1265,17 @@ static int vfio_pci_pm_validate_flags(u32 flags)
 	if (!flags)
 		return -EINVAL;
 	/* Only valid flags should be set */
-	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
+	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT |
+		      VFIO_PM_LOW_POWER_REENTERY_DISABLE))
 		return -EINVAL;
 	/* Both enter and exit should not be set */
 	if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
 	    (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
 		return -EINVAL;
+	/* re-entry disable can only be set with enter */
+	if ((flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE) &&
+	    !(flags & VFIO_PM_LOW_POWER_ENTER))
+		return -EINVAL;
 
 	return 0;
 }
@@ -1255,10 +1298,17 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
 
 	if (flags & VFIO_DEVICE_FEATURE_GET) {
 		down_read(&vdev->memory_lock);
-		if (vdev->pm_runtime_engaged)
+		if (vdev->pm_runtime_engaged) {
 			vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
-		else
+			if (!vdev->pm_runtime_reentry_allowed)
+				vfio_pm.flags |=
+					VFIO_PM_LOW_POWER_REENTERY_DISABLE;
+		} else {
 			vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
+			if (!vfio_pci_low_power_reentry_allowed(pdev))
+				vfio_pm.flags |=
+					VFIO_PM_LOW_POWER_REENTERY_DISABLE;
+		}
 		up_read(&vdev->memory_lock);
 
 		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
@@ -1286,6 +1336,19 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
 		}
 
 		vdev->pm_runtime_engaged = true;
+		vdev->pm_runtime_resumed = false;
+
+		/*
+		 * If there is any access when the device is in the runtime
+		 * suspended state, then the device will be resumed first
+		 * before access and then the device will be suspended again.
+		 * Check if this second time suspend is allowed and track the
+		 * same in 'pm_runtime_reentry_allowed' flag.
+		 */
+		vdev->pm_runtime_reentry_allowed =
+			vfio_pci_low_power_reentry_allowed(pdev) &&
+			!(vfio_pm.flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE);
+
 		up_write(&vdev->memory_lock);
 		pm_runtime_put(&pdev->dev);
 	} else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
@@ -1296,6 +1359,20 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
 		}
 
 		vdev->pm_runtime_engaged = false;
+		if (vdev->pm_runtime_resumed) {
+			vdev->pm_runtime_resumed = false;
+			up_write(&vdev->memory_lock);
+			return 0;
+		}
+
+		/*
+		 * The 'memory_lock' will be acquired again inside the runtime
+		 * resume callback. So, increment the usage count inside the
+		 * lock and call pm_runtime_resume() after releasing the lock.
+		 * If there is any race between a wake-up generated at the host
+		 * and the current path, then the incremented usage count will
+		 * prevent the device from going into the suspended state.
+		 */
 		pm_runtime_get_noresume(&pdev->dev);
 		up_write(&vdev->memory_lock);
 		ret = pm_runtime_resume(&pdev->dev);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index bf4823b008f9..18cc83b767b8 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -126,6 +126,8 @@ struct vfio_pci_core_device {
 	bool			needs_pm_restore;
 	bool			pm_intx_masked;
 	bool			pm_runtime_engaged;
+	bool			pm_runtime_resumed;
+	bool			pm_runtime_reentry_allowed;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v4 6/6] vfio/pci: Add support for virtual PME
  2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
                   ` (4 preceding siblings ...)
  2022-07-01 11:08 ` [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver Abhishek Sahu
@ 2022-07-01 11:08 ` Abhishek Sahu
  2022-07-06 15:40   ` Alex Williamson
  5 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-01 11:08 UTC (permalink / raw)
  To: Alex Williamson, Cornelia Huck, Yishai Hadas, Jason Gunthorpe,
	Shameer Kolothum, Kevin Tian, Rafael J . Wysocki
  Cc: Max Gurtovoy, Bjorn Helgaas, linux-kernel, kvm, linux-pm,
	linux-pci, Abhishek Sahu

If a PCI device is in a low power state and requires wake-up, it can
generate a PME (Power Management Event). Typically these PME events
are propagated to the root port, which then generates a system
interrupt. The OS should then identify the device that generated the
PME and resume it.

We can implement a similar virtual PME framework: if the device has
already gone into the runtime suspended state and there is then any
wake-up on the host side, a virtual PME notification will be sent to
the guest. This virtual PME is helpful for the cases where the device
will not be suspended again after a host-triggered wake-up. Following
is the overall approach regarding the virtual PME.

1. Add one more event like VFIO_PCI_ERR_IRQ_INDEX named
   VFIO_PCI_PME_IRQ_INDEX and do the required code changes to get/set
   this new IRQ.

2. From the guest side, the guest needs to enable eventfd for the
   virtual PME notification.

3. In the vfio-pci driver, the PME support bits are currently
   virtualized and set to 0. We can set PME capability support for all
   the power states. This PME capability support is independent of the
   physical PME support.

4. The PME enable (PME_En bit in Power Management Control/Status
   Register) and PME status (PME_Status bit in Power Management
   Control/Status Register) are currently virtualized as well.
   Write support for the PME_En bit can be enabled.

5. The PME_Status bit is a write-1-clear bit where the write with
   zero value will have no effect and write with 1 value will clear the
   bit. The write for this bit will be trapped inside
   vfio_pm_config_write() similar to PCI_PM_CTRL write for PM_STATES.

6. When the host gets a request to resume the device from anything
   other than the low power exit feature IOCTL, then the PME_Status
   bit will be set.
   According to [PCIe v5 7.5.2.2],
     "PME_Status - This bit is Set when the Function would normally
      generate a PME signal. The value of this bit is not affected by
      the value of the PME_En bit."

   So even if the PME_En bit is not set, we can set the PME_Status bit.

7. If the guest has enabled PME_En and registered for PME events
   through eventfd, then the usage count will be incremented to prevent
   the device from going into the suspended state, and the guest will
   be notified through the eventfd trigger.

The virtual PME can also help in handling physical PME. When a
physical PME arrives, the runtime resume will be called as well, and
if the guest has registered for virtual PME, it will be sent in this
case too.

* Implementation for handling the virtual PME on the hypervisor:

Taking the Linux implementation as an example: during runtime suspend,
the kernel calls __pci_enable_wake(), which internally enables PME
through pci_pme_active() and also enables the ACPI-side wake-up
through platform_pci_set_wakeup(). To handle the PME, the hypervisor
has the following two options:

1. Create a virtual root port for the VFIO device and trigger an
   interrupt when the PME arrives. This will call pcie_pme_irq(),
   which will resume the device.

2. Create a virtual ACPI _PRW resource and associate it with the device
   itself. In _PRW, any GPE (General Purpose Event) can be assigned for
   the wake-up. When the PME arrives, the GPE can be triggered by the
   hypervisor. The GPE interrupt will internally call
   pci_acpi_wake_dev(), which will resume the device.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 39 +++++++++++++++++++++------
 drivers/vfio/pci/vfio_pci_core.c   | 43 ++++++++++++++++++++++++------
 drivers/vfio/pci/vfio_pci_intrs.c  | 18 +++++++++++++
 include/linux/vfio_pci_core.h      |  2 ++
 include/uapi/linux/vfio.h          |  1 +
 5 files changed, 87 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 21a4743d011f..a06375a03758 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -719,6 +719,20 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
 	if (count < 0)
 		return count;
 
+	/*
+	 * PME_STATUS is a write-1-clear bit. If the written value has the
+	 * bit set, clear it in vconfig. PME_STATUS is in the upper byte of
+	 * the control register, and the user may do a single-byte write.
+	 */
+	if (offset <= PCI_PM_CTRL + 1 && offset + count > PCI_PM_CTRL + 1) {
+		if (le32_to_cpu(val) &
+		    (PCI_PM_CTRL_PME_STATUS >> (offset - PCI_PM_CTRL) * 8)) {
+			__le16 *ctrl = (__le16 *)&vdev->vconfig
+					[vdev->pm_cap_offset + PCI_PM_CTRL];
+			*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
+		}
+	}
+
 	if (offset == PCI_PM_CTRL) {
 		pci_power_t state;
 
@@ -771,14 +785,16 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
 	 * the user change power state, but we trap and initiate the
 	 * change ourselves, so the state bits are read-only.
 	 *
-	 * The guest can't process PME from D3cold so virtualize PME_Status
-	 * and PME_En bits. The vconfig bits will be cleared during device
-	 * capability initialization.
+	 * The guest can't process physical PME from D3cold so virtualize
+	 * PME_Status and PME_En bits. These bits will be used for the
+	 * virtual PME between host and guest. The vconfig bits will be
+	 * updated during device capability initialization. PME_Status is
+	 * write-1-clear bit, so it is read-only. We trap and update the
+	 * vconfig bit manually during write.
 	 */
 	p_setd(perm, PCI_PM_CTRL,
 	       PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
-	       ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
-		 PCI_PM_CTRL_STATE_MASK));
+	       ~(PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_PME_STATUS));
 
 	return 0;
 }
@@ -1454,8 +1470,13 @@ static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
 	__le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
 	__le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];
 
-	/* Clear vconfig PME_Support, PME_Status, and PME_En bits */
-	*pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
+	/*
+	 * Set the vconfig PME_Support bits. The PME_Status is being used for
+	 * virtual PME support and is not dependent upon the physical
+	 * PME support.
+	 */
+	*pmc |= cpu_to_le16(PCI_PM_CAP_PME_MASK);
+	/* Clear vconfig PME_Status and PME_En bits */
 	*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
 }
 
@@ -1582,8 +1603,10 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
 		if (ret)
 			return ret;
 
-		if (cap == PCI_CAP_ID_PM)
+		if (cap == PCI_CAP_ID_PM) {
+			vdev->pm_cap_offset = pos;
 			vfio_update_pm_vconfig_bytes(vdev, pos);
+		}
 
 		prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
 		pos = next;
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 1ddaaa6ccef5..6c1225bc2aeb 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -319,14 +319,35 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
 	 *   the low power state or closed the device.
 	 * - If there is device access on the host side.
 	 *
-	 * For the second case, check if re-entry to the low power state is
-	 * allowed. If not, then increment the usage count so that runtime PM
-	 * framework won't suspend the device and set the 'pm_runtime_resumed'
-	 * flag.
+	 * For the second case:
+	 * - The virtual PME_STATUS bit will be set. If PME_ENABLE bit is set
+	 *   and user has registered for virtual PME events, then send the PME
+	 *   virtual PME event.
+	 * - Check if re-entry to the low power state is not allowed.
+	 *
+	 * For the above conditions, increment the usage count so that
+	 * runtime PM framework won't suspend the device and set the
+	 * 'pm_runtime_resumed' flag.
 	 */
-	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
-		pm_runtime_get_noresume(dev);
-		vdev->pm_runtime_resumed = true;
+	if (vdev->pm_runtime_engaged) {
+		bool pme_triggered = false;
+		__le16 *ctrl = (__le16 *)&vdev->vconfig
+				[vdev->pm_cap_offset + PCI_PM_CTRL];
+
+		*ctrl |= cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
+		if (le16_to_cpu(*ctrl) & PCI_PM_CTRL_PME_ENABLE) {
+			mutex_lock(&vdev->igate);
+			if (vdev->pme_trigger) {
+				pme_triggered = true;
+				eventfd_signal(vdev->pme_trigger, 1);
+			}
+			mutex_unlock(&vdev->igate);
+		}
+
+		if (!vdev->pm_runtime_reentry_allowed || pme_triggered) {
+			pm_runtime_get_noresume(dev);
+			vdev->pm_runtime_resumed = true;
+		}
 	}
 	up_write(&vdev->memory_lock);
 
@@ -586,6 +607,10 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
 		eventfd_ctx_put(vdev->req_trigger);
 		vdev->req_trigger = NULL;
 	}
+	if (vdev->pme_trigger) {
+		eventfd_ctx_put(vdev->pme_trigger);
+		vdev->pme_trigger = NULL;
+	}
 	mutex_unlock(&vdev->igate);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
@@ -639,7 +664,8 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
 	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
 		if (pci_is_pcie(vdev->pdev))
 			return 1;
-	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
+	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX ||
+		   irq_type == VFIO_PCI_PME_IRQ_INDEX) {
 		return 1;
 	}
 
@@ -985,6 +1011,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		switch (info.index) {
 		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
 		case VFIO_PCI_REQ_IRQ_INDEX:
+		case VFIO_PCI_PME_IRQ_INDEX:
 			break;
 		case VFIO_PCI_ERR_IRQ_INDEX:
 			if (pci_is_pcie(vdev->pdev))
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 1a37db99df48..db4180687a74 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -639,6 +639,17 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_core_device *vdev,
 					       count, flags, data);
 }
 
+static int vfio_pci_set_pme_trigger(struct vfio_pci_core_device *vdev,
+				    unsigned index, unsigned start,
+				    unsigned count, uint32_t flags, void *data)
+{
+	if (index != VFIO_PCI_PME_IRQ_INDEX || start != 0 || count > 1)
+		return -EINVAL;
+
+	return vfio_pci_set_ctx_trigger_single(&vdev->pme_trigger,
+					       count, flags, data);
+}
+
 int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
 			    unsigned index, unsigned start, unsigned count,
 			    void *data)
@@ -688,6 +699,13 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
 			break;
 		}
 		break;
+	case VFIO_PCI_PME_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			func = vfio_pci_set_pme_trigger;
+			break;
+		}
+		break;
 	}
 
 	if (!func)
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 18cc83b767b8..ee2646d820c2 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -102,6 +102,7 @@ struct vfio_pci_core_device {
 	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
 	u8			*pci_config_map;
 	u8			*vconfig;
+	u8			pm_cap_offset;
 	struct perm_bits	*msi_perm;
 	spinlock_t		irqlock;
 	struct mutex		igate;
@@ -133,6 +134,7 @@ struct vfio_pci_core_device {
 	int			ioeventfds_nr;
 	struct eventfd_ctx	*err_trigger;
 	struct eventfd_ctx	*req_trigger;
+	struct eventfd_ctx	*pme_trigger;
 	struct list_head	dummy_resources_list;
 	struct mutex		ioeventfds_lock;
 	struct list_head	ioeventfds_list;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 7e00de5c21ea..08170950d655 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -621,6 +621,7 @@ enum {
 	VFIO_PCI_MSIX_IRQ_INDEX,
 	VFIO_PCI_ERR_IRQ_INDEX,
 	VFIO_PCI_REQ_IRQ_INDEX,
+	VFIO_PCI_PME_IRQ_INDEX,
 	VFIO_PCI_NUM_IRQS
 };
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend
  2022-07-01 11:08 ` [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
@ 2022-07-06 15:39   ` Alex Williamson
  2022-07-08  9:21     ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-06 15:39 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 1 Jul 2022 16:38:09 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> This patch adds INTx handling during runtime suspend/resume.
> All the suspend/resume related code for the user to put the device
> into the low power state will be added in subsequent patches.
> 
> The INTx are shared among devices. Whenever any INTx interrupt comes

"The INTx lines may be shared..."

> for the VFIO devices, then vfio_intx_handler() will be called for each
> device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx()

"...device sharing the interrupt."

> and checks if the interrupt has been generated for the current device.
> Now, if the device is already in the D3cold state, then the config space
> can not be read. Attempt to read config space in D3cold state can
> cause system unresponsiveness in a few systems. To prevent this, mask
> INTx in runtime suspend callback and unmask the same in runtime resume
> callback. If INTx has been already masked, then no handling is needed
> in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and
> vfio_pci_intx_mask() has been updated to return true if INTx has been
> masked inside this function.
> 
> For the runtime suspend which is triggered for the no user of VFIO
> device, the is_intx() will return false and these callbacks won't do
> anything.
> 
> The MSI/MSI-X are not shared so similar handling should not be
> needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal()
> without doing any device-specific config access. When the user performs
> any config access or IOCTL after receiving the eventfd notification,
> then the device will be moved to the D0 state first before
> servicing any request.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c  | 37 +++++++++++++++++++++++++++----
>  drivers/vfio/pci/vfio_pci_intrs.c |  6 ++++-
>  include/linux/vfio_pci_core.h     |  3 ++-
>  3 files changed, 40 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a0d69ddaf90d..5948d930449b 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>  	return ret;
>  }
>  
> +#ifdef CONFIG_PM
> +static int vfio_pci_core_runtime_suspend(struct device *dev)
> +{
> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> +
> +	/*
> +	 * If INTx is enabled, then mask INTx before going into the runtime
> +	 * suspended state and unmask the same in the runtime resume.
> +	 * If INTx has already been masked by the user, then
> +	 * vfio_pci_intx_mask() will return false and in that case, INTx
> +	 * should not be unmasked in the runtime resume.
> +	 */
> +	vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev));
> +
> +	return 0;
> +}
> +
> +static int vfio_pci_core_runtime_resume(struct device *dev)
> +{
> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> +
> +	if (vdev->pm_intx_masked)
> +		vfio_pci_intx_unmask(vdev);
> +
> +	return 0;
> +}
> +#endif /* CONFIG_PM */
> +
>  /*
> - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
> - * so use structure without any callbacks.
> - *
>   * The pci-driver core runtime PM routines always save the device state
>   * before going into suspended state. If the device is going into low power
>   * state with only with runtime PM ops, then no explicit handling is needed
>   * for the devices which have NoSoftRst-.
>   */
> -static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
> +static const struct dev_pm_ops vfio_pci_core_pm_ops = {
> +	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
> +			   vfio_pci_core_runtime_resume,
> +			   NULL)
> +};
>  
>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  {
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 6069a11fb51a..1a37db99df48 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
>  		eventfd_signal(vdev->ctx[0].trigger, 1);
>  }
>  
> -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
> +/* Returns true if INTx has been masked by this function. */
> +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
>  	unsigned long flags;
> +	bool intx_masked = false;
>  
>  	spin_lock_irqsave(&vdev->irqlock, flags);
>  
> @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>  			disable_irq_nosync(pdev->irq);
>  
>  		vdev->ctx[0].masked = true;
> +		intx_masked = true;
>  	}
>  
>  	spin_unlock_irqrestore(&vdev->irqlock, flags);
> +	return intx_masked;
>  }


There's certainly another path through this function that masks the
interrupt, which makes the definition of this return value a bit
confusing.  Wouldn't it be simpler not to overload the masked flag on
the interrupt context like this, and instead set a new flag on the vdev
under irqlock to indicate the device is unable to generate interrupts?
The irq handler would add a test of this flag before any tests that
would access the device.  Thanks,

Alex
 
>  /*
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 23c176d4b073..cdfd328ba6b1 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -124,6 +124,7 @@ struct vfio_pci_core_device {
>  	bool			needs_reset;
>  	bool			nointx;
>  	bool			needs_pm_restore;
> +	bool			pm_intx_masked;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;
> @@ -147,7 +148,7 @@ struct vfio_pci_core_device {
>  #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
>  #define irq_is(vdev, type) (vdev->irq_type == type)
>  
> -extern void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
> +extern bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
>  extern void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
>  
>  extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev,



* Re: [PATCH v4 2/6] vfio: Add a new device feature for the power management
  2022-07-01 11:08 ` [PATCH v4 2/6] vfio: Add a new device feature for the power management Abhishek Sahu
@ 2022-07-06 15:39   ` Alex Williamson
  2022-07-08  9:39     ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-06 15:39 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 1 Jul 2022 16:38:10 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
> for power management to the header file. The implementation will be
> added in subsequent patches.
> 
> Not all power states can be achieved with the standard registers
> alone; platform-based power management needs to be involved to reach
> the lowest power state. This device feature can be used for all such
> platform-based power management.
> 
> This device feature uses flags to specify the different operations;
> if more power management functionality is needed in the future, a new
> flag can be added. It supports both GET and SET operations.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 733a1cddde30..7e00de5c21ea 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>  };
>  
> +/*
> + * Perform power management-related operations for the VFIO device.
> + *
> + * The low power feature uses platform-based power management to move the
> + * device into the low power state.  This low power state is device-specific.
> + *
> + * This device feature uses flags to specify the different operations.
> + * It supports both the GET and SET operations.
> + *
> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
> + *   state with platform-based power management.  This low power state will be
> + *   internal to the VFIO driver and the user will not come to know which power
> + *   state is chosen.  Once the user has moved the VFIO device into the low
> + *   power state, then the user should not do any device access without moving
> + *   the device out of the low power state.

Except we're wrapping device accesses to make this possible.  This
should probably describe how any discrete access will wake the device
but ongoing access through mmaps will generate user faults.

> + *
> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
> + *    state.  This flag should only be set if the user has previously put the
> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.

Indenting.

> + *
> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
> + *
> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
> + *   the host side, then the device will be moved out of the low power state
> + *   without the user's guest driver involvement.  Some devices require the
> + *   user's guest driver involvement for each low-power entry.  If this flag is
> + *   set, then the re-entry to the low power state will be disabled, and the
> + *   host kernel will not move the device again into the low power state.
> + *   The VFIO driver internally maintains a list of devices for which low
> + *   power re-entry is disabled by default and for those devices, the
> + *   re-entry will be disabled even if the user has not set this flag
> + *   explicitly.

Wrong polarity.  The kernel should not maintain the policy.  By default
every wakeup, whether from host kernel accesses or via user accesses
that do a pm-get should signal a wakeup to userspace.  Userspace needs
to opt-out of that wakeup to let the kernel automatically re-enter low
power and userspace needs to maintain the policy for which devices it
wants that to occur.

> + *
> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
> + *
> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
> + *
> + * - If the device is in a normal power state currently, then
> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
> + *   power re-entry is disabled by default.  If the device is in the low power
> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
> + *   according to the current transition.

Very confusing semantics.

What if the feature SET ioctl took an eventfd and that eventfd was one
time use.  Calling the ioctl would setup the eventfd to notify the user
on wakeup and call pm-put.  Any access to the device via host, ioctl,
or region would be wrapped in pm-get/put and the pm-resume handler
would perform the matching pm-get to balance the feature SET and signal
the eventfd.  If the user opts-out by not providing a wakeup eventfd,
then the pm-resume handler does not perform a pm-get.  Possibly we
could even allow mmap access if a wake-up eventfd is provided.  The
feature GET ioctl would be used to exit low power behavior and would be
a no-op if the wakeup eventfd had already been signaled.  Thanks,

Alex

> + */
> +struct vfio_device_feature_power_management {
> +	__u32	flags;
> +#define VFIO_PM_LOW_POWER_ENTER			(1 << 0)
> +#define VFIO_PM_LOW_POWER_EXIT			(1 << 1)
> +#define VFIO_PM_LOW_POWER_REENTERY_DISABLE	(1 << 2)
> +	__u32	reserved;
> +};
> +
> +#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3
> +
>  /* -------- API for Type1 VFIO IOMMU -------- */
>  
>  /**



* Re: [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call
  2022-07-01 11:08 ` [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
@ 2022-07-06 15:40   ` Alex Williamson
  2022-07-08  9:43     ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-06 15:40 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 1 Jul 2022 16:38:11 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> The vfio-pci based driver will have runtime power management
> support where the user can put the device into the low power state
> and then the PCI device can go into the D3cold state. If the device is
> in the low power state and the user issues any IOCTL, then the
> device should be moved out of the low power state first. Once
> the IOCTL is serviced, it can go into the low power state again.
> The runtime PM framework manages this with the help of a usage count.
> 
> One option was to add the runtime PM related APIs inside the vfio-pci
> driver, but some IOCTLs (like VFIO_DEVICE_FEATURE) can follow a
> different path and more IOCTLs can be added in the future. Also,
> runtime PM will be added for the vfio-pci based driver variants
> currently, but other VFIO based drivers can use the same support in
> the future. So, this patch adds the runtime PM API calls in the
> top-level IOCTL function itself.
>
> For the VFIO drivers which do not currently have runtime power
> management support, the runtime PM APIs won't be invoked. The usage
> count will be incremented and decremented through these APIs only
> for vfio-pci based drivers currently.

Variant drivers can easily opt-out of runtime pm support by performing
a gratuitous pm-get in their device-open function.
 
> Keeping this usage count incremented while servicing an IOCTL makes
> sure that the user can't put the device into the low power state while
> any other IOCTL is being serviced in parallel. Let's consider the
> following scenario:
> 
>  1. Some other IOCTL is called.
>  2. The user has opened another device instance and called the power
>     management IOCTL for the low power entry.
>  3. The power management IOCTL moves the device into the low power state.
>  4. The other IOCTL finishes.
> 
> If we don't keep the usage count incremented then the device
> access will happen between step 3 and 4 while the device has already
> gone into the low power state.
> 
> The runtime PM API's should not be invoked for
> VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
> the runtime power management entry and exit for the VFIO device.

I think the one-shot interface I proposed in the previous patch avoids
the need for special handling for these feature ioctls.  Thanks,

Alex
 
> The pm_runtime_resume_and_get() will be the first call, so its error
> should not be propagated to user space directly. For example,
> pm_runtime_resume_and_get() can return -EINVAL even in cases where the
> user has passed the correct arguments. So the
> pm_runtime_resume_and_get() errors have been masked behind -EIO.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/vfio.c | 82 ++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 74 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 61e71c1154be..61a8d9f7629a 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -32,6 +32,7 @@
>  #include <linux/vfio.h>
>  #include <linux/wait.h>
>  #include <linux/sched/signal.h>
> +#include <linux/pm_runtime.h>
>  #include "vfio.h"
>  
>  #define DRIVER_VERSION	"0.3"
> @@ -1333,6 +1334,39 @@ static const struct file_operations vfio_group_fops = {
>  	.release	= vfio_group_fops_release,
>  };
>  
> +/*
> + * Wrapper around pm_runtime_resume_and_get().
> + * Return error code on failure or 0 on success.
> + */
> +static inline int vfio_device_pm_runtime_get(struct vfio_device *device)
> +{
> +	struct device *dev = device->dev;
> +
> +	if (dev->driver && dev->driver->pm) {
> +		int ret;
> +
> +		ret = pm_runtime_resume_and_get(dev);
> +		if (ret < 0) {
> +			dev_info_ratelimited(dev,
> +				"vfio: runtime resume failed %d\n", ret);
> +			return -EIO;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Wrapper around pm_runtime_put().
> + */
> +static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
> +{
> +	struct device *dev = device->dev;
> +
> +	if (dev->driver && dev->driver->pm)
> +		pm_runtime_put(dev);
> +}
> +
>  /*
>   * VFIO Device fd
>   */
> @@ -1607,6 +1641,8 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>  {
>  	size_t minsz = offsetofend(struct vfio_device_feature, flags);
>  	struct vfio_device_feature feature;
> +	int ret = 0;
> +	u16 feature_cmd;
>  
>  	if (copy_from_user(&feature, arg, minsz))
>  		return -EFAULT;
> @@ -1626,28 +1662,51 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>  	    (feature.flags & VFIO_DEVICE_FEATURE_GET))
>  		return -EINVAL;
>  
> -	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
> +	feature_cmd = feature.flags & VFIO_DEVICE_FEATURE_MASK;
> +
> +	/*
> +	 * The VFIO_DEVICE_FEATURE_POWER_MANAGEMENT itself performs the runtime
> +	 * power management entry and exit for the VFIO device, so the runtime
> +	 * PM API's should not be called for this feature.
> +	 */
> +	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT) {
> +		ret = vfio_device_pm_runtime_get(device);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	switch (feature_cmd) {
>  	case VFIO_DEVICE_FEATURE_MIGRATION:
> -		return vfio_ioctl_device_feature_migration(
> +		ret = vfio_ioctl_device_feature_migration(
>  			device, feature.flags, arg->data,
>  			feature.argsz - minsz);
> +		break;
>  	case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE:
> -		return vfio_ioctl_device_feature_mig_device_state(
> +		ret = vfio_ioctl_device_feature_mig_device_state(
>  			device, feature.flags, arg->data,
>  			feature.argsz - minsz);
> +		break;
>  	default:
>  		if (unlikely(!device->ops->device_feature))
> -			return -EINVAL;
> -		return device->ops->device_feature(device, feature.flags,
> -						   arg->data,
> -						   feature.argsz - minsz);
> +			ret = -EINVAL;
> +		else
> +			ret = device->ops->device_feature(
> +				device, feature.flags, arg->data,
> +				feature.argsz - minsz);
> +		break;
>  	}
> +
> +	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT)
> +		vfio_device_pm_runtime_put(device);
> +
> +	return ret;
>  }
>  
>  static long vfio_device_fops_unl_ioctl(struct file *filep,
>  				       unsigned int cmd, unsigned long arg)
>  {
>  	struct vfio_device *device = filep->private_data;
> +	int ret;
>  
>  	switch (cmd) {
>  	case VFIO_DEVICE_FEATURE:
> @@ -1655,7 +1714,14 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
>  	default:
>  		if (unlikely(!device->ops->ioctl))
>  			return -EINVAL;
> -		return device->ops->ioctl(device, cmd, arg);
> +
> +		ret = vfio_device_pm_runtime_get(device);
> +		if (ret)
> +			return ret;
> +
> +		ret = device->ops->ioctl(device, cmd, arg);
> +		vfio_device_pm_runtime_put(device);
> +		return ret;
>  	}
>  }
>  



* Re: [PATCH v4 6/6] vfio/pci: Add support for virtual PME
  2022-07-01 11:08 ` [PATCH v4 6/6] vfio/pci: Add support for virtual PME Abhishek Sahu
@ 2022-07-06 15:40   ` Alex Williamson
  2022-07-08  9:45     ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-06 15:40 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 1 Jul 2022 16:38:14 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> If the PCI device is in a low power state and requires wake-up, then
> it can generate a PME (Power Management Event). Typically, these PME
> events are propagated to the root port, and then the root port
> generates a system interrupt. The OS should then identify the device
> which generated the PME and resume that device.
> 
> We can implement a similar virtual PME framework: if the device has
> already gone into the runtime suspended state and there is any wake-up
> on the host side, then a virtual PME notification will be sent to the
> guest. This virtual PME will be helpful for the cases where the device
> will not be suspended again after a host-triggered wake-up. Following
> is the overall approach regarding the virtual PME.
> 
> 1. Add one more event like VFIO_PCI_ERR_IRQ_INDEX named
>    VFIO_PCI_PME_IRQ_INDEX and do the required code changes to get/set
>    this new IRQ.
> 
> 2. From the guest side, the guest needs to enable eventfd for the
>    virtual PME notification.
> 
> 3. In the vfio-pci driver, the PME support bits are currently
>    virtualized and set to 0. We can set PME capability support for all
>    the power states. This PME capability support is independent of the
>    physical PME support.
> 
> 4. The PME enable (PME_En bit in Power Management Control/Status
>    Register) and PME status (PME_Status bit in Power Management
>    Control/Status Register) are also virtualized currently.
>    The write support for PME_En bit can be enabled.
> 
> 5. The PME_Status bit is a write-1-clear bit where the write with
>    zero value will have no effect and write with 1 value will clear the
>    bit. The write for this bit will be trapped inside
>    vfio_pm_config_write() similar to PCI_PM_CTRL write for PM_STATES.
> 
> 6. When the host gets a request for resuming the device other than from
>    low power exit feature IOCTL, then PME_Status bit will be set.
>    According to [PCIe v5 7.5.2.2],
>      "PME_Status - This bit is Set when the Function would normally
>       generate a PME signal. The value of this bit is not affected by
>       the value of the PME_En bit."
> 
>    So even if PME_En bit is not set, we can set PME_Status bit.
> 
> 7. If the guest has enabled PME_En and registered for PME events
>    through eventfd, then the usage count will be incremented to prevent
>    the device from going into the suspended state, and the guest will
>    be notified through the eventfd trigger.
> 
> The virtual PME can also help in handling physical PME. When a
> physical PME comes, the runtime resume will be called as well. If the
> guest has registered for virtual PME, then it will be sent in this
> case too.
> 
> * Implementation for handling the virtual PME on the hypervisor:
> 
> If we take the implementation in the Linux OS, then at runtime
> suspend time it calls __pci_enable_wake(), which internally enables
> PME through pci_pme_active() and also enables the ACPI side wake-up
> through platform_pci_set_wakeup(). To handle the PME, the hypervisor
> has the following two options:
> 
> 1. Create a virtual root port for the VFIO device and trigger
>    interrupt when the PME comes. It will call pcie_pme_irq() which will
>    resume the device.
> 
> 2. Create a virtual ACPI _PRW resource and associate it with the device
>    itself. In _PRW, any GPE (General Purpose Event) can be assigned for
>    the wake-up. When PME comes, then GPE can be triggered by the
>    hypervisor. GPE interrupt will call pci_acpi_wake_dev() function
>    internally and it will resume the device.

Do we really need to implement PME emulation in the kernel or is it
sufficient for userspace to simply register a one-shot eventfd when
SET'ing the low power feature and QEMU can provide the PME emulation
based on that signaling?  Thanks,

Alex

> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_config.c | 39 +++++++++++++++++++++------
>  drivers/vfio/pci/vfio_pci_core.c   | 43 ++++++++++++++++++++++++------
>  drivers/vfio/pci/vfio_pci_intrs.c  | 18 +++++++++++++
>  include/linux/vfio_pci_core.h      |  2 ++
>  include/uapi/linux/vfio.h          |  1 +
>  5 files changed, 87 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 21a4743d011f..a06375a03758 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -719,6 +719,20 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
>  	if (count < 0)
>  		return count;
>  
> +	/*
> +	 * PME_STATUS is write-1-clear bit. If PME_STATUS is 1, then clear the
> +	 * bit in vconfig. The PME_STATUS is in the upper byte of the control
> +	 * register and user can do single byte write also.
> +	 */
> +	if (offset <= PCI_PM_CTRL + 1 && offset + count > PCI_PM_CTRL + 1) {
> +		if (le32_to_cpu(val) &
> +		    (PCI_PM_CTRL_PME_STATUS >> (offset - PCI_PM_CTRL) * 8)) {
> +			__le16 *ctrl = (__le16 *)&vdev->vconfig
> +					[vdev->pm_cap_offset + PCI_PM_CTRL];
> +			*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
> +		}
> +	}
> +
>  	if (offset == PCI_PM_CTRL) {
>  		pci_power_t state;
>  
> @@ -771,14 +785,16 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
>  	 * the user change power state, but we trap and initiate the
>  	 * change ourselves, so the state bits are read-only.
>  	 *
> -	 * The guest can't process PME from D3cold so virtualize PME_Status
> -	 * and PME_En bits. The vconfig bits will be cleared during device
> -	 * capability initialization.
> +	 * The guest can't process physical PME from D3cold so virtualize
> +	 * PME_Status and PME_En bits. These bits will be used for the
> +	 * virtual PME between host and guest. The vconfig bits will be
> +	 * updated during device capability initialization. PME_Status is
> +	 * write-1-clear bit, so it is read-only. We trap and update the
> +	 * vconfig bit manually during write.
>  	 */
>  	p_setd(perm, PCI_PM_CTRL,
>  	       PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
> -	       ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
> -		 PCI_PM_CTRL_STATE_MASK));
> +	       ~(PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_PME_STATUS));
>  
>  	return 0;
>  }
> @@ -1454,8 +1470,13 @@ static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
>  	__le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
>  	__le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];
>  
> -	/* Clear vconfig PME_Support, PME_Status, and PME_En bits */
> -	*pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
> +	/*
> +	 * Set the vconfig PME_Support bits. The PME_Status is being used for
> +	 * virtual PME support and is not dependent upon the physical
> +	 * PME support.
> +	 */
> +	*pmc |= cpu_to_le16(PCI_PM_CAP_PME_MASK);
> +	/* Clear vconfig PME_Status and PME_En bits */
>  	*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
>  }
>  
> @@ -1582,8 +1603,10 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
>  		if (ret)
>  			return ret;
>  
> -		if (cap == PCI_CAP_ID_PM)
> +		if (cap == PCI_CAP_ID_PM) {
> +			vdev->pm_cap_offset = pos;
>  			vfio_update_pm_vconfig_bytes(vdev, pos);
> +		}
>  
>  		prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
>  		pos = next;
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 1ddaaa6ccef5..6c1225bc2aeb 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -319,14 +319,35 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
>  	 *   the low power state or closed the device.
>  	 * - If there is device access on the host side.
>  	 *
> -	 * For the second case, check if re-entry to the low power state is
> -	 * allowed. If not, then increment the usage count so that runtime PM
> -	 * framework won't suspend the device and set the 'pm_runtime_resumed'
> -	 * flag.
> +	 * For the second case:
> +	 * - The virtual PME_STATUS bit will be set. If PME_ENABLE bit is set
> +	 *   and user has registered for virtual PME events, then send the PME
> +	 *   virtual PME event.
> +	 * - Check if re-entry to the low power state is not allowed.
> +	 *
> +	 * For the above conditions, increment the usage count so that
> +	 * runtime PM framework won't suspend the device and set the
> +	 * 'pm_runtime_resumed' flag.
>  	 */
> -	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
> -		pm_runtime_get_noresume(dev);
> -		vdev->pm_runtime_resumed = true;
> +	if (vdev->pm_runtime_engaged) {
> +		bool pme_triggered = false;
> +		__le16 *ctrl = (__le16 *)&vdev->vconfig
> +				[vdev->pm_cap_offset + PCI_PM_CTRL];
> +
> +		*ctrl |= cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
> +		if (le16_to_cpu(*ctrl) & PCI_PM_CTRL_PME_ENABLE) {
> +			mutex_lock(&vdev->igate);
> +			if (vdev->pme_trigger) {
> +				pme_triggered = true;
> +				eventfd_signal(vdev->pme_trigger, 1);
> +			}
> +			mutex_unlock(&vdev->igate);
> +		}
> +
> +		if (!vdev->pm_runtime_reentry_allowed || pme_triggered) {
> +			pm_runtime_get_noresume(dev);
> +			vdev->pm_runtime_resumed = true;
> +		}
>  	}
>  	up_write(&vdev->memory_lock);
>  
> @@ -586,6 +607,10 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>  		eventfd_ctx_put(vdev->req_trigger);
>  		vdev->req_trigger = NULL;
>  	}
> +	if (vdev->pme_trigger) {
> +		eventfd_ctx_put(vdev->pme_trigger);
> +		vdev->pme_trigger = NULL;
> +	}
>  	mutex_unlock(&vdev->igate);
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
> @@ -639,7 +664,8 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>  	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
>  		if (pci_is_pcie(vdev->pdev))
>  			return 1;
> -	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
> +	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX ||
> +		   irq_type == VFIO_PCI_PME_IRQ_INDEX) {
>  		return 1;
>  	}
>  
> @@ -985,6 +1011,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  		switch (info.index) {
>  		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
>  		case VFIO_PCI_REQ_IRQ_INDEX:
> +		case VFIO_PCI_PME_IRQ_INDEX:
>  			break;
>  		case VFIO_PCI_ERR_IRQ_INDEX:
>  			if (pci_is_pcie(vdev->pdev))
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 1a37db99df48..db4180687a74 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -639,6 +639,17 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_core_device *vdev,
>  					       count, flags, data);
>  }
>  
> +static int vfio_pci_set_pme_trigger(struct vfio_pci_core_device *vdev,
> +				    unsigned index, unsigned start,
> +				    unsigned count, uint32_t flags, void *data)
> +{
> +	if (index != VFIO_PCI_PME_IRQ_INDEX || start != 0 || count > 1)
> +		return -EINVAL;
> +
> +	return vfio_pci_set_ctx_trigger_single(&vdev->pme_trigger,
> +					       count, flags, data);
> +}
> +
>  int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
>  			    unsigned index, unsigned start, unsigned count,
>  			    void *data)
> @@ -688,6 +699,13 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
>  			break;
>  		}
>  		break;
> +	case VFIO_PCI_PME_IRQ_INDEX:
> +		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
> +		case VFIO_IRQ_SET_ACTION_TRIGGER:
> +			func = vfio_pci_set_pme_trigger;
> +			break;
> +		}
> +		break;
>  	}
>  
>  	if (!func)
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index 18cc83b767b8..ee2646d820c2 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -102,6 +102,7 @@ struct vfio_pci_core_device {
>  	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
>  	u8			*pci_config_map;
>  	u8			*vconfig;
> +	u8			pm_cap_offset;
>  	struct perm_bits	*msi_perm;
>  	spinlock_t		irqlock;
>  	struct mutex		igate;
> @@ -133,6 +134,7 @@ struct vfio_pci_core_device {
>  	int			ioeventfds_nr;
>  	struct eventfd_ctx	*err_trigger;
>  	struct eventfd_ctx	*req_trigger;
> +	struct eventfd_ctx	*pme_trigger;
>  	struct list_head	dummy_resources_list;
>  	struct mutex		ioeventfds_lock;
>  	struct list_head	ioeventfds_list;
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 7e00de5c21ea..08170950d655 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -621,6 +621,7 @@ enum {
>  	VFIO_PCI_MSIX_IRQ_INDEX,
>  	VFIO_PCI_ERR_IRQ_INDEX,
>  	VFIO_PCI_REQ_IRQ_INDEX,
> +	VFIO_PCI_PME_IRQ_INDEX,
>  	VFIO_PCI_NUM_IRQS
>  };
>  



* Re: [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver
  2022-07-01 11:08 ` [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver Abhishek Sahu
@ 2022-07-06 15:40   ` Alex Williamson
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Williamson @ 2022-07-06 15:40 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 1 Jul 2022 16:38:13 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> Some devices (like NVIDIA VGA or 3D controllers) require driver
> involvement each time before going into D3cold. In the regular flow,
> the guest driver does all the required steps inside the guest OS and
> then the hypervisor calls the IOCTL for D3cold entry. Now, if there is
> any activity on the host side (for example, the user has run lspci,
> dumped the config space through sysfs, etc.), then the runtime PM
> framework will resume the device first, perform the operation, and
> then suspend the device again. This second suspend happens without
> guest driver involvement. This patch adds support for preventing the
> second runtime suspend if there is any wake-up. This prevention is
> either based on a predefined vendor/class ID list, or the user can
> specify the flag (VFIO_PM_LOW_POWER_REENTERY_DISABLE) during entry for
> the same.
> 
> 'pm_runtime_reentry_allowed' flag tracks if this re-entry is allowed.
> It will be set during the entry time.
> 
> 'pm_runtime_resumed' flag tracks whether there was any wake-up before
> the guest itself performs the wake-up. If re-entry is not allowed, then
> during runtime resume, the runtime PM count will be incremented and
> this flag will be set. This flag will be checked during guest D3cold
> exit, and the runtime PM-related handling will be skipped if it is set.
> 
> During guest low power exit time, all vdev power-related flags are
> accessed under 'memory_lock' and the usage count will be incremented.
> The resume will be triggered after releasing the lock, since the
> runtime resume callback again requires the lock. pm_runtime_get_noresume()/
> pm_runtime_resume() have been used instead of
> pm_runtime_resume_and_get() to handle the following race scenario.
> 
>  a. The guest triggered the low power exit.
>  b. The guest thread took the lock, cleared the vdev-related
>     flags, and released the lock.
>  c. Before pm_runtime_resume_and_get(), the host lspci thread got
>     scheduled and triggered the runtime resume.
>  d. Now, all the vdev-related flags are cleared, so there won't be
>     any extra handling inside the runtime resume.
>  e. The runtime PM framework put the device back into the suspended
>     state.
>  f. The guest-triggered pm_runtime_resume_and_get() finally got called.
> 
> So, at step (e), the suspend happens without guest driver
> involvement. By calling pm_runtime_get_noresume() before releasing
> 'memory_lock', the runtime PM framework cannot suspend the device
> because of the incremented usage count.

Nak, this policy should be implemented in userspace.  Thanks,

Alex
 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_core.c | 87 ++++++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h    |  2 +
>  2 files changed, 84 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 8c17ca41d156..1ddaaa6ccef5 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -191,6 +191,20 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
>  	return false;
>  }
>  
> +static bool vfio_pci_low_power_reentry_allowed(struct pci_dev *pdev)
> +{
> +	/*
> +	 * The NVIDIA display class requires driver involvement for every
> +	 * D3cold entry. The audio and other classes can go into D3cold
> +	 * without driver involvement.
> +	 */
> +	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> +	    ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> +		return false;
> +
> +	return true;
> +}
> +
>  static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
>  {
>  	struct pci_dev *pdev = vdev->pdev;
> @@ -295,6 +309,27 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
>  	if (vdev->pm_intx_masked)
>  		vfio_pci_intx_unmask(vdev);
>  
> +	down_write(&vdev->memory_lock);
> +
> +	/*
> +	 * The runtime resume callback will be called for one of the following
> +	 * two cases:
> +	 *
> +	 * - If the user has called IOCTL explicitly to move the device out of
> +	 *   the low power state or closed the device.
> +	 * - If there is device access on the host side.
> +	 *
> +	 * For the second case, check if re-entry to the low power state is
> +	 * allowed. If not, then increment the usage count so that the
> +	 * runtime PM framework won't suspend the device, and set the
> +	 * 'pm_runtime_resumed' flag.
> +	 */
> +	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
> +		pm_runtime_get_noresume(dev);
> +		vdev->pm_runtime_resumed = true;
> +	}
> +	up_write(&vdev->memory_lock);
> +
>  	return 0;
>  }
>  #endif /* CONFIG_PM */
> @@ -415,9 +450,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	 */
>  	down_write(&vdev->memory_lock);
>  	if (vdev->pm_runtime_engaged) {
> +		if (!vdev->pm_runtime_resumed) {
> +			pm_runtime_get_noresume(&pdev->dev);
> +			do_resume = true;
> +		}
> +		vdev->pm_runtime_resumed = false;
>  		vdev->pm_runtime_engaged = false;
> -		pm_runtime_get_noresume(&pdev->dev);
> -		do_resume = true;
>  	}
>  	up_write(&vdev->memory_lock);
>  
> @@ -1227,12 +1265,17 @@ static int vfio_pci_pm_validate_flags(u32 flags)
>  	if (!flags)
>  		return -EINVAL;
>  	/* Only valid flags should be set */
> -	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> +	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT |
> +		      VFIO_PM_LOW_POWER_REENTERY_DISABLE))
>  		return -EINVAL;
>  	/* Both enter and exit should not be set */
>  	if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
>  	    (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
>  		return -EINVAL;
> +	/* re-entry disable can only be set with enter */
> +	if ((flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE) &&
> +	    !(flags & VFIO_PM_LOW_POWER_ENTER))
> +		return -EINVAL;
>  
>  	return 0;
>  }
> @@ -1255,10 +1298,17 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
>  
>  	if (flags & VFIO_DEVICE_FEATURE_GET) {
>  		down_read(&vdev->memory_lock);
> -		if (vdev->pm_runtime_engaged)
> +		if (vdev->pm_runtime_engaged) {
>  			vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
> -		else
> +			if (!vdev->pm_runtime_reentry_allowed)
> +				vfio_pm.flags |=
> +					VFIO_PM_LOW_POWER_REENTERY_DISABLE;
> +		} else {
>  			vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
> +			if (!vfio_pci_low_power_reentry_allowed(pdev))
> +				vfio_pm.flags |=
> +					VFIO_PM_LOW_POWER_REENTERY_DISABLE;
> +		}
>  		up_read(&vdev->memory_lock);
>  
>  		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
> @@ -1286,6 +1336,19 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
>  		}
>  
>  		vdev->pm_runtime_engaged = true;
> +		vdev->pm_runtime_resumed = false;
> +
> +		/*
> +		 * If there is any access when the device is in the runtime
> +		 * suspended state, then the device will be resumed first
> +		 * before access and then the device will be suspended again.
> +		 * Check if this second suspend is allowed and track that in
> +		 * the 'pm_runtime_reentry_allowed' flag.
> +		 */
> +		vdev->pm_runtime_reentry_allowed =
> +			vfio_pci_low_power_reentry_allowed(pdev) &&
> +			!(vfio_pm.flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE);
> +
>  		up_write(&vdev->memory_lock);
>  		pm_runtime_put(&pdev->dev);
>  	} else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
> @@ -1296,6 +1359,20 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
>  		}
>  
>  		vdev->pm_runtime_engaged = false;
> +		if (vdev->pm_runtime_resumed) {
> +			vdev->pm_runtime_resumed = false;
> +			up_write(&vdev->memory_lock);
> +			return 0;
> +		}
> +
> +		/*
> +		 * The 'memory_lock' will be acquired again inside the runtime
> +		 * resume callback. So, increment the usage count inside the
> +		 * lock and call pm_runtime_resume() after releasing the lock.
> +		 * If a wake-up generated at the host races with the current
> +		 * path, the incremented usage count will prevent the device
> +		 * from going into the suspended state.
> +		 */
>  		pm_runtime_get_noresume(&pdev->dev);
>  		up_write(&vdev->memory_lock);
>  		ret = pm_runtime_resume(&pdev->dev);
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index bf4823b008f9..18cc83b767b8 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -126,6 +126,8 @@ struct vfio_pci_core_device {
>  	bool			needs_pm_restore;
>  	bool			pm_intx_masked;
>  	bool			pm_runtime_engaged;
> +	bool			pm_runtime_resumed;
> +	bool			pm_runtime_reentry_allowed;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state
  2022-07-01 11:08 ` [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
@ 2022-07-06 15:40   ` Alex Williamson
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Williamson @ 2022-07-06 15:40 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 1 Jul 2022 16:38:12 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> Currently, if runtime power management is enabled for vfio-pci
> based devices in the guest OS, then the guest OS writes the
> PCI_PM_CTRL register. This write request is handled in
> vfio_pm_config_write(), which performs the actual register write of
> PCI_PM_CTRL. With this, D3hot is the deepest low power state that
> can be achieved. If we use the runtime PM framework instead, then
> we can achieve the D3cold state (on supported systems), which helps
> in saving maximum power.
> 
> 1. D3cold state can't be achieved by writing PCI standard
>    PM config registers. This patch implements the newly added
>    'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' device feature which
>    can be used for putting the device into the D3cold state.
> 
> 2. The hypervisors can implement virtual ACPI methods. For example,
>    in a guest Linux OS, if the PCI device's ACPI node has _PR3 and
>    _PR0 power resources with _ON/_OFF methods, then the guest Linux
>    OS invokes the _OFF method during the D3cold transition and _ON
>    during the D0 transition. The hypervisor can intercept these
>    virtual ACPI calls and then call
>    'VFIO_DEVICE_FEATURE_POWER_MANAGEMENT' with the respective flags.
> 
> 3. The vfio-pci driver uses runtime PM framework to achieve the
>    D3cold state. For the D3cold transition, decrement the usage count and
>    for the D0 transition, increment the usage count.
> 
> 4. If the D3cold state is not supported, then the device will
>    still be in the D3hot state. But with the runtime PM, the root port
>    can now also go into the suspended state.
> 
> 5. The 'pm_runtime_engaged' flag tracks the entry and exit to
>    runtime PM. This flag is protected with 'memory_lock' semaphore.
> 
> 6. At exit time, the flag clearing and usage count increment
>    are protected with 'memory_lock'. The actual wake-up happens
>    outside 'memory_lock', since 'memory_lock' will also be needed
>    inside the runtime_resume callback in subsequent patches.
> 
> 7. In D3cold, all kinds of device-related access (BAR read/write,
>    config read/write, etc.) need to be disabled. For BAR-related access,
>    we can use existing D3hot memory disable support. During the low power
>    entry, invalidate the mmap mappings and add the check for
>    'pm_runtime_engaged' flag.

Not disabled, just wrapped in pm-get/put.  If the device is indefinitely
in low-power without a wake-up eventfd, mmap faults are fatal to the
user.
 
> 8. For config space, ideally, we need to return an error whenever
>    there is any config access from the user side once the user moved the
>    device into low power state. But adding a check for
>    'pm_runtime_engaged' flag alone won't be sufficient, due to the
>    following possible user-side scenario where a config space
>    access happens in parallel with the low power entry IOCTL.
> 
>    a. Config space access happens and vfio_pci_config_rw() will be
>       called.
>    b. The IOCTL to move into low power state is called.
>    c. The IOCTL will move the device into d3cold.
>    d. Exit from vfio_pci_config_rw() happened.
> 
>    Now, if we just check 'pm_runtime_engaged', then in the above
>    sequence the config space access will happen when the device is
>    already in the low power state. To prevent this situation, we
>    increment the usage count before any config space access and
>    decrement it after completing the access. Also, to prevent any
>    similar cases for other types of access, the usage count will be
>    incremented for all kinds of access.

Unnecessary, just wrap in pm-get/put.  Thanks,

Alex

> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci_config.c |   2 +-
>  drivers/vfio/pci/vfio_pci_core.c   | 169 +++++++++++++++++++++++++++--
>  include/linux/vfio_pci_core.h      |   1 +
>  3 files changed, 164 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
> index 9343f597182d..21a4743d011f 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -408,7 +408,7 @@ bool __vfio_pci_memory_enabled(struct vfio_pci_core_device *vdev)
>  	 * PF SR-IOV capability, there's therefore no need to trigger
>  	 * faults based on the virtual value.
>  	 */
> -	return pdev->current_state < PCI_D3hot &&
> +	return !vdev->pm_runtime_engaged && pdev->current_state < PCI_D3hot &&
>  	       (pdev->no_command_memory || (cmd & PCI_COMMAND_MEMORY));
>  }
>  
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 5948d930449b..8c17ca41d156 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -264,6 +264,18 @@ static int vfio_pci_core_runtime_suspend(struct device *dev)
>  {
>  	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>  
> +	down_write(&vdev->memory_lock);
> +	/*
> +	 * The user can move the device into D3hot state before invoking
> +	 * power management IOCTL. Move the device into D0 state here and then
> +	 * the pci-driver core runtime PM suspend function will move the device
> +	 * into the low power state. Also, for the devices which have
> +	 * NoSoftRst-, it will help in restoring the original state
> +	 * (saved locally in 'vdev->pm_save').
> +	 */
> +	vfio_pci_set_power_state(vdev, PCI_D0);
> +	up_write(&vdev->memory_lock);
> +
>  	/*
>  	 * If INTx is enabled, then mask INTx before going into the runtime
>  	 * suspended state and unmask the same in the runtime resume.
> @@ -386,6 +398,7 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  	struct pci_dev *pdev = vdev->pdev;
>  	struct vfio_pci_dummy_resource *dummy_res, *tmp;
>  	struct vfio_pci_ioeventfd *ioeventfd, *ioeventfd_tmp;
> +	bool do_resume = false;
>  	int i, bar;
>  
>  	/* For needs_reset */
> @@ -393,6 +406,25 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  
>  	/*
>  	 * This function can be invoked while the power state is non-D0.
> +	 * This non-D0 power state can be with or without runtime PM.
> +	 * Increment the usage count corresponding to pm_runtime_put()
> +	 * called during setting of 'pm_runtime_engaged'. The device will
> +	 * wake up if it has already gone into the suspended state.
> +	 * Otherwise, the next vfio_pci_set_power_state() will change the
> +	 * device power state to D0.
> +	 */
> +	down_write(&vdev->memory_lock);
> +	if (vdev->pm_runtime_engaged) {
> +		vdev->pm_runtime_engaged = false;
> +		pm_runtime_get_noresume(&pdev->dev);
> +		do_resume = true;
> +	}
> +	up_write(&vdev->memory_lock);
> +
> +	if (do_resume)
> +		pm_runtime_resume(&pdev->dev);
> +
> +	/*
>  	 * This function calls __pci_reset_function_locked() which internally
>  	 * can use pci_pm_reset() for the function reset. pci_pm_reset() will
>  	 * fail if the power state is non-D0. Also, for the devices which
> @@ -1190,6 +1222,99 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
>  
> +static int vfio_pci_pm_validate_flags(u32 flags)
> +{
> +	if (!flags)
> +		return -EINVAL;
> +	/* Only valid flags should be set */
> +	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> +		return -EINVAL;
> +	/* Both enter and exit should not be set */
> +	if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
> +	    (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
> +				    void __user *arg, size_t argsz)
> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct pci_dev *pdev = vdev->pdev;
> +	struct vfio_device_feature_power_management vfio_pm = { 0 };
> +	int ret = 0;
> +
> +	ret = vfio_check_feature(flags, argsz,
> +				 VFIO_DEVICE_FEATURE_SET |
> +				 VFIO_DEVICE_FEATURE_GET,
> +				 sizeof(vfio_pm));
> +	if (ret != 1)
> +		return ret;
> +
> +	if (flags & VFIO_DEVICE_FEATURE_GET) {
> +		down_read(&vdev->memory_lock);
> +		if (vdev->pm_runtime_engaged)
> +			vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
> +		else
> +			vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
> +		up_read(&vdev->memory_lock);
> +
> +		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
> +			return -EFAULT;
> +
> +		return 0;
> +	}
> +
> +	if (copy_from_user(&vfio_pm, arg, sizeof(vfio_pm)))
> +		return -EFAULT;
> +
> +	ret = vfio_pci_pm_validate_flags(vfio_pm.flags);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * The vdev power related flags are protected with 'memory_lock'
> +	 * semaphore.
> +	 */
> +	if (vfio_pm.flags & VFIO_PM_LOW_POWER_ENTER) {
> +		vfio_pci_zap_and_down_write_memory_lock(vdev);
> +		if (vdev->pm_runtime_engaged) {
> +			up_write(&vdev->memory_lock);
> +			return -EINVAL;
> +		}
> +
> +		vdev->pm_runtime_engaged = true;
> +		up_write(&vdev->memory_lock);
> +		pm_runtime_put(&pdev->dev);
> +	} else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
> +		down_write(&vdev->memory_lock);
> +		if (!vdev->pm_runtime_engaged) {
> +			up_write(&vdev->memory_lock);
> +			return -EINVAL;
> +		}
> +
> +		vdev->pm_runtime_engaged = false;
> +		pm_runtime_get_noresume(&pdev->dev);
> +		up_write(&vdev->memory_lock);
> +		ret = pm_runtime_resume(&pdev->dev);
> +		if (ret < 0) {
> +			down_write(&vdev->memory_lock);
> +			if (!vdev->pm_runtime_engaged) {
> +				vdev->pm_runtime_engaged = true;
> +				pm_runtime_put_noidle(&pdev->dev);
> +			}
> +			up_write(&vdev->memory_lock);
> +			return ret;
> +		}
> +	} else {
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
>  static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
>  				       void __user *arg, size_t argsz)
>  {
> @@ -1224,6 +1349,8 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
>  	switch (flags & VFIO_DEVICE_FEATURE_MASK) {
>  	case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
>  		return vfio_pci_core_feature_token(device, flags, arg, argsz);
> +	case VFIO_DEVICE_FEATURE_POWER_MANAGEMENT:
> +		return vfio_pci_core_feature_pm(device, flags, arg, argsz);
>  	default:
>  		return -ENOTTY;
>  	}
> @@ -1234,31 +1361,47 @@ static ssize_t vfio_pci_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>  			   size_t count, loff_t *ppos, bool iswrite)
>  {
>  	unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
> +	int ret;
>  
>  	if (index >= VFIO_PCI_NUM_REGIONS + vdev->num_regions)
>  		return -EINVAL;
>  
> +	ret = pm_runtime_resume_and_get(&vdev->pdev->dev);
> +	if (ret < 0) {
> +		pci_info_ratelimited(vdev->pdev, "runtime resume failed %d\n",
> +				     ret);
> +		return -EIO;
> +	}
> +
>  	switch (index) {
>  	case VFIO_PCI_CONFIG_REGION_INDEX:
> -		return vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_config_rw(vdev, buf, count, ppos, iswrite);
> +		break;
>  
>  	case VFIO_PCI_ROM_REGION_INDEX:
>  		if (iswrite)
> -			return -EINVAL;
> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, false);
> +			ret = -EINVAL;
> +		else
> +			ret = vfio_pci_bar_rw(vdev, buf, count, ppos, false);
> +		break;
>  
>  	case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> -		return vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_bar_rw(vdev, buf, count, ppos, iswrite);
> +		break;
>  
>  	case VFIO_PCI_VGA_REGION_INDEX:
> -		return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
> +		ret = vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
> +		break;
> +
>  	default:
>  		index -= VFIO_PCI_NUM_REGIONS;
> -		return vdev->region[index].ops->rw(vdev, buf,
> +		ret = vdev->region[index].ops->rw(vdev, buf,
>  						   count, ppos, iswrite);
> +		break;
>  	}
>  
> -	return -EINVAL;
> +	pm_runtime_put(&vdev->pdev->dev);
> +	return ret;
>  }
>  
>  ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
> @@ -2157,6 +2300,15 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		goto err_unlock;
>  	}
>  
> +	/*
> +	 * Some of the devices in the dev_set can be in the runtime suspended
> +	 * state. Increment the usage count for all the devices in the dev_set
> +	 * before reset and decrement the same after reset.
> +	 */
> +	ret = vfio_pci_dev_set_pm_runtime_get(dev_set);
> +	if (ret)
> +		goto err_unlock;
> +
>  	list_for_each_entry(cur_vma, &dev_set->device_list, vdev.dev_set_list) {
>  		/*
>  		 * Test whether all the affected devices are contained by the
> @@ -2212,6 +2364,9 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  		else
>  			mutex_unlock(&cur->vma_lock);
>  	}
> +
> +	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list)
> +		pm_runtime_put(&cur->pdev->dev);
>  err_unlock:
>  	mutex_unlock(&dev_set->lock);
>  	return ret;
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index cdfd328ba6b1..bf4823b008f9 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -125,6 +125,7 @@ struct vfio_pci_core_device {
>  	bool			nointx;
>  	bool			needs_pm_restore;
>  	bool			pm_intx_masked;
> +	bool			pm_runtime_engaged;
>  	struct pci_saved_state	*pci_saved_state;
>  	struct pci_saved_state	*pm_save;
>  	int			ioeventfds_nr;



* Re: [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend
  2022-07-06 15:39   ` Alex Williamson
@ 2022-07-08  9:21     ` Abhishek Sahu
  2022-07-08 15:45       ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-08  9:21 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/6/2022 9:09 PM, Alex Williamson wrote:
> On Fri, 1 Jul 2022 16:38:09 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> This patch adds INTx handling during runtime suspend/resume.
>> All the suspend/resume related code for the user to put the device
>> into the low power state will be added in subsequent patches.
>>
>> The INTx are shared among devices. Whenever any INTx interrupt comes
> 
> "The INTx lines may be shared..."
> 
>> for the VFIO devices, then vfio_intx_handler() will be called for each
>> device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx()
> 
> "...device sharing the interrupt."
> 
>> and checks if the interrupt has been generated for the current device.
>> Now, if the device is already in the D3cold state, then the config space
>> can not be read. Attempt to read config space in D3cold state can
>> cause system unresponsiveness in a few systems. To prevent this, mask
>> INTx in runtime suspend callback and unmask the same in runtime resume
>> callback. If INTx has been already masked, then no handling is needed
>> in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and
>> vfio_pci_intx_mask() has been updated to return true if INTx has been
>> masked inside this function.
>>
>> For the runtime suspend that is triggered when there is no user of
>> the VFIO device, is_intx() will return false and these callbacks
>> won't do anything.
>>
>> The MSI/MSI-X are not shared so similar handling should not be
>> needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal()
>> without doing any device-specific config access. When the user performs
>> any config access or IOCTL after receiving the eventfd notification,
>> then the device will be moved to the D0 state first before
>> servicing any request.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_core.c  | 37 +++++++++++++++++++++++++++----
>>  drivers/vfio/pci/vfio_pci_intrs.c |  6 ++++-
>>  include/linux/vfio_pci_core.h     |  3 ++-
>>  3 files changed, 40 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index a0d69ddaf90d..5948d930449b 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>  	return ret;
>>  }
>>  
>> +#ifdef CONFIG_PM
>> +static int vfio_pci_core_runtime_suspend(struct device *dev)
>> +{
>> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>> +
>> +	/*
>> +	 * If INTx is enabled, then mask INTx before going into the runtime
>> +	 * suspended state and unmask the same in the runtime resume.
>> +	 * If INTx has already been masked by the user, then
>> +	 * vfio_pci_intx_mask() will return false and in that case, INTx
>> +	 * should not be unmasked in the runtime resume.
>> +	 */
>> +	vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev));
>> +
>> +	return 0;
>> +}
>> +
>> +static int vfio_pci_core_runtime_resume(struct device *dev)
>> +{
>> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>> +
>> +	if (vdev->pm_intx_masked)
>> +		vfio_pci_intx_unmask(vdev);
>> +
>> +	return 0;
>> +}
>> +#endif /* CONFIG_PM */
>> +
>>  /*
>> - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
>> - * so use structure without any callbacks.
>> - *
>>   * The pci-driver core runtime PM routines always save the device state
>>   * before going into suspended state. If the device is going into low power
>>   * state with only with runtime PM ops, then no explicit handling is needed
>>   * for the devices which have NoSoftRst-.
>>   */
>> -static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
>> +static const struct dev_pm_ops vfio_pci_core_pm_ops = {
>> +	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
>> +			   vfio_pci_core_runtime_resume,
>> +			   NULL)
>> +};
>>  
>>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>>  {
>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
>> index 6069a11fb51a..1a37db99df48 100644
>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
>> @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
>>  		eventfd_signal(vdev->ctx[0].trigger, 1);
>>  }
>>  
>> -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>> +/* Returns true if INTx has been masked by this function. */
>> +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>>  {
>>  	struct pci_dev *pdev = vdev->pdev;
>>  	unsigned long flags;
>> +	bool intx_masked = false;
>>  
>>  	spin_lock_irqsave(&vdev->irqlock, flags);
>>  
>> @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>>  			disable_irq_nosync(pdev->irq);
>>  
>>  		vdev->ctx[0].masked = true;
>> +		intx_masked = true;
>>  	}
>>  
>>  	spin_unlock_irqrestore(&vdev->irqlock, flags);
>> +	return intx_masked;
>>  }
> 
> 
> There's certainly another path through this function that masks the
> interrupt, which makes the definition of this return value a bit
> confusing.

 For our case we should not hit that path, but we can return
 intx_masked as true from that path as well to keep the return value
 consistent.

> Wouldn't it be simpler not to overload the masked flag on
> the interrupt context like this and instead set a new flag on the vdev
> under irqlock to indicate the device is unable to generate interrupts.
> The irq handler would add a test of this flag before any tests that
> would access the device.  Thanks,
> 
> Alex
>  

 We will set this flag inside the runtime_suspend callback, but the
 device can be in a non-D3cold state (for example, if the user has
 disabled D3cold explicitly through sysfs, if D3cold is not supported
 on the platform, etc.). Also, in the D3cold-supported case, the
 device will be in D0 until the PCI core moves it into D3cold. In
 these cases, there is a possibility that the device can generate an
 interrupt. If we add a check in the IRQ handler, then we won't check
 and clear the IRQ status, but the interrupt line will still be
 asserted, which can cause interrupt flooding.

 This was the reason for disabling the interrupt itself instead of
 checking a flag in the IRQ handler.

 Thanks,
 Abhishek

>>  /*
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index 23c176d4b073..cdfd328ba6b1 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -124,6 +124,7 @@ struct vfio_pci_core_device {
>>  	bool			needs_reset;
>>  	bool			nointx;
>>  	bool			needs_pm_restore;
>> +	bool			pm_intx_masked;
>>  	struct pci_saved_state	*pci_saved_state;
>>  	struct pci_saved_state	*pm_save;
>>  	int			ioeventfds_nr;
>> @@ -147,7 +148,7 @@ struct vfio_pci_core_device {
>>  #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
>>  #define irq_is(vdev, type) (vdev->irq_type == type)
>>  
>> -extern void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
>> +extern bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev);
>>  extern void vfio_pci_intx_unmask(struct vfio_pci_core_device *vdev);
>>  
>>  extern int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev,
> 



* Re: [PATCH v4 2/6] vfio: Add a new device feature for the power management
  2022-07-06 15:39   ` Alex Williamson
@ 2022-07-08  9:39     ` Abhishek Sahu
  2022-07-08 16:36       ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-08  9:39 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/6/2022 9:09 PM, Alex Williamson wrote:
> On Fri, 1 Jul 2022 16:38:10 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
>> for the power management in the header file. The implementation for the
>> same will be added in the subsequent patches.
>>
>> With the standard registers, all power states cannot be achieved. The
>> platform-based power management needs to be involved to go into the
>> lowest power state. For all the platform-based power management, this
>> device feature can be used.
>>
>> This device feature uses flags to specify the different operations. In
>> the future, if any more power management functionality is needed then
>> a new flag can be added to it. It supports both GET and SET operations.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 55 insertions(+)
>>
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 733a1cddde30..7e00de5c21ea 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>  };
>>  
>> +/*
>> + * Perform power management-related operations for the VFIO device.
>> + *
>> + * The low power feature uses platform-based power management to move the
>> + * device into the low power state.  This low power state is device-specific.
>> + *
>> + * This device feature uses flags to specify the different operations.
>> + * It supports both the GET and SET operations.
>> + *
>> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
>> + *   state with platform-based power management.  This low power state will be
>> + *   internal to the VFIO driver and the user will not come to know which power
>> + *   state is chosen.  Once the user has moved the VFIO device into the low
>> + *   power state, then the user should not do any device access without moving
>> + *   the device out of the low power state.
> 
> Except we're wrapping device accesses to make this possible.  This
> should probably describe how any discrete access will wake the device
> but ongoing access through mmaps will generate user faults.
> 

 Sure. I will add that details also.

>> + *
>> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
>> + *    state.  This flag should only be set if the user has previously put the
>> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.
> 
> Indenting.
> 
 
 I will fix this.

>> + *
>> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
>> + *
>> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
>> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
>> + *   the host side, then the device will be moved out of the low power state
>> + *   without the user's guest driver involvement.  Some devices require the
>> + *   user's guest driver involvement for each low-power entry.  If this flag is
>> + *   set, then the re-entry to the low power state will be disabled, and the
>> + *   host kernel will not move the device again into the low power state.
>> + *   The VFIO driver internally maintains a list of devices for which low
>> + *   power re-entry is disabled by default and for those devices, the
>> + *   re-entry will be disabled even if the user has not set this flag
>> + *   explicitly.
> 
> Wrong polarity.  The kernel should not maintain the policy.  By default
> every wakeup, whether from host kernel accesses or via user accesses
> that do a pm-get should signal a wakeup to userspace.  Userspace needs
> to opt-out of that wakeup to let the kernel automatically re-enter low
> power and userspace needs to maintain the policy for which devices it
> wants that to occur.
> 
 
 Okay. So that means, on the kernel side, we don’t have to maintain
 the list which currently contains the NVIDIA device IDs. Also, in our
 updated approach, opting out of that wake-up means that the user has
 not provided an eventfd in the feature SET ioctl. Correct?
 
>> + *
>> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
>> + *
>> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
>> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
>> + *
>> + * - If the device is in a normal power state currently, then
>> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
>> + *   power re-entry is disabled by default.  If the device is in the low power
>> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
>> + *   according to the current transition.
> 
> Very confusing semantics.
> 
> What if the feature SET ioctl took an eventfd and that eventfd was one
> time use.  Calling the ioctl would setup the eventfd to notify the user
> on wakeup and call pm-put.  Any access to the device via host, ioctl,
> or region would be wrapped in pm-get/put and the pm-resume handler
> would perform the matching pm-get to balance the feature SET and signal
> the eventfd. 

 This seems a better option. It will help in making the ioctl simpler,
 and we won’t have to add a separate index for PME, which I added in
 patch 6.

> If the user opts-out by not providing a wakeup eventfd,
> then the pm-resume handler does not perform a pm-get. Possibly we
> could even allow mmap access if a wake-up eventfd is provided.

 Sorry, I am not clear on this mmap part. We currently invalidate
 mappings before going into runtime suspend. Now, if the user tries
 to do mmap, then do we need some extra handling in the fault
 handler? I need your help in understanding this part.

> The
> feature GET ioctl would be used to exit low power behavior and would be
> a no-op if the wakeup eventfd had already been signaled.  Thanks,
>
 
 I will use the GET ioctl for low power exit instead of returning the
 current status.
 
 Regards,
 Abhishek

> Alex
> 
>> + */
>> +struct vfio_device_feature_power_management {
>> +	__u32	flags;
>> +#define VFIO_PM_LOW_POWER_ENTER			(1 << 0)
>> +#define VFIO_PM_LOW_POWER_EXIT			(1 << 1)
>> +#define VFIO_PM_LOW_POWER_REENTERY_DISABLE	(1 << 2)
>> +	__u32	reserved;
>> +};
>> +
>> +#define VFIO_DEVICE_FEATURE_POWER_MANAGEMENT	3
>> +
>>  /* -------- API for Type1 VFIO IOMMU -------- */
>>  
>>  /**
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call
  2022-07-06 15:40   ` Alex Williamson
@ 2022-07-08  9:43     ` Abhishek Sahu
  2022-07-08 16:49       ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-08  9:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/6/2022 9:10 PM, Alex Williamson wrote:
> On Fri, 1 Jul 2022 16:38:11 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> The vfio-pci based driver will have runtime power management
>> support where the user can put the device into the low power state
>> and then PCI devices can go into the D3cold state. If the device is
>> in the low power state and the user issues any IOCTL, then the
>> device should be moved out of the low power state first. Once
>> the IOCTL is serviced, then it can go into the low power state again.
>> The runtime PM framework manages this with help of usage count.
>>
>> One option was to add the runtime PM related API's inside vfio-pci
>> driver but some IOCTL (like VFIO_DEVICE_FEATURE) can follow a
>> different path and more IOCTL can be added in the future. Also, the
>> runtime PM will be added for vfio-pci based drivers variant currently,
>> but the other VFIO based drivers can use the same in the
>> future. So, this patch adds the runtime calls runtime-related API in
>> the top-level IOCTL function itself.
>>
>> For the VFIO drivers which do not have runtime power management
>> support currently, the runtime PM API's won't be invoked. Only for
>> vfio-pci based drivers currently, the runtime PM API's will be invoked
>> to increment and decrement the usage count.
> 
> Variant drivers can easily opt-out of runtime pm support by performing
> a gratuitous pm-get in their device-open function.
>  

 Do I need to add this line to the commit message?
 
>> Taking this usage count incremented while servicing IOCTL will make
>> sure that the user won't put the device into low power state when any
>> other IOCTL is being serviced in parallel. Let's consider the
>> following scenario:
>>
>>  1. Some other IOCTL is called.
>>  2. The user has opened another device instance and called the power
>>     management IOCTL for the low power entry.
>>  3. The power management IOCTL moves the device into the low power state.
>>  4. The other IOCTL finishes.
>>
>> If we don't keep the usage count incremented then the device
>> access will happen between step 3 and 4 while the device has already
>> gone into the low power state.
>>
>> The runtime PM API's should not be invoked for
>> VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
>> the runtime power management entry and exit for the VFIO device.
> 
> I think the one-shot interface I proposed in the previous patch avoids
> the need for special handling for these feature ioctls.  Thanks,
> 

 Okay. So, for the low power exit case (meaning the feature GET ioctl
 in the updated case) also, we will trigger the eventfd. Correct?

 Thanks,
 Abhishek
 
> Alex
>  
>> The pm_runtime_resume_and_get() will be the first call so its error
>> should not be propagated to user space directly. For example, if
>> pm_runtime_resume_and_get() can return -EINVAL for the cases where the
>> user has passed the correct argument. So the
>> pm_runtime_resume_and_get() errors have been masked behind -EIO.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/vfio.c | 82 ++++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 74 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
>> index 61e71c1154be..61a8d9f7629a 100644
>> --- a/drivers/vfio/vfio.c
>> +++ b/drivers/vfio/vfio.c
>> @@ -32,6 +32,7 @@
>>  #include <linux/vfio.h>
>>  #include <linux/wait.h>
>>  #include <linux/sched/signal.h>
>> +#include <linux/pm_runtime.h>
>>  #include "vfio.h"
>>  
>>  #define DRIVER_VERSION	"0.3"
>> @@ -1333,6 +1334,39 @@ static const struct file_operations vfio_group_fops = {
>>  	.release	= vfio_group_fops_release,
>>  };
>>  
>> +/*
>> + * Wrapper around pm_runtime_resume_and_get().
>> + * Return error code on failure or 0 on success.
>> + */
>> +static inline int vfio_device_pm_runtime_get(struct vfio_device *device)
>> +{
>> +	struct device *dev = device->dev;
>> +
>> +	if (dev->driver && dev->driver->pm) {
>> +		int ret;
>> +
>> +		ret = pm_runtime_resume_and_get(dev);
>> +		if (ret < 0) {
>> +			dev_info_ratelimited(dev,
>> +				"vfio: runtime resume failed %d\n", ret);
>> +			return -EIO;
>> +		}
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Wrapper around pm_runtime_put().
>> + */
>> +static inline void vfio_device_pm_runtime_put(struct vfio_device *device)
>> +{
>> +	struct device *dev = device->dev;
>> +
>> +	if (dev->driver && dev->driver->pm)
>> +		pm_runtime_put(dev);
>> +}
>> +
>>  /*
>>   * VFIO Device fd
>>   */
>> @@ -1607,6 +1641,8 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>>  {
>>  	size_t minsz = offsetofend(struct vfio_device_feature, flags);
>>  	struct vfio_device_feature feature;
>> +	int ret = 0;
>> +	u16 feature_cmd;
>>  
>>  	if (copy_from_user(&feature, arg, minsz))
>>  		return -EFAULT;
>> @@ -1626,28 +1662,51 @@ static int vfio_ioctl_device_feature(struct vfio_device *device,
>>  	    (feature.flags & VFIO_DEVICE_FEATURE_GET))
>>  		return -EINVAL;
>>  
>> -	switch (feature.flags & VFIO_DEVICE_FEATURE_MASK) {
>> +	feature_cmd = feature.flags & VFIO_DEVICE_FEATURE_MASK;
>> +
>> +	/*
>> +	 * The VFIO_DEVICE_FEATURE_POWER_MANAGEMENT itself performs the runtime
>> +	 * power management entry and exit for the VFIO device, so the runtime
>> +	 * PM API's should not be called for this feature.
>> +	 */
>> +	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT) {
>> +		ret = vfio_device_pm_runtime_get(device);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>> +	switch (feature_cmd) {
>>  	case VFIO_DEVICE_FEATURE_MIGRATION:
>> -		return vfio_ioctl_device_feature_migration(
>> +		ret = vfio_ioctl_device_feature_migration(
>>  			device, feature.flags, arg->data,
>>  			feature.argsz - minsz);
>> +		break;
>>  	case VFIO_DEVICE_FEATURE_MIG_DEVICE_STATE:
>> -		return vfio_ioctl_device_feature_mig_device_state(
>> +		ret = vfio_ioctl_device_feature_mig_device_state(
>>  			device, feature.flags, arg->data,
>>  			feature.argsz - minsz);
>> +		break;
>>  	default:
>>  		if (unlikely(!device->ops->device_feature))
>> -			return -EINVAL;
>> -		return device->ops->device_feature(device, feature.flags,
>> -						   arg->data,
>> -						   feature.argsz - minsz);
>> +			ret = -EINVAL;
>> +		else
>> +			ret = device->ops->device_feature(
>> +				device, feature.flags, arg->data,
>> +				feature.argsz - minsz);
>> +		break;
>>  	}
>> +
>> +	if (feature_cmd != VFIO_DEVICE_FEATURE_POWER_MANAGEMENT)
>> +		vfio_device_pm_runtime_put(device);
>> +
>> +	return ret;
>>  }
>>  
>>  static long vfio_device_fops_unl_ioctl(struct file *filep,
>>  				       unsigned int cmd, unsigned long arg)
>>  {
>>  	struct vfio_device *device = filep->private_data;
>> +	int ret;
>>  
>>  	switch (cmd) {
>>  	case VFIO_DEVICE_FEATURE:
>> @@ -1655,7 +1714,14 @@ static long vfio_device_fops_unl_ioctl(struct file *filep,
>>  	default:
>>  		if (unlikely(!device->ops->ioctl))
>>  			return -EINVAL;
>> -		return device->ops->ioctl(device, cmd, arg);
>> +
>> +		ret = vfio_device_pm_runtime_get(device);
>> +		if (ret)
>> +			return ret;
>> +
>> +		ret = device->ops->ioctl(device, cmd, arg);
>> +		vfio_device_pm_runtime_put(device);
>> +		return ret;
>>  	}
>>  }
>>  
> 



* Re: [PATCH v4 6/6] vfio/pci: Add support for virtual PME
  2022-07-06 15:40   ` Alex Williamson
@ 2022-07-08  9:45     ` Abhishek Sahu
  0 siblings, 0 replies; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-08  9:45 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/6/2022 9:10 PM, Alex Williamson wrote:
> On Fri, 1 Jul 2022 16:38:14 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> If the PCI device is in low power state and the device requires
>> wake-up, then it can generate PME (Power Management Events). Mostly
>> these PME events will be propagated to the root port and then the
>> root port will generate the system interrupt. Then the OS should
>> identify the device which generated the PME and should resume
>> the device.
>>
>> We can implement a similar virtual PME framework where if the device
>> already went into the runtime suspended state and then there is any
>> wake-up on the host side, then it will send the virtual PME
>> notification to the guest. This virtual PME will be helpful for the cases
>> where the device will not be suspended again if there is any wake-up
>> triggered by the host. Following is the overall approach regarding
>> the virtual PME.
>>
>> 1. Add one more event like VFIO_PCI_ERR_IRQ_INDEX named
>>    VFIO_PCI_PME_IRQ_INDEX and do the required code changes to get/set
>>    this new IRQ.
>>
>> 2. From the guest side, the guest needs to enable eventfd for the
>>    virtual PME notification.
>>
>> 3. In the vfio-pci driver, the PME support bits are currently
>>    virtualized and set to 0. We can set PME capability support for all
>>    the power states. This PME capability support is independent of the
>>    physical PME support.
>>
>> 4. The PME enable (PME_En bit in Power Management Control/Status
>>    Register) and PME status (PME_Status bit in Power Management
>>    Control/Status Register) are also virtualized currently.
>>    The write support for PME_En bit can be enabled.
>>
>> 5. The PME_Status bit is a write-1-clear bit where the write with
>>    zero value will have no effect and write with 1 value will clear the
>>    bit. The write for this bit will be trapped inside
>>    vfio_pm_config_write() similar to PCI_PM_CTRL write for PM_STATES.
>>
>> 6. When the host gets a request for resuming the device other than from
>>    low power exit feature IOCTL, then PME_Status bit will be set.
>>    According to [PCIe v5 7.5.2.2],
>>      "PME_Status - This bit is Set when the Function would normally
>>       generate a PME signal. The value of this bit is not affected by
>>       the value of the PME_En bit."
>>
>>    So even if PME_En bit is not set, we can set PME_Status bit.
>>
>> 7. If the guest has enabled PME_En and registered for PME events
>>    through eventfd, then the usage count will be incremented to prevent
>>    the device to go into the suspended state and notify the guest through
>>    eventfd trigger.
>>
>> The virtual PME can help in handling physical PME also. When
>> physical PME comes, then also the runtime resume will be called. If
>> the guest has registered for virtual PME, then it will be sent in this
>> case also.
>>
>> * Implementation for handling the virtual PME on the hypervisor:
>>
>> If we take the implementation in Linux OS, then during runtime suspend
>> time, then it calls __pci_enable_wake(). It internally enables PME
>> through pci_pme_active() and also enables the ACPI side wake-up
>> through platform_pci_set_wakeup(). To handle the PME, the hypervisor has
>> the following two options:
>>
>> 1. Create a virtual root port for the VFIO device and trigger
>>    interrupt when the PME comes. It will call pcie_pme_irq() which will
>>    resume the device.
>>
>> 2. Create a virtual ACPI _PRW resource and associate it with the device
>>    itself. In _PRW, any GPE (General Purpose Event) can be assigned for
>>    the wake-up. When PME comes, then GPE can be triggered by the
>>    hypervisor. GPE interrupt will call pci_acpi_wake_dev() function
>>    internally and it will resume the device.
> 
> Do we really need to implement PME emulation in the kernel or is it
> sufficient for userspace to simply register a one-shot eventfd when
> SET'ing the low power feature and QEMU can provide the PME emulation
> based on that signaling?  Thanks,
> 
> Alex
> 

 It seems this can be implemented in QEMU instead of the kernel.
 I will drop this patch.

 Thanks,
 Abhishek
 
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_config.c | 39 +++++++++++++++++++++------
>>  drivers/vfio/pci/vfio_pci_core.c   | 43 ++++++++++++++++++++++++------
>>  drivers/vfio/pci/vfio_pci_intrs.c  | 18 +++++++++++++
>>  include/linux/vfio_pci_core.h      |  2 ++
>>  include/uapi/linux/vfio.h          |  1 +
>>  5 files changed, 87 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>> index 21a4743d011f..a06375a03758 100644
>> --- a/drivers/vfio/pci/vfio_pci_config.c
>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>> @@ -719,6 +719,20 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
>>  	if (count < 0)
>>  		return count;
>>  
>> +	/*
>> +	 * PME_STATUS is write-1-clear bit. If PME_STATUS is 1, then clear the
>> +	 * bit in vconfig. The PME_STATUS is in the upper byte of the control
>> +	 * register and user can do single byte write also.
>> +	 */
>> +	if (offset <= PCI_PM_CTRL + 1 && offset + count > PCI_PM_CTRL + 1) {
>> +		if (le32_to_cpu(val) &
>> +		    (PCI_PM_CTRL_PME_STATUS >> (offset - PCI_PM_CTRL) * 8)) {
>> +			__le16 *ctrl = (__le16 *)&vdev->vconfig
>> +					[vdev->pm_cap_offset + PCI_PM_CTRL];
>> +			*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
>> +		}
>> +	}
>> +
>>  	if (offset == PCI_PM_CTRL) {
>>  		pci_power_t state;
>>  
>> @@ -771,14 +785,16 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
>>  	 * the user change power state, but we trap and initiate the
>>  	 * change ourselves, so the state bits are read-only.
>>  	 *
>> -	 * The guest can't process PME from D3cold so virtualize PME_Status
>> -	 * and PME_En bits. The vconfig bits will be cleared during device
>> -	 * capability initialization.
>> +	 * The guest can't process physical PME from D3cold so virtualize
>> +	 * PME_Status and PME_En bits. These bits will be used for the
>> +	 * virtual PME between host and guest. The vconfig bits will be
>> +	 * updated during device capability initialization. PME_Status is
>> +	 * write-1-clear bit, so it is read-only. We trap and update the
>> +	 * vconfig bit manually during write.
>>  	 */
>>  	p_setd(perm, PCI_PM_CTRL,
>>  	       PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
>> -	       ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
>> -		 PCI_PM_CTRL_STATE_MASK));
>> +	       ~(PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_PME_STATUS));
>>  
>>  	return 0;
>>  }
>> @@ -1454,8 +1470,13 @@ static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
>>  	__le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
>>  	__le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];
>>  
>> -	/* Clear vconfig PME_Support, PME_Status, and PME_En bits */
>> -	*pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
>> +	/*
>> +	 * Set the vconfig PME_Support bits. The PME_Status is being used for
>> +	 * virtual PME support and is not dependent upon the physical
>> +	 * PME support.
>> +	 */
>> +	*pmc |= cpu_to_le16(PCI_PM_CAP_PME_MASK);
>> +	/* Clear vconfig PME_Support and PME_En bits */
>>  	*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
>>  }
>>  
>> @@ -1582,8 +1603,10 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
>>  		if (ret)
>>  			return ret;
>>  
>> -		if (cap == PCI_CAP_ID_PM)
>> +		if (cap == PCI_CAP_ID_PM) {
>> +			vdev->pm_cap_offset = pos;
>>  			vfio_update_pm_vconfig_bytes(vdev, pos);
>> +		}
>>  
>>  		prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
>>  		pos = next;
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index 1ddaaa6ccef5..6c1225bc2aeb 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -319,14 +319,35 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
>>  	 *   the low power state or closed the device.
>>  	 * - If there is device access on the host side.
>>  	 *
>> -	 * For the second case, check if re-entry to the low power state is
>> -	 * allowed. If not, then increment the usage count so that runtime PM
>> -	 * framework won't suspend the device and set the 'pm_runtime_resumed'
>> -	 * flag.
>> +	 * For the second case:
>> +	 * - The virtual PME_STATUS bit will be set. If PME_ENABLE bit is set
>> +	 *   and user has registered for virtual PME events, then send the PME
>> +	 *   virtual PME event.
>> +	 * - Check if re-entry to the low power state is not allowed.
>> +	 *
>> +	 * For the above conditions, increment the usage count so that
>> +	 * runtime PM framework won't suspend the device and set the
>> +	 * 'pm_runtime_resumed' flag.
>>  	 */
>> -	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
>> -		pm_runtime_get_noresume(dev);
>> -		vdev->pm_runtime_resumed = true;
>> +	if (vdev->pm_runtime_engaged) {
>> +		bool pme_triggered = false;
>> +		__le16 *ctrl = (__le16 *)&vdev->vconfig
>> +				[vdev->pm_cap_offset + PCI_PM_CTRL];
>> +
>> +		*ctrl |= cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
>> +		if (le16_to_cpu(*ctrl) & PCI_PM_CTRL_PME_ENABLE) {
>> +			mutex_lock(&vdev->igate);
>> +			if (vdev->pme_trigger) {
>> +				pme_triggered = true;
>> +				eventfd_signal(vdev->pme_trigger, 1);
>> +			}
>> +			mutex_unlock(&vdev->igate);
>> +		}
>> +
>> +		if (!vdev->pm_runtime_reentry_allowed || pme_triggered) {
>> +			pm_runtime_get_noresume(dev);
>> +			vdev->pm_runtime_resumed = true;
>> +		}
>>  	}
>>  	up_write(&vdev->memory_lock);
>>  
>> @@ -586,6 +607,10 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
>>  		eventfd_ctx_put(vdev->req_trigger);
>>  		vdev->req_trigger = NULL;
>>  	}
>> +	if (vdev->pme_trigger) {
>> +		eventfd_ctx_put(vdev->pme_trigger);
>> +		vdev->pme_trigger = NULL;
>> +	}
>>  	mutex_unlock(&vdev->igate);
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
>> @@ -639,7 +664,8 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
>>  	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
>>  		if (pci_is_pcie(vdev->pdev))
>>  			return 1;
>> -	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
>> +	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX ||
>> +		   irq_type == VFIO_PCI_PME_IRQ_INDEX) {
>>  		return 1;
>>  	}
>>  
>> @@ -985,6 +1011,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>  		switch (info.index) {
>>  		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
>>  		case VFIO_PCI_REQ_IRQ_INDEX:
>> +		case VFIO_PCI_PME_IRQ_INDEX:
>>  			break;
>>  		case VFIO_PCI_ERR_IRQ_INDEX:
>>  			if (pci_is_pcie(vdev->pdev))
>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
>> index 1a37db99df48..db4180687a74 100644
>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
>> @@ -639,6 +639,17 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_core_device *vdev,
>>  					       count, flags, data);
>>  }
>>  
>> +static int vfio_pci_set_pme_trigger(struct vfio_pci_core_device *vdev,
>> +				    unsigned index, unsigned start,
>> +				    unsigned count, uint32_t flags, void *data)
>> +{
>> +	if (index != VFIO_PCI_PME_IRQ_INDEX || start != 0 || count > 1)
>> +		return -EINVAL;
>> +
>> +	return vfio_pci_set_ctx_trigger_single(&vdev->pme_trigger,
>> +					       count, flags, data);
>> +}
>> +
>>  int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
>>  			    unsigned index, unsigned start, unsigned count,
>>  			    void *data)
>> @@ -688,6 +699,13 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
>>  			break;
>>  		}
>>  		break;
>> +	case VFIO_PCI_PME_IRQ_INDEX:
>> +		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
>> +		case VFIO_IRQ_SET_ACTION_TRIGGER:
>> +			func = vfio_pci_set_pme_trigger;
>> +			break;
>> +		}
>> +		break;
>>  	}
>>  
>>  	if (!func)
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index 18cc83b767b8..ee2646d820c2 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -102,6 +102,7 @@ struct vfio_pci_core_device {
>>  	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
>>  	u8			*pci_config_map;
>>  	u8			*vconfig;
>> +	u8			pm_cap_offset;
>>  	struct perm_bits	*msi_perm;
>>  	spinlock_t		irqlock;
>>  	struct mutex		igate;
>> @@ -133,6 +134,7 @@ struct vfio_pci_core_device {
>>  	int			ioeventfds_nr;
>>  	struct eventfd_ctx	*err_trigger;
>>  	struct eventfd_ctx	*req_trigger;
>> +	struct eventfd_ctx	*pme_trigger;
>>  	struct list_head	dummy_resources_list;
>>  	struct mutex		ioeventfds_lock;
>>  	struct list_head	ioeventfds_list;
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 7e00de5c21ea..08170950d655 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -621,6 +621,7 @@ enum {
>>  	VFIO_PCI_MSIX_IRQ_INDEX,
>>  	VFIO_PCI_ERR_IRQ_INDEX,
>>  	VFIO_PCI_REQ_IRQ_INDEX,
>> +	VFIO_PCI_PME_IRQ_INDEX,
>>  	VFIO_PCI_NUM_IRQS
>>  };
>>  
> 



* Re: [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend
  2022-07-08  9:21     ` Abhishek Sahu
@ 2022-07-08 15:45       ` Alex Williamson
  2022-07-11  9:18         ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-08 15:45 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 8 Jul 2022 14:51:30 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/6/2022 9:09 PM, Alex Williamson wrote:
> > On Fri, 1 Jul 2022 16:38:09 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> This patch adds INTx handling during runtime suspend/resume.
> >> All the suspend/resume related code for the user to put the device
> >> into the low power state will be added in subsequent patches.
> >>
> >> The INTx are shared among devices. Whenever any INTx interrupt comes  
> > 
> > "The INTx lines may be shared..."
> >   
> >> for the VFIO devices, then vfio_intx_handler() will be called for each
> >> device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx()  
> > 
> > "...device sharing the interrupt."
> >   
> >> and checks if the interrupt has been generated for the current device.
> >> Now, if the device is already in the D3cold state, then the config space
> >> can not be read. Attempt to read config space in D3cold state can
> >> cause system unresponsiveness in a few systems. To prevent this, mask
> >> INTx in runtime suspend callback and unmask the same in runtime resume
> >> callback. If INTx has been already masked, then no handling is needed
> >> in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and
> >> vfio_pci_intx_mask() has been updated to return true if INTx has been
> >> masked inside this function.
> >>
> >> For the runtime suspend which is triggered for the no user of VFIO
> >> device, the is_intx() will return false and these callbacks won't do
> >> anything.
> >>
> >> The MSI/MSI-X are not shared so similar handling should not be
> >> needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal()
> >> without doing any device-specific config access. When the user performs
> >> any config access or IOCTL after receiving the eventfd notification,
> >> then the device will be moved to the D0 state first before
> >> servicing any request.
> >>
> >> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >> ---
> >>  drivers/vfio/pci/vfio_pci_core.c  | 37 +++++++++++++++++++++++++++----
> >>  drivers/vfio/pci/vfio_pci_intrs.c |  6 ++++-
> >>  include/linux/vfio_pci_core.h     |  3 ++-
> >>  3 files changed, 40 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >> index a0d69ddaf90d..5948d930449b 100644
> >> --- a/drivers/vfio/pci/vfio_pci_core.c
> >> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >> @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
> >>  	return ret;
> >>  }
> >>  
> >> +#ifdef CONFIG_PM
> >> +static int vfio_pci_core_runtime_suspend(struct device *dev)
> >> +{
> >> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> >> +
> >> +	/*
> >> +	 * If INTx is enabled, then mask INTx before going into the runtime
> >> +	 * suspended state and unmask the same in the runtime resume.
> >> +	 * If INTx has already been masked by the user, then
> >> +	 * vfio_pci_intx_mask() will return false and in that case, INTx
> >> +	 * should not be unmasked in the runtime resume.
> >> +	 */
> >> +	vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev));
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static int vfio_pci_core_runtime_resume(struct device *dev)
> >> +{
> >> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> >> +
> >> +	if (vdev->pm_intx_masked)
> >> +		vfio_pci_intx_unmask(vdev);
> >> +
> >> +	return 0;
> >> +}
> >> +#endif /* CONFIG_PM */
> >> +
> >>  /*
> >> - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
> >> - * so use structure without any callbacks.
> >> - *
> >>   * The pci-driver core runtime PM routines always save the device state
> >>   * before going into suspended state. If the device is going into low power
> >>   * state with only with runtime PM ops, then no explicit handling is needed
> >>   * for the devices which have NoSoftRst-.
> >>   */
> >> -static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
> >> +static const struct dev_pm_ops vfio_pci_core_pm_ops = {
> >> +	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
> >> +			   vfio_pci_core_runtime_resume,
> >> +			   NULL)
> >> +};
> >>  
> >>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
> >>  {
> >> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> >> index 6069a11fb51a..1a37db99df48 100644
> >> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> >> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> >> @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
> >>  		eventfd_signal(vdev->ctx[0].trigger, 1);
> >>  }
> >>  
> >> -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
> >> +/* Returns true if INTx has been masked by this function. */
> >> +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
> >>  {
> >>  	struct pci_dev *pdev = vdev->pdev;
> >>  	unsigned long flags;
> >> +	bool intx_masked = false;
> >>  
> >>  	spin_lock_irqsave(&vdev->irqlock, flags);
> >>  
> >> @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
> >>  			disable_irq_nosync(pdev->irq);
> >>  
> >>  		vdev->ctx[0].masked = true;
> >> +		intx_masked = true;
> >>  	}
> >>  
> >>  	spin_unlock_irqrestore(&vdev->irqlock, flags);
> >> +	return intx_masked;
> >>  }  
> > 
> > 
> > There's certainly another path through this function that masks the
> > interrupt, which makes the definition of this return value a bit
> > confusing.  
> 
>  For our case, we should not hit that path. But we can return
>  intx_masked as true from that path as well, to align the return
>  value.
> 
> > Wouldn't it be simpler not to overload the masked flag on
> > the interrupt context like this and instead set a new flag on the vdev
> > under irqlock to indicate the device is unable to generate interrupts.
> > The irq handler would add a test of this flag before any tests that
> > would access the device.  Thanks,
> > 
> > Alex
> >    
> 
>  We will set this flag inside the runtime_suspend callback, but the
>  device can be in a non-D3cold state (for example, if the user has
>  disabled D3cold explicitly through sysfs, if D3cold is not supported
>  on the platform, etc.). Also, in the D3cold-supported case, the
>  device will be in D0 until the PCI core moves the device into
>  D3cold. In this case, there is a possibility that the device can
>  generate an interrupt. If we add a check in the IRQ handler, then we
>  won’t check and clear the IRQ status, but the interrupt line will
>  still be asserted, which can cause interrupt flooding.
> 
>  This was the reason for disabling the interrupt itself instead of
>  checking a flag in the IRQ handler.

Ok, maybe this is largely a clarification of the return value of
vfio_pci_intx_mask().  I think what you're looking for is whether the
context mask was changed, rather than whether the interrupt is masked,
which I think avoids the confusion regarding whether the first branch
should return true or false.  So the variable should be something like
"masked_changed" and the comment changed to "Returns true if the INTx
vfio_pci_irq_ctx.masked value is changed".

Testing is_intx() outside of the irqlock is potentially racy, so do we
need to add the pm-get/put wrappers on ioctls first to avoid the
possibility that pm-suspend can race a SET_IRQS ioctl?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/6] vfio: Add a new device feature for the power management
  2022-07-08  9:39     ` Abhishek Sahu
@ 2022-07-08 16:36       ` Alex Williamson
  2022-07-11  9:43         ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-08 16:36 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 8 Jul 2022 15:09:22 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/6/2022 9:09 PM, Alex Williamson wrote:
> > On Fri, 1 Jul 2022 16:38:10 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
> >> for the power management in the header file. The implementation for the
> >> same will be added in the subsequent patches.
> >>
> >> With the standard registers, all power states cannot be achieved. The
> >> platform-based power management needs to be involved to go into the
> >> lowest power state. For all the platform-based power management, this
> >> device feature can be used.
> >>
> >> This device feature uses flags to specify the different operations. In
> >> the future, if any more power management functionality is needed then
> >> a new flag can be added to it. It supports both GET and SET operations.
> >>
> >> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >> ---
> >>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 55 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >> index 733a1cddde30..7e00de5c21ea 100644
> >> --- a/include/uapi/linux/vfio.h
> >> +++ b/include/uapi/linux/vfio.h
> >> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
> >>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> >>  };
> >>  
> >> +/*
> >> + * Perform power management-related operations for the VFIO device.
> >> + *
> >> + * The low power feature uses platform-based power management to move the
> >> + * device into the low power state.  This low power state is device-specific.
> >> + *
> >> + * This device feature uses flags to specify the different operations.
> >> + * It supports both the GET and SET operations.
> >> + *
> >> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
> >> + *   state with platform-based power management.  This low power state will be
> >> + *   internal to the VFIO driver and the user will not come to know which power
> >> + *   state is chosen.  Once the user has moved the VFIO device into the low
> >> + *   power state, then the user should not do any device access without moving
> >> + *   the device out of the low power state.  
> > 
> > Except we're wrapping device accesses to make this possible.  This
> > should probably describe how any discrete access will wake the device
> > but ongoing access through mmaps will generate user faults.
> >   
> 
>  Sure. I will add that details also.
> 
> >> + *
> >> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
> >> + *    state.  This flag should only be set if the user has previously put the
> >> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.  
> > 
> > Indenting.
> >   
>  
>  I will fix this.
> 
> >> + *
> >> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
> >> + *
> >> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
> >> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
> >> + *   the host side, then the device will be moved out of the low power state
> >> + *   without the user's guest driver involvement.  Some devices require the
> >> + *   user's guest driver involvement for each low-power entry.  If this flag is
> >> + *   set, then the re-entry to the low power state will be disabled, and the
> >> + *   host kernel will not move the device again into the low power state.
> >> + *   The VFIO driver internally maintains a list of devices for which low
> >> + *   power re-entry is disabled by default and for those devices, the
> >> + *   re-entry will be disabled even if the user has not set this flag
> >> + *   explicitly.  
> > 
> > Wrong polarity.  The kernel should not maintain the policy.  By default
> > every wakeup, whether from host kernel accesses or via user accesses
> > that do a pm-get should signal a wakeup to userspace.  Userspace needs
> > to opt-out of that wakeup to let the kernel automatically re-enter low
> > power and userspace needs to maintain the policy for which devices it
> > wants that to occur.
> >   
>  
>  Okay. So that means, on the kernel side, we don't have to maintain
>  the list which currently contains the NVIDIA device IDs. Also, in our
>  updated approach, opting out of that wake-up means that the user
>  has not provided an eventfd in the feature SET ioctl. Correct?

Yes, I'm imagining that if the user hasn't provided a one-shot wake-up
eventfd, that's the opt-out for being notified of device wakes.  For
example, pm-resume would have something like:

	
	if (vdev->pm_wake_eventfd) {
		eventfd_signal(vdev->pm_wake_eventfd, 1);
		vdev->pm_wake_eventfd = NULL;
		pm_runtime_get_noresume(dev);
	}

(eventfd pseudo handling substantially simplified)

So w/o a wake-up eventfd, the user would need to call the pm feature
exit ioctl to elevate the pm reference to prevent it going back to low
power.  The pm feature exit ioctl would be optional if a wake eventfd is
provided, so some piece of the eventfd context would need to remain to
determine whether a pm-get is necessary.
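A rough userspace model of this one-shot scheme (the struct, field, and function names are illustrative stand-ins, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed one-shot wake eventfd: the first wake
 * signals the user and takes a pm reference so the device stays
 * resumed; a later feature-exit ioctl only needs its own reference
 * when no eventfd consumed the wake first. */
struct pm_model {
	bool wake_eventfd_armed;  /* user provided a one-shot eventfd */
	int usage_count;          /* stand-in for the runtime PM usage count */
	int eventfd_signals;      /* how many times the user was notified */
};

/* What pm-resume would do, per the pseudo code above. */
static void model_pm_resume(struct pm_model *m)
{
	if (m->wake_eventfd_armed) {
		m->eventfd_signals++;            /* eventfd_signal(...) */
		m->wake_eventfd_armed = false;   /* one-shot: disarm */
		m->usage_count++;                /* pm_runtime_get_noresume() */
	}
}

/* What the pm feature-exit ioctl would do: take a reference only if
 * the wake path has not already taken one. */
static void model_feature_exit(struct pm_model *m)
{
	if (m->wake_eventfd_armed) {
		m->wake_eventfd_armed = false;
		m->usage_count++;  /* balance the pm-put from feature enter */
	}
}
```

Either path leaves exactly one reference held, which is the piece of eventfd context that must survive to decide whether the exit ioctl's pm-get is necessary.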

> >> + *
> >> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
> >> + *
> >> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
> >> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
> >> + *
> >> + * - If the device is in a normal power state currently, then
> >> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
> >> + *   power re-entry is disabled by default.  If the device is in the low power
> >> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
> >> + *   according to the current transition.  
> > 
> > Very confusing semantics.
> > 
> > What if the feature SET ioctl took an eventfd and that eventfd was one
> > time use.  Calling the ioctl would setup the eventfd to notify the user
> > on wakeup and call pm-put.  Any access to the device via host, ioctl,
> > or region would be wrapped in pm-get/put and the pm-resume handler
> > would perform the matching pm-get to balance the feature SET and signal
> > the eventfd.   
> 
>  This seems like a better option. It will help make the ioctl
>  simpler, and we won't have to add the separate index for PME which I
>  added in patch 6.
> 
> > If the user opts-out by not providing a wakeup eventfd,
> > then the pm-resume handler does not perform a pm-get. Possibly we
> > could even allow mmap access if a wake-up eventfd is provided.  
> 
>  Sorry, I am not clear on the mmap part. We currently invalidate the
>  mappings before going into runtime suspend. Now, if the user tries to
>  access the mmap, do we need some extra handling in the fault handler?
>  I need your help in understanding this part.

The option that I'm thinking about is if the mmap fault handler is
wrapped in a pm-get/put then we could actually populate the mmap.  In
the case where the pm-get triggers the wake-eventfd in pm-resume, the
device doesn't return to low power when the mmap fault handler calls
pm-put.  This possibly allows that we could actually invalidate mmaps on
pm-suspend rather than in the pm feature enter ioctl, essentially the
same as we're doing for intx.  I wonder though if this allows the
possibility that we just bounce between mmap fault and pm-suspend.  So
long as some work can be done, for instance the pm-suspend occurs
asynchronously to the pm-put, this might be ok.
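Sketched as a userspace model (all names illustrative, not the real fault handler), the wrapping would look roughly like:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of wrapping the mmap fault handler in pm-get/put.  With a
 * one-shot wake eventfd armed, the pm-get in the fault path triggers a
 * resume that leaves an extra reference behind, so the pm-put at the
 * end of the fault does not drop the device back to low power. */
struct fault_model {
	bool wake_armed;
	int usage;           /* runtime PM usage count stand-in */
	bool suspended;
	bool mmap_populated;
};

static void fm_get(struct fault_model *m)
{
	m->usage++;
	if (m->suspended) {          /* pm-resume path */
		m->suspended = false;
		if (m->wake_armed) { /* one-shot wake: keep a reference */
			m->wake_armed = false;
			m->usage++;  /* pm_runtime_get_noresume() */
		}
	}
}

static void fm_put(struct fault_model *m)
{
	if (--m->usage == 0)
		m->suspended = true; /* idle: autosuspend */
}

static void fm_fault(struct fault_model *m)
{
	fm_get(m);
	m->mmap_populated = true;    /* safe: device is in D0 here */
	fm_put(m);
}
```

The key property is that the pm-put at the end of the fault does not drop the last reference when a wake eventfd was armed, so the fault/suspend bounce only arises in the opt-out case.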

> > The
> > feature GET ioctl would be used to exit low power behavior and would be
> > a no-op if the wakeup eventfd had already been signaled.  Thanks,
> >  
>  
>  I will use the GET ioctl for low power exit instead of returning the
>  current status.

Note that Yishai is proposing a device DMA dirty logging feature where
the stop and start are exposed via SET on separate features, rather
than SET/GET.  We should probably maintain some consistency between
these use cases.  Possibly we might even want two separate pm enter
ioctls, one with the wake eventfd and one without.  I think this is the
sort of thing Jason is describing for future expansion of the dirty
tracking uAPI.  Thanks,

Alex



* Re: [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call
  2022-07-08  9:43     ` Abhishek Sahu
@ 2022-07-08 16:49       ` Alex Williamson
  2022-07-11  9:50         ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-08 16:49 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Fri, 8 Jul 2022 15:13:16 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/6/2022 9:10 PM, Alex Williamson wrote:
> > On Fri, 1 Jul 2022 16:38:11 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> The vfio-pci based driver will have runtime power management
> >> support where the user can put the device into the low power state
> >> and then PCI devices can go into the D3cold state. If the device is
> >> in the low power state and the user issues any IOCTL, then the
> >> device should be moved out of the low power state first. Once
> >> the IOCTL is serviced, then it can go into the low power state again.
> >> The runtime PM framework manages this with help of usage count.
> >>
> >> One option was to add the runtime PM related API's inside vfio-pci
> >> driver but some IOCTL (like VFIO_DEVICE_FEATURE) can follow a
> >> different path and more IOCTL can be added in the future. Also, the
> >> runtime PM will be added for vfio-pci based drivers variant currently,
> >> but the other VFIO based drivers can use the same in the
> >> future. So, this patch adds the runtime calls runtime-related API in
> >> the top-level IOCTL function itself.
> >>
> >> For the VFIO drivers which do not have runtime power management
> >> support currently, the runtime PM API's won't be invoked. Only for
> >> vfio-pci based drivers currently, the runtime PM API's will be invoked
> >> to increment and decrement the usage count.  
> > 
> > Variant drivers can easily opt-out of runtime pm support by performing
> > a gratuitous pm-get in their device-open function.
> >    
> 
>  Do I need to add this line in the commit message?

Maybe I misinterpreted, but my initial read was that there was some
sort of opt-in, which there is by providing pm-runtime support in the
driver, which vfio-pci-core does for all vfio-pci variant drivers.  But
there's also an opt-out, where a vfio-pci variant driver might not want
pm-runtime support and could accomplish that by bumping the
pm reference count on device-open such that the user cannot trigger a
pm-suspend.
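In usage-count terms, the opt-out amounts to something like this toy model (names are illustrative, not the variant-driver API):

```c
#include <assert.h>

/* Toy model of the opt-out: a variant driver that does a gratuitous
 * pm-get in its open path keeps the usage count elevated, so the
 * device can never be idled into low power while open. */
struct dev_model { int usage; };

static void model_open(struct dev_model *d, int opt_out)
{
	if (opt_out)
		d->usage++;  /* gratuitous pm_runtime_get() in device-open */
}

/* Low-power entry only succeeds once no references are held. */
static int model_low_power_enter(struct dev_model *d)
{
	return d->usage == 0;  /* 1 = device is allowed to suspend */
}
```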

> >> Taking this usage count incremented while servicing IOCTL will make
> >> sure that the user won't put the device into low power state when any
> >> other IOCTL is being serviced in parallel. Let's consider the
> >> following scenario:
> >>
> >>  1. Some other IOCTL is called.
> >>  2. The user has opened another device instance and called the power
> >>     management IOCTL for the low power entry.
> >>  3. The power management IOCTL moves the device into the low power state.
> >>  4. The other IOCTL finishes.
> >>
> >> If we don't keep the usage count incremented then the device
> >> access will happen between step 3 and 4 while the device has already
> >> gone into the low power state.
> >>
> >> The runtime PM API's should not be invoked for
> >> VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
> >> the runtime power management entry and exit for the VFIO device.  
> > 
> > I think the one-shot interface I proposed in the previous patch avoids
> > the need for special handling for these feature ioctls.  Thanks,
> >   
> 
>  Okay. So for the low power exit case (i.e., the feature GET ioctl in
>  the updated approach) also, we will trigger the eventfd. Correct?

If all ioctls are wrapped in pm-get/put, then the pm feature exit ioctl
would wakeup and signal the eventfd via the pm-get.  I'm not sure if
it's worthwhile to try to suppress this eventfd.  Do you foresee any
issues?  Thanks,

Alex
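The four-step race in the quoted commit message can be reduced to a usage-count model (illustrative names only):

```c
#include <assert.h>

/* Toy model of the race: while any ioctl holds the runtime PM usage
 * count, a concurrent low-power entry from another device instance
 * cannot suspend the device underneath it. */
struct ioctl_model {
	int usage;     /* stand-in for the runtime PM usage count */
	int suspended;
};

static void ioctl_enter(struct ioctl_model *m) { m->usage++; } /* pm-get */
static void ioctl_exit(struct ioctl_model *m)  { m->usage--; } /* pm-put */

/* Low-power entry only idles the device once no ioctl is in flight. */
static int try_low_power(struct ioctl_model *m)
{
	if (m->usage > 0)
		return 0;  /* blocked until the in-flight ioctl completes */
	m->suspended = 1;
	return 1;
}
```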



* Re: [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend
  2022-07-08 15:45       ` Alex Williamson
@ 2022-07-11  9:18         ` Abhishek Sahu
  2022-07-11 12:57           ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-11  9:18 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/8/2022 9:15 PM, Alex Williamson wrote:
> On Fri, 8 Jul 2022 14:51:30 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 7/6/2022 9:09 PM, Alex Williamson wrote:
>>> On Fri, 1 Jul 2022 16:38:09 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> This patch adds INTx handling during runtime suspend/resume.
>>>> All the suspend/resume related code for the user to put the device
>>>> into the low power state will be added in subsequent patches.
>>>>
>>>> The INTx are shared among devices. Whenever any INTx interrupt comes  
>>>
>>> "The INTx lines may be shared..."
>>>   
>>>> for the VFIO devices, then vfio_intx_handler() will be called for each
>>>> device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx()  
>>>
>>> "...device sharing the interrupt."
>>>   
>>>> and checks if the interrupt has been generated for the current device.
>>>> Now, if the device is already in the D3cold state, then the config space
>>>> can not be read. Attempt to read config space in D3cold state can
>>>> cause system unresponsiveness in a few systems. To prevent this, mask
>>>> INTx in runtime suspend callback and unmask the same in runtime resume
>>>> callback. If INTx has been already masked, then no handling is needed
>>>> in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and
>>>> vfio_pci_intx_mask() has been updated to return true if INTx has been
>>>> masked inside this function.
>>>>
>>>> For the runtime suspend which is triggered when there is no user of
>>>> the VFIO device, is_intx() will return false and these callbacks
>>>> won't do anything.
>>>>
>>>> The MSI/MSI-X are not shared so similar handling should not be
>>>> needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal()
>>>> without doing any device-specific config access. When the user performs
>>>> any config access or IOCTL after receiving the eventfd notification,
>>>> then the device will be moved to the D0 state first before
>>>> servicing any request.
>>>>
>>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>>>> ---
>>>>  drivers/vfio/pci/vfio_pci_core.c  | 37 +++++++++++++++++++++++++++----
>>>>  drivers/vfio/pci/vfio_pci_intrs.c |  6 ++++-
>>>>  include/linux/vfio_pci_core.h     |  3 ++-
>>>>  3 files changed, 40 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>>>> index a0d69ddaf90d..5948d930449b 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_core.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>>>> @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>>>  	return ret;
>>>>  }
>>>>  
>>>> +#ifdef CONFIG_PM
>>>> +static int vfio_pci_core_runtime_suspend(struct device *dev)
>>>> +{
>>>> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>>>> +
>>>> +	/*
>>>> +	 * If INTx is enabled, then mask INTx before going into the runtime
>>>> +	 * suspended state and unmask the same in the runtime resume.
>>>> +	 * If INTx has already been masked by the user, then
>>>> +	 * vfio_pci_intx_mask() will return false and in that case, INTx
>>>> +	 * should not be unmasked in the runtime resume.
>>>> +	 */
>>>> +	vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev));
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +static int vfio_pci_core_runtime_resume(struct device *dev)
>>>> +{
>>>> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
>>>> +
>>>> +	if (vdev->pm_intx_masked)
>>>> +		vfio_pci_intx_unmask(vdev);
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +#endif /* CONFIG_PM */
>>>> +
>>>>  /*
>>>> - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
>>>> - * so use structure without any callbacks.
>>>> - *
>>>>   * The pci-driver core runtime PM routines always save the device state
>>>>   * before going into suspended state. If the device is going into low power
>>>>   * state with only with runtime PM ops, then no explicit handling is needed
>>>>   * for the devices which have NoSoftRst-.
>>>>   */
>>>> -static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
>>>> +static const struct dev_pm_ops vfio_pci_core_pm_ops = {
>>>> +	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
>>>> +			   vfio_pci_core_runtime_resume,
>>>> +			   NULL)
>>>> +};
>>>>  
>>>>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>>>>  {
>>>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
>>>> index 6069a11fb51a..1a37db99df48 100644
>>>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
>>>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
>>>> @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
>>>>  		eventfd_signal(vdev->ctx[0].trigger, 1);
>>>>  }
>>>>  
>>>> -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>>>> +/* Returns true if INTx has been masked by this function. */
>>>> +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>>>>  {
>>>>  	struct pci_dev *pdev = vdev->pdev;
>>>>  	unsigned long flags;
>>>> +	bool intx_masked = false;
>>>>  
>>>>  	spin_lock_irqsave(&vdev->irqlock, flags);
>>>>  
>>>> @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
>>>>  			disable_irq_nosync(pdev->irq);
>>>>  
>>>>  		vdev->ctx[0].masked = true;
>>>> +		intx_masked = true;
>>>>  	}
>>>>  
>>>>  	spin_unlock_irqrestore(&vdev->irqlock, flags);
>>>> +	return intx_masked;
>>>>  }  
>>>
>>>
>>> There's certainly another path through this function that masks the
>>> interrupt, which makes the definition of this return value a bit
>>> confusing.  
>>
>>  In our case we should not hit that path, but we can return
>>  intx_masked as true from that path as well to align the return value.
>>
>>> Wouldn't it be simpler not to overload the masked flag on
>>> the interrupt context like this and instead set a new flag on the vdev
>>> under irqlock to indicate the device is unable to generate interrupts.
>>> The irq handler would add a test of this flag before any tests that
>>> would access the device.  Thanks,
>>>
>>> Alex
>>>    
>>
>>  We will set this flag inside the runtime_suspend callback, but the
>>  device can be in a non-D3cold state (for example, if the user has
>>  disabled D3cold explicitly via sysfs, if D3cold is not supported on
>>  the platform, etc.). Also, in the D3cold supported case, the device
>>  will be in D0 until the PCI core moves the device into D3cold. In
>>  this case, there is a possibility that the device can generate an
>>  interrupt. If we add the check in the IRQ handler, then we won't
>>  check and clear the IRQ status, but the interrupt line will still be
>>  asserted, which can cause interrupt flooding.
>>
>>  This was the reason for disabling the interrupt itself instead of
>>  checking a flag in the IRQ handler.
> 
> Ok, maybe this is largely a clarification of the return value of
> vfio_pci_intx_mask().  I think what you're looking for is whether the
> context mask was changed, rather than whether the interrupt is masked,
> which I think avoids the confusion regarding whether the first branch
> should return true or false.  So the variable should be something like
> "masked_changed" and the comment changed to "Returns true if the INTx
> vfio_pci_irq_ctx.masked value is changed".
> 

 Thanks Alex.
 I will rename the variable and update the comment.
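For reference, a minimal userspace model of the agreed semantics — the helper reports whether it changed the masked value, not whether INTx ended up masked (locking and the real disable path omitted):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the rename: return whether this call changed
 * vfio_pci_irq_ctx.masked, so the caller knows whether to unmask
 * on resume.  Names and locking are simplified stand-ins. */
struct intx_ctx { bool masked; };

/* Returns true if the INTx masked value is changed by this call. */
static bool model_intx_mask(struct intx_ctx *ctx)
{
	bool masked_changed = false;

	if (!ctx->masked) {  /* e.g. disable_irq_nosync() would go here */
		ctx->masked = true;
		masked_changed = true;
	}
	return masked_changed;
}
```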

> Testing is_intx() outside of the irqlock is potentially racy, so do we
> need to add the pm-get/put wrappers on ioctls first to avoid the
> possibility that pm-suspend can race a SET_IRQS ioctl?  Thanks,
> 
> Alex
> 
 
 Even after adding this patch, runtime suspend will not be supported
 for devices with users. It will be supported only after patch 4 in
 this patch series. So with this patch, pm-suspend can be called only
 in the cases where the vfio device has no user, and there we should
 not see the race condition.

 But I can move the patch related to the pm-get/put wrappers on
 ioctls before this patch since both are independent.

 Regards,
 Abhishek


* Re: [PATCH v4 2/6] vfio: Add a new device feature for the power management
  2022-07-08 16:36       ` Alex Williamson
@ 2022-07-11  9:43         ` Abhishek Sahu
  2022-07-11 13:04           ` Alex Williamson
  0 siblings, 1 reply; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-11  9:43 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/8/2022 10:06 PM, Alex Williamson wrote:
> On Fri, 8 Jul 2022 15:09:22 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 7/6/2022 9:09 PM, Alex Williamson wrote:
>>> On Fri, 1 Jul 2022 16:38:10 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
>>>> for the power management in the header file. The implementation for the
>>>> same will be added in the subsequent patches.
>>>>
>>>> With the standard registers, all power states cannot be achieved. The
>>>> platform-based power management needs to be involved to go into the
>>>> lowest power state. For all the platform-based power management, this
>>>> device feature can be used.
>>>>
>>>> This device feature uses flags to specify the different operations. In
>>>> the future, if any more power management functionality is needed then
>>>> a new flag can be added to it. It supports both GET and SET operations.
>>>>
>>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>>>> ---
>>>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 55 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>> index 733a1cddde30..7e00de5c21ea 100644
>>>> --- a/include/uapi/linux/vfio.h
>>>> +++ b/include/uapi/linux/vfio.h
>>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>>>  };
>>>>  
>>>> +/*
>>>> + * Perform power management-related operations for the VFIO device.
>>>> + *
>>>> + * The low power feature uses platform-based power management to move the
>>>> + * device into the low power state.  This low power state is device-specific.
>>>> + *
>>>> + * This device feature uses flags to specify the different operations.
>>>> + * It supports both the GET and SET operations.
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
>>>> + *   state with platform-based power management.  This low power state will be
>>>> + *   internal to the VFIO driver and the user will not come to know which power
>>>> + *   state is chosen.  Once the user has moved the VFIO device into the low
>>>> + *   power state, then the user should not do any device access without moving
>>>> + *   the device out of the low power state.  
>>>
>>> Except we're wrapping device accesses to make this possible.  This
>>> should probably describe how any discrete access will wake the device
>>> but ongoing access through mmaps will generate user faults.
>>>   
>>
>>  Sure. I will add that details also.
>>
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
>>>> + *    state.  This flag should only be set if the user has previously put the
>>>> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.  
>>>
>>> Indenting.
>>>   
>>  
>>  I will fix this.
>>
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
>>>> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
>>>> + *   the host side, then the device will be moved out of the low power state
>>>> + *   without the user's guest driver involvement.  Some devices require the
>>>> + *   user's guest driver involvement for each low-power entry.  If this flag is
>>>> + *   set, then the re-entry to the low power state will be disabled, and the
>>>> + *   host kernel will not move the device again into the low power state.
>>>> + *   The VFIO driver internally maintains a list of devices for which low
>>>> + *   power re-entry is disabled by default and for those devices, the
>>>> + *   re-entry will be disabled even if the user has not set this flag
>>>> + *   explicitly.  
>>>
>>> Wrong polarity.  The kernel should not maintain the policy.  By default
>>> every wakeup, whether from host kernel accesses or via user accesses
>>> that do a pm-get should signal a wakeup to userspace.  Userspace needs
>>> to opt-out of that wakeup to let the kernel automatically re-enter low
>>> power and userspace needs to maintain the policy for which devices it
>>> wants that to occur.
>>>   
>>  
>>  Okay. So that means, on the kernel side, we don't have to maintain
>>  the list which currently contains the NVIDIA device IDs. Also, in
>>  our updated approach, opting out of that wake-up means that the user
>>  has not provided an eventfd in the feature SET ioctl. Correct?
> 
> Yes, I'm imagining that if the user hasn't provided a one-shot wake-up
> eventfd, that's the opt-out for being notified of device wakes.  For
> example, pm-resume would have something like:
> 
> 	
> 	if (vdev->pm_wake_eventfd) {
> 		eventfd_signal(vdev->pm_wake_eventfd, 1);
> 		vdev->pm_wake_eventfd = NULL;
> 		pm_runtime_get_noresume(dev);
> 	}
> 
> (eventfd pseudo handling substantially simplified)
> 
> So w/o a wake-up eventfd, the user would need to call the pm feature
> exit ioctl to elevate the pm reference to prevent it going back to low
> power.  The pm feature exit ioctl would be optional if a wake eventfd is
> provided, so some piece of the eventfd context would need to remain to
> determine whether a pm-get is necessary.
> 
>>>> + *
>>>> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
>>>> + *
>>>> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
>>>> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
>>>> + *
>>>> + * - If the device is in a normal power state currently, then
>>>> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
>>>> + *   power re-entry is disabled by default.  If the device is in the low power
>>>> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
>>>> + *   according to the current transition.  
>>>
>>> Very confusing semantics.
>>>
>>> What if the feature SET ioctl took an eventfd and that eventfd was one
>>> time use.  Calling the ioctl would setup the eventfd to notify the user
>>> on wakeup and call pm-put.  Any access to the device via host, ioctl,
>>> or region would be wrapped in pm-get/put and the pm-resume handler
>>> would perform the matching pm-get to balance the feature SET and signal
>>> the eventfd.   
>>
>>  This seems like a better option. It will help make the ioctl
>>  simpler, and we won't have to add the separate index for PME which I
>>  added in patch 6.
>>
>>> If the user opts-out by not providing a wakeup eventfd,
>>> then the pm-resume handler does not perform a pm-get. Possibly we
>>> could even allow mmap access if a wake-up eventfd is provided.  
>>
>>  Sorry, I am not clear on the mmap part. We currently invalidate the
>>  mappings before going into runtime suspend. Now, if the user tries
>>  to access the mmap, do we need some extra handling in the fault
>>  handler?  I need your help in understanding this part.
> 
> The option that I'm thinking about is if the mmap fault handler is
> wrapped in a pm-get/put then we could actually populate the mmap.  In
> the case where the pm-get triggers the wake-eventfd in pm-resume, the
> device doesn't return to low power when the mmap fault handler calls
> pm-put.  This possibly allows that we could actually invalidate mmaps on
> pm-suspend rather than in the pm feature enter ioctl, essentially the
> same as we're doing for intx.  I wonder though if this allows the
> possibility that we just bounce between mmap fault and pm-suspend.  So
> long as some work can be done, for instance the pm-suspend occurs
> asynchronously to the pm-put, this might be ok.
> 

 We can do this. But in the normal use case, the situation should
 never arise where the user accesses any mmapped region after having
 already put the device into D3 (D3hot or D3cold). This can only happen
 if there is some bug in the guest driver or the user is performing a
 wrong sequence. Do we need to add handling to officially support this
 case?

 pm-get can take more than a second to resume some devices; will doing
 this in the fault handler be safe?

 Also, we would add this support only when a wake-eventfd is provided,
 so in the case without a wake-eventfd, mmap access would still
 generate a fault. So we would have different behavior. Will that be
 acceptable?

>>> The
>>> feature GET ioctl would be used to exit low power behavior and would be
>>> a no-op if the wakeup eventfd had already been signaled.  Thanks,
>>>  
>>  
>>  I will use the GET ioctl for low power exit instead of returning the
>>  current status.
> 
> Note that Yishai is proposing a device DMA dirty logging feature where
> the stop and start are exposed via SET on separate features, rather
> than SET/GET.  We should probably maintain some consistency between
> these use cases.  Possibly we might even want two separate pm enter
> ioctls, one with the wake eventfd and one without.  I think this is the
> sort of thing Jason is describing for future expansion of the dirty
> tracking uAPI.  Thanks,
> 
> Alex
> 

 Okay. So, we need to add 3 device features in total.

 VFIO_DEVICE_FEATURE_PM_ENTRY
 VFIO_DEVICE_FEATURE_PM_ENTRY_WITH_WAKEUP
 VFIO_DEVICE_FEATURE_PM_EXIT

 And only the second one needs a structure, which will have a single
 field for the eventfd, and we need to return an error if the wakeup
 eventfd is not provided for the second feature?

 Do we need to support the GET operation for these as well?
 We can skip the GET operation since that won't be very useful.

 Regards,
 Abhishek
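
 For illustration, the three features listed above could look roughly
 like the following; the feature values and the struct layout are
 hypothetical placeholders taken only from this discussion (they were
 still subject to change), not a merged uAPI:

```c
/* Hypothetical sketch of the three proposed device features; names
 * come from the discussion above, values are placeholders. */
#include <assert.h>
#include <stdint.h>

#define VFIO_DEVICE_FEATURE_PM_ENTRY			3
#define VFIO_DEVICE_FEATURE_PM_ENTRY_WITH_WAKEUP	4
#define VFIO_DEVICE_FEATURE_PM_EXIT			5

/* Only the _WITH_WAKEUP variant carries a payload: a single eventfd. */
struct vfio_device_pm_entry_with_wakeup {
	int32_t wakeup_eventfd;
};

/* Models the argument check being asked about: the wakeup variant
 * must be given a plausible eventfd descriptor, otherwise fail. */
static int check_wakeup_arg(int32_t wakeup_eventfd)
{
	if (wakeup_eventfd < 0)
		return -22;	/* -EINVAL */
	return 0;
}
```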

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call
  2022-07-08 16:49       ` Alex Williamson
@ 2022-07-11  9:50         ` Abhishek Sahu
  0 siblings, 0 replies; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-11  9:50 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/8/2022 10:19 PM, Alex Williamson wrote:
> On Fri, 8 Jul 2022 15:13:16 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 7/6/2022 9:10 PM, Alex Williamson wrote:
>>> On Fri, 1 Jul 2022 16:38:11 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> The vfio-pci based driver will have runtime power management
>>>> support where the user can put the device into the low power state
>>>> and then PCI devices can go into the D3cold state. If the device is
>>>> in the low power state and the user issues any IOCTL, then the
>>>> device should be moved out of the low power state first. Once
>>>> the IOCTL is serviced, then it can go into the low power state again.
>>>> The runtime PM framework manages this with help of usage count.
>>>>
>>>> One option was to add the runtime PM related API's inside vfio-pci
>>>> driver but some IOCTL (like VFIO_DEVICE_FEATURE) can follow a
>>>> different path and more IOCTL can be added in the future. Also, the
>>>> runtime PM will be added for vfio-pci based variant drivers currently,
>>>> but the other VFIO based drivers can use the same in the
>>>> future. So, this patch adds the runtime PM related API calls in
>>>> the top-level IOCTL function itself.
>>>>
>>>> For the VFIO drivers which do not have runtime power management
>>>> support currently, the runtime PM API's won't be invoked. Only for
>>>> vfio-pci based drivers currently, the runtime PM API's will be invoked
>>>> to increment and decrement the usage count.  
>>>
>>> Variant drivers can easily opt-out of runtime pm support by performing
>>> a gratuitous pm-get in their device-open function.
>>>    
>>
>>  Do I need to add this line in the commit message?
> 
> Maybe I misinterpreted, but my initial read was that there was some
> sort of opt-in, which there is by providing pm-runtime support in the
> driver, which vfio-pci-core does for all vfio-pci variant drivers.  But
> there's also an opt-out, where a vfio-pci variant driver might not want
> to support pm-runtime support and could accomplish that by bumping the
> pm reference count on device-open such that the user cannot trigger a
> pm-suspend.
> 
>>>> Taking this usage count incremented while servicing IOCTL will make
>>>> sure that the user won't put the device into low power state when any
>>>> other IOCTL is being serviced in parallel. Let's consider the
>>>> following scenario:
>>>>
>>>>  1. Some other IOCTL is called.
>>>>  2. The user has opened another device instance and called the power
>>>>     management IOCTL for the low power entry.
>>>>  3. The power management IOCTL moves the device into the low power state.
>>>>  4. The other IOCTL finishes.
>>>>
>>>> If we don't keep the usage count incremented then the device
>>>> access will happen between step 3 and 4 while the device has already
>>>> gone into the low power state.
>>>>
>>>> The runtime PM API's should not be invoked for
>>>> VFIO_DEVICE_FEATURE_POWER_MANAGEMENT since this IOCTL itself performs
>>>> the runtime power management entry and exit for the VFIO device.  
>>>
>>> I think the one-shot interface I proposed in the previous patch avoids
>>> the need for special handling for these feature ioctls.  Thanks,
>>>   
>>
>>  Okay. So, for low power exit case (means feature GET ioctl in the
>>  updated case) also, we will trigger eventfd. Correct?
> 
> If all ioctls are wrapped in pm-get/put, then the pm feature exit ioctl
> would wakeup and signal the eventfd via the pm-get.  I'm not sure if
> it's worthwhile to try to suppress this eventfd.  Do you foresee any
> issues?  Thanks,
> 
> Alex
> 

 I think it should be fine. It requires some extra handling on the
 hypervisor or QEMU side, which needs to skip the virtual PME or
 similar notification handling for the cases where the guest has
 actually triggered the low power exit, but that can be easily done.
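
 The usage-count wrapping that this patch is about can be sketched in
 a few lines; the fake_* helpers below are hypothetical stand-ins for
 pm_runtime_resume_and_get()/pm_runtime_put(), used only to show the
 bookkeeping, not the real vfio code:

```c
/* Model of wrapping a top-level ioctl in a runtime-PM get/put pair.
 * The counter plays the runtime PM usage count: while it is elevated,
 * the device cannot runtime-suspend underneath a concurrent ioctl. */
#include <assert.h>

static int usage_count;

static int fake_pm_get(void)
{
	usage_count++;		/* device cannot runtime-suspend now */
	return 0;		/* pm_runtime_resume_and_get() result */
}

static void fake_pm_put(void)
{
	usage_count--;		/* device may runtime-suspend again */
}

/* Top-level ioctl wrapper: wake the device, service the request,
 * then drop the reference. */
static long fake_device_ioctl(long cmd)
{
	long ret;

	if (fake_pm_get())
		return -5;	/* -EIO */
	ret = cmd;		/* stands in for the real ioctl body */
	fake_pm_put();
	return ret;
}
```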
 
 Regards,
 Abhishek

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend
  2022-07-11  9:18         ` Abhishek Sahu
@ 2022-07-11 12:57           ` Alex Williamson
  0 siblings, 0 replies; 26+ messages in thread
From: Alex Williamson @ 2022-07-11 12:57 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Mon, 11 Jul 2022 14:48:34 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/8/2022 9:15 PM, Alex Williamson wrote:
> > On Fri, 8 Jul 2022 14:51:30 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> On 7/6/2022 9:09 PM, Alex Williamson wrote:  
> >>> On Fri, 1 Jul 2022 16:38:09 +0530
> >>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >>>     
> >>>> This patch adds INTx handling during runtime suspend/resume.
> >>>> All the suspend/resume related code for the user to put the device
> >>>> into the low power state will be added in subsequent patches.
> >>>>
> >>>> The INTx are shared among devices. Whenever any INTx interrupt comes    
> >>>
> >>> "The INTx lines may be shared..."
> >>>     
> >>>> for the VFIO devices, then vfio_intx_handler() will be called for each
> >>>> device. Inside vfio_intx_handler(), it calls pci_check_and_mask_intx()    
> >>>
> >>> "...device sharing the interrupt."
> >>>     
> >>>> and checks if the interrupt has been generated for the current device.
> >>>> Now, if the device is already in the D3cold state, then the config space
> >>>> can not be read. Attempt to read config space in D3cold state can
> >>>> cause system unresponsiveness in a few systems. To prevent this, mask
> >>>> INTx in runtime suspend callback and unmask the same in runtime resume
> >>>> callback. If INTx has been already masked, then no handling is needed
> >>>> in runtime suspend/resume callbacks. 'pm_intx_masked' tracks this, and
> >>>> vfio_pci_intx_mask() has been updated to return true if INTx has been
> >>>> masked inside this function.
> >>>>
> >>>> For the runtime suspend which is triggered when there is no user of
> >>>> the VFIO device, is_intx() will return false and these callbacks won't
> >>>> anything.
> >>>>
> >>>> The MSI/MSI-X are not shared so similar handling should not be
> >>>> needed for MSI/MSI-X. vfio_msihandler() triggers eventfd_signal()
> >>>> without doing any device-specific config access. When the user performs
> >>>> any config access or IOCTL after receiving the eventfd notification,
> >>>> then the device will be moved to the D0 state first before
> >>>> servicing any request.
> >>>>
> >>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >>>> ---
> >>>>  drivers/vfio/pci/vfio_pci_core.c  | 37 +++++++++++++++++++++++++++----
> >>>>  drivers/vfio/pci/vfio_pci_intrs.c |  6 ++++-
> >>>>  include/linux/vfio_pci_core.h     |  3 ++-
> >>>>  3 files changed, 40 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> >>>> index a0d69ddaf90d..5948d930449b 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci_core.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci_core.c
> >>>> @@ -259,16 +259,45 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
> >>>>  	return ret;
> >>>>  }
> >>>>  
> >>>> +#ifdef CONFIG_PM
> >>>> +static int vfio_pci_core_runtime_suspend(struct device *dev)
> >>>> +{
> >>>> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> >>>> +
> >>>> +	/*
> >>>> +	 * If INTx is enabled, then mask INTx before going into the runtime
> >>>> +	 * suspended state and unmask the same in the runtime resume.
> >>>> +	 * If INTx has already been masked by the user, then
> >>>> +	 * vfio_pci_intx_mask() will return false and in that case, INTx
> >>>> +	 * should not be unmasked in the runtime resume.
> >>>> +	 */
> >>>> +	vdev->pm_intx_masked = (is_intx(vdev) && vfio_pci_intx_mask(vdev));
> >>>> +
> >>>> +	return 0;
> >>>> +}
> >>>> +
> >>>> +static int vfio_pci_core_runtime_resume(struct device *dev)
> >>>> +{
> >>>> +	struct vfio_pci_core_device *vdev = dev_get_drvdata(dev);
> >>>> +
> >>>> +	if (vdev->pm_intx_masked)
> >>>> +		vfio_pci_intx_unmask(vdev);
> >>>> +
> >>>> +	return 0;
> >>>> +}
> >>>> +#endif /* CONFIG_PM */
> >>>> +
> >>>>  /*
> >>>> - * The dev_pm_ops needs to be provided to make pci-driver runtime PM working,
> >>>> - * so use structure without any callbacks.
> >>>> - *
> >>>>   * The pci-driver core runtime PM routines always save the device state
> >>>>   * before going into suspended state. If the device is going into low power
> >>>>   * state with only with runtime PM ops, then no explicit handling is needed
> >>>>   * for the devices which have NoSoftRst-.
> >>>>   */
> >>>> -static const struct dev_pm_ops vfio_pci_core_pm_ops = { };
> >>>> +static const struct dev_pm_ops vfio_pci_core_pm_ops = {
> >>>> +	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
> >>>> +			   vfio_pci_core_runtime_resume,
> >>>> +			   NULL)
> >>>> +};
> >>>>  
> >>>>  int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
> >>>>  {
> >>>> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> >>>> index 6069a11fb51a..1a37db99df48 100644
> >>>> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> >>>> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> >>>> @@ -33,10 +33,12 @@ static void vfio_send_intx_eventfd(void *opaque, void *unused)
> >>>>  		eventfd_signal(vdev->ctx[0].trigger, 1);
> >>>>  }
> >>>>  
> >>>> -void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
> >>>> +/* Returns true if INTx has been masked by this function. */
> >>>> +bool vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
> >>>>  {
> >>>>  	struct pci_dev *pdev = vdev->pdev;
> >>>>  	unsigned long flags;
> >>>> +	bool intx_masked = false;
> >>>>  
> >>>>  	spin_lock_irqsave(&vdev->irqlock, flags);
> >>>>  
> >>>> @@ -60,9 +62,11 @@ void vfio_pci_intx_mask(struct vfio_pci_core_device *vdev)
> >>>>  			disable_irq_nosync(pdev->irq);
> >>>>  
> >>>>  		vdev->ctx[0].masked = true;
> >>>> +		intx_masked = true;
> >>>>  	}
> >>>>  
> >>>>  	spin_unlock_irqrestore(&vdev->irqlock, flags);
> >>>> +	return intx_masked;
> >>>>  }    
> >>>
> >>>
> >>> There's certainly another path through this function that masks the
> >>> interrupt, which makes the definition of this return value a bit
> >>> confusing.    
> >>
> >>  For our case we should not hit that path. But we can return the
> >>  intx_masked true from that path as well to align return value.
> >>  
> >>> Wouldn't it be simpler not to overload the masked flag on
> >>> the interrupt context like this and instead set a new flag on the vdev
> >>> under irqlock to indicate the device is unable to generate interrupts.
> >>> The irq handler would add a test of this flag before any tests that
> >>> would access the device.  Thanks,
> >>>
> >>> Alex
> >>>      
> >>
> >>  We will set this flag inside runtime_suspend callback but the
> >>  device can be in non-D3cold state (For example, if user has
> >>  disabled d3cold explicitly by sysfs, the D3cold is not supported in
> >>  the platform, etc.). Also, in D3cold supported case, the device will
> >>  be in D0 till the PCI core moves the device into D3cold. In this case,
> >>  there is possibility that the device can generate an interrupt.
> >>  If we add check in the IRQ handler, then we won’t check and clear
> >>  the IRQ status, but the interrupt line will still be asserted
> >>  which can cause interrupt flooding.
> >>
> >>  This was the reason for disabling interrupt itself instead of
> >>  checking flag in the IRQ handler.  
> > 
> > Ok, maybe this is largely a clarification of the return value of
> > vfio_pci_intx_mask().  I think what you're looking for is whether the
> > context mask was changed, rather than whether the interrupt is masked,
> > which I think avoids the confusion regarding whether the first branch
> > should return true or false.  So the variable should be something like
> > "masked_changed" and the comment changed to "Returns true if the INTx
> > vfio_pci_irq_ctx.masked value is changed".
> >   
> 
>  Thanks Alex.
>  I will rename the variable and update the comment.
> 
> > Testing is_intx() outside of the irqlock is potentially racy, so do we
> > need to add the pm-get/put wrappers on ioctls first to avoid the
> > possibility that pm-suspend can race a SET_IRQS ioctl?  Thanks,
> > 
> > Alex
> >   
>  
>  Even after adding this patch, the runtime suspend will not be supported
>  for the device with users. It will be supported only after patch 4 in
>  this patch series. So with this patch, the pm-suspend can be called only
>  for the cases where vfio device has no user and there we should not see
>  the race condition.

We should also not see is_intx() true for unused devices.  Thanks,

Alex
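
The suspend/resume pairing in the quoted patch, with the
"masked_changed" return-value semantics agreed above, can be modeled
as follows; the fake_* names are illustrative stand-ins for
is_intx()/vfio_pci_intx_mask()/vfio_pci_intx_unmask(), not the driver
code itself:

```c
/* Sketch of the INTx mask/unmask pairing across runtime suspend and
 * resume.  A user-initiated mask must survive a suspend/resume cycle,
 * so resume only unmasks when suspend was the one that masked. */
#include <assert.h>
#include <stdbool.h>

struct fake_pci_dev {
	bool intx_enabled;	/* is_intx() */
	bool ctx_masked;	/* vfio_pci_irq_ctx.masked */
	bool pm_intx_masked;	/* set only by runtime suspend */
};

/* Per the rename discussed above: returns true only if the masked
 * state was changed by this call. */
static bool fake_intx_mask(struct fake_pci_dev *v)
{
	if (v->ctx_masked)
		return false;	/* already masked by the user */
	v->ctx_masked = true;
	return true;
}

static void fake_runtime_suspend(struct fake_pci_dev *v)
{
	v->pm_intx_masked = v->intx_enabled && fake_intx_mask(v);
}

static void fake_runtime_resume(struct fake_pci_dev *v)
{
	if (v->pm_intx_masked)
		v->ctx_masked = false;	/* vfio_pci_intx_unmask() */
}
```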


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/6] vfio: Add a new device feature for the power management
  2022-07-11  9:43         ` Abhishek Sahu
@ 2022-07-11 13:04           ` Alex Williamson
  2022-07-11 17:30             ` Abhishek Sahu
  0 siblings, 1 reply; 26+ messages in thread
From: Alex Williamson @ 2022-07-11 13:04 UTC (permalink / raw)
  To: Abhishek Sahu
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On Mon, 11 Jul 2022 15:13:13 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:

> On 7/8/2022 10:06 PM, Alex Williamson wrote:
> > On Fri, 8 Jul 2022 15:09:22 +0530
> > Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >   
> >> On 7/6/2022 9:09 PM, Alex Williamson wrote:  
> >>> On Fri, 1 Jul 2022 16:38:10 +0530
> >>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> >>>     
> >>>> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
> >>>> for the power management in the header file. The implementation for the
> >>>> same will be added in the subsequent patches.
> >>>>
> >>>> With the standard registers, all power states cannot be achieved. The
> >>>> platform-based power management needs to be involved to go into the
> >>>> lowest power state. For all the platform-based power management, this
> >>>> device feature can be used.
> >>>>
> >>>> This device feature uses flags to specify the different operations. In
> >>>> the future, if any more power management functionality is needed then
> >>>> a new flag can be added to it. It supports both GET and SET operations.
> >>>>
> >>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> >>>> ---
> >>>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
> >>>>  1 file changed, 55 insertions(+)
> >>>>
> >>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>>> index 733a1cddde30..7e00de5c21ea 100644
> >>>> --- a/include/uapi/linux/vfio.h
> >>>> +++ b/include/uapi/linux/vfio.h
> >>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
> >>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
> >>>>  };
> >>>>  
> >>>> +/*
> >>>> + * Perform power management-related operations for the VFIO device.
> >>>> + *
> >>>> + * The low power feature uses platform-based power management to move the
> >>>> + * device into the low power state.  This low power state is device-specific.
> >>>> + *
> >>>> + * This device feature uses flags to specify the different operations.
> >>>> + * It supports both the GET and SET operations.
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
> >>>> + *   state with platform-based power management.  This low power state will be
> >>>> + *   internal to the VFIO driver and the user will not come to know which power
> >>>> + *   state is chosen.  Once the user has moved the VFIO device into the low
> >>>> + *   power state, then the user should not do any device access without moving
> >>>> + *   the device out of the low power state.    
> >>>
> >>> Except we're wrapping device accesses to make this possible.  This
> >>> should probably describe how any discrete access will wake the device
> >>> but ongoing access through mmaps will generate user faults.
> >>>     
> >>
> >>  Sure. I will add that details also.
> >>  
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
> >>>> + *    state.  This flag should only be set if the user has previously put the
> >>>> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.    
> >>>
> >>> Indenting.
> >>>     
> >>  
> >>  I will fix this.
> >>  
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
> >>>> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
> >>>> + *   the host side, then the device will be moved out of the low power state
> >>>> + *   without the user's guest driver involvement.  Some devices require the
> >>>> + *   user's guest driver involvement for each low-power entry.  If this flag is
> >>>> + *   set, then the re-entry to the low power state will be disabled, and the
> >>>> + *   host kernel will not move the device again into the low power state.
> >>>> + *   The VFIO driver internally maintains a list of devices for which low
> >>>> + *   power re-entry is disabled by default and for those devices, the
> >>>> + *   re-entry will be disabled even if the user has not set this flag
> >>>> + *   explicitly.    
> >>>
> >>> Wrong polarity.  The kernel should not maintain the policy.  By default
> >>> every wakeup, whether from host kernel accesses or via user accesses
> >>> that do a pm-get should signal a wakeup to userspace.  Userspace needs
> >>> to opt-out of that wakeup to let the kernel automatically re-enter low
> >>> power and userspace needs to maintain the policy for which devices it
> >>> wants that to occur.
> >>>     
> >>  
> >>  Okay. So that means, in the kernel side, we don’t have to maintain
> >>  the list which currently contains NVIDIA device ID. Also, in our
> >>  updated approach, this opt-out of that wake-up means that user
> >>  has not provided eventfd in the feature SET ioctl. Correct ?  
> > 
> > Yes, I'm imagining that if the user hasn't provided a one-shot wake-up
> > eventfd, that's the opt-out for being notified of device wakes.  For
> > example, pm-resume would have something like:
> > 
> > 	
> > 	if (vdev->pm_wake_eventfd) {
> > 		eventfd_signal(vdev->pm_wake_eventfd, 1);
> > 		vdev->pm_wake_eventfd = NULL;
> > 		pm_runtime_get_noresume(dev);
> > 	}
> > 
> > (eventfd pseudo handling substantially simplified)
> > 
> > So w/o a wake-up eventfd, the user would need to call the pm feature
> > exit ioctl to elevate the pm reference to prevent it going back to low
> > power.  The pm feature exit ioctl would be optional if a wake eventfd is
> > provided, so some piece of the eventfd context would need to remain to
> > determine whether a pm-get is necessary.
> >   
> >>>> + *
> >>>> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
> >>>> + *
> >>>> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
> >>>> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
> >>>> + *
> >>>> + * - If the device is in a normal power state currently, then
> >>>> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
> >>>> + *   power re-entry is disabled by default.  If the device is in the low power
> >>>> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
> >>>> + *   according to the current transition.    
> >>>
> >>> Very confusing semantics.
> >>>
> >>> What if the feature SET ioctl took an eventfd and that eventfd was one
> >>> time use.  Calling the ioctl would setup the eventfd to notify the user
> >>> on wakeup and call pm-put.  Any access to the device via host, ioctl,
> >>> or region would be wrapped in pm-get/put and the pm-resume handler
> >>> would perform the matching pm-get to balance the feature SET and signal
> >>> the eventfd.     
> >>
> >>  This seems a better option. It will help in making the ioctl simpler
> >>  and we don’t have to add a separate index for PME which I added in
> >>  patch 6. 
> >>  
> >>> If the user opts-out by not providing a wakeup eventfd,
> >>> then the pm-resume handler does not perform a pm-get. Possibly we
> >>> could even allow mmap access if a wake-up eventfd is provided.    
> >>
> >>  Sorry. I am not clear on this mmap part. We currently invalidates
> >>  mapping before going into runtime-suspend. Now, if use tries do
> >>  mmap then do we need some extra handling in the fault handler ?
> >>  Need your help in understanding this part.  
> > 
> > The option that I'm thinking about is if the mmap fault handler is
> > wrapped in a pm-get/put then we could actually populate the mmap.  In
> > the case where the pm-get triggers the wake-eventfd in pm-resume, the
> > device doesn't return to low power when the mmap fault handler calls
> > pm-put.  This possibly allows that we could actually invalidate mmaps on
> > pm-suspend rather than in the pm feature enter ioctl, essentially the
> > same as we're doing for intx.  I wonder though if this allows the
> > possibility that we just bounce between mmap fault and pm-suspend.  So
> > long as some work can be done, for instance the pm-suspend occurs
> > asynchronously to the pm-put, this might be ok.
> >   
> 
>  We can do this. But in the normal use case, the situation should
>  never arise where user should access any mmaped region when user has
>  already put the device into D3 (D3hot or D3cold). This can only happen
>  if there is some bug in the guest driver or user is doing wrong
>  sequence. Do we need to add handling to officially support this part ?

We cannot rely on userspace drivers to be bug free or non-malicious,
but if we want to impose that an mmap access while low power is
enabled always triggers a fault, that's ok.

>  pm-get can take more than a second for resume for some devices and
>  will doing this in fault handler be safe ?
> 
>  Also, we will add this support only when wake-eventfd is provided so
>  still w/o wake-eventfd case, the mmap access will still generate fault.
>  So, we will have different behavior. Will that be acceptable ?

Let's keep it simple, generate a fault for all cases.
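
The always-fault rule settled on here reduces to a one-line check in
the fault path; fake_mmap_fault() and the pm_runtime_engaged flag are
illustrative stand-ins for the eventual driver logic:

```c
/* Model of the conclusion above: while user-initiated low power is
 * engaged, an mmap access faults rather than waking the device;
 * otherwise the mapping is populated normally. */
#include <assert.h>
#include <stdbool.h>

enum fake_fault { FAKE_FAULT_OK = 0, FAKE_FAULT_SIGBUS = 1 };

static enum fake_fault fake_mmap_fault(bool pm_runtime_engaged)
{
	if (pm_runtime_engaged)
		return FAKE_FAULT_SIGBUS;	/* no implicit wakeup */
	return FAKE_FAULT_OK;			/* populate the mapping */
}
```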

> >>> The
> >>> feature GET ioctl would be used to exit low power behavior and would be
> >>> a no-op if the wakeup eventfd had already been signaled.  Thanks,
> >>>    
> >>  
> >>  I will use the GET ioctl for low power exit instead of returning the
> >>  current status.  
> > 
> > Note that Yishai is proposing a device DMA dirty logging feature where
> > the stop and start are exposed via SET on separate features, rather
> > than SET/GET.  We should probably maintain some consistency between
> > these use cases.  Possibly we might even want two separate pm enter
> > ioctls, one with the wake eventfd and one without.  I think this is the
> > sort of thing Jason is describing for future expansion of the dirty
> > tracking uAPI.  Thanks,
> > 
> > Alex
> >   
> 
>  Okay. So, we need to add 3 device features in total.
> 
>  VFIO_DEVICE_FEATURE_PM_ENTRY
>  VFIO_DEVICE_FEATURE_PM_ENTRY_WITH_WAKEUP
>  VFIO_DEVICE_FEATURE_PM_EXIT
> 
>  And only the second one need structure which will have only one field
>  for eventfd and we need to return error if wakeup-eventfd is not
>  provided in the second feature ?

Yes, we'd use eventfd_ctx and fail on a bad fileget.

>  Do we need to support GET operation also for these ?
>  We can skip GET operation since that won’t be very useful.

What would they do?  Thanks,

Alex


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v4 2/6] vfio: Add a new device feature for the power management
  2022-07-11 13:04           ` Alex Williamson
@ 2022-07-11 17:30             ` Abhishek Sahu
  0 siblings, 0 replies; 26+ messages in thread
From: Abhishek Sahu @ 2022-07-11 17:30 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Cornelia Huck, Yishai Hadas, Jason Gunthorpe, Shameer Kolothum,
	Kevin Tian, Rafael J . Wysocki, Max Gurtovoy, Bjorn Helgaas,
	linux-kernel, kvm, linux-pm, linux-pci

On 7/11/2022 6:34 PM, Alex Williamson wrote:
> On Mon, 11 Jul 2022 15:13:13 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 7/8/2022 10:06 PM, Alex Williamson wrote:
>>> On Fri, 8 Jul 2022 15:09:22 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> On 7/6/2022 9:09 PM, Alex Williamson wrote:  
>>>>> On Fri, 1 Jul 2022 16:38:10 +0530
>>>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>>>     
>>>>>> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
>>>>>> for the power management in the header file. The implementation for the
>>>>>> same will be added in the subsequent patches.
>>>>>>
>>>>>> With the standard registers, all power states cannot be achieved. The
>>>>>> platform-based power management needs to be involved to go into the
>>>>>> lowest power state. For all the platform-based power management, this
>>>>>> device feature can be used.
>>>>>>
>>>>>> This device feature uses flags to specify the different operations. In
>>>>>> the future, if any more power management functionality is needed then
>>>>>> a new flag can be added to it. It supports both GET and SET operations.
>>>>>>
>>>>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>>>>>> ---
>>>>>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>>>>>>  1 file changed, 55 insertions(+)
>>>>>>
>>>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>>> index 733a1cddde30..7e00de5c21ea 100644
>>>>>> --- a/include/uapi/linux/vfio.h
>>>>>> +++ b/include/uapi/linux/vfio.h
>>>>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>>>>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>>>>>  };
>>>>>>  
>>>>>> +/*
>>>>>> + * Perform power management-related operations for the VFIO device.
>>>>>> + *
>>>>>> + * The low power feature uses platform-based power management to move the
>>>>>> + * device into the low power state.  This low power state is device-specific.
>>>>>> + *
>>>>>> + * This device feature uses flags to specify the different operations.
>>>>>> + * It supports both the GET and SET operations.
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
>>>>>> + *   state with platform-based power management.  This low power state will be
>>>>>> + *   internal to the VFIO driver and the user will not come to know which power
>>>>>> + *   state is chosen.  Once the user has moved the VFIO device into the low
>>>>>> + *   power state, then the user should not do any device access without moving
>>>>>> + *   the device out of the low power state.    
>>>>>
>>>>> Except we're wrapping device accesses to make this possible.  This
>>>>> should probably describe how any discrete access will wake the device
>>>>> but ongoing access through mmaps will generate user faults.
>>>>>     
>>>>
>>>>  Sure. I will add that details also.
>>>>  
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
>>>>>> + *    state.  This flag should only be set if the user has previously put the
>>>>>> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.    
>>>>>
>>>>> Indenting.
>>>>>     
>>>>  
>>>>  I will fix this.
>>>>  
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
>>>>>> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
>>>>>> + *   the host side, then the device will be moved out of the low power state
>>>>>> + *   without the user's guest driver involvement.  Some devices require the
>>>>>> + *   user's guest driver involvement for each low-power entry.  If this flag is
>>>>>> + *   set, then the re-entry to the low power state will be disabled, and the
>>>>>> + *   host kernel will not move the device again into the low power state.
>>>>>> + *   The VFIO driver internally maintains a list of devices for which low
>>>>>> + *   power re-entry is disabled by default and for those devices, the
>>>>>> + *   re-entry will be disabled even if the user has not set this flag
>>>>>> + *   explicitly.    
>>>>>
>>>>> Wrong polarity.  The kernel should not maintain the policy.  By default
>>>>> every wakeup, whether from host kernel accesses or via user accesses
>>>>> that do a pm-get should signal a wakeup to userspace.  Userspace needs
>>>>> to opt-out of that wakeup to let the kernel automatically re-enter low
>>>>> power and userspace needs to maintain the policy for which devices it
>>>>> wants that to occur.
>>>>>     
>>>>  
>>>>  Okay. So that means, in the kernel side, we don’t have to maintain
>>>>  the list which currently contains NVIDIA device ID. Also, in our
>>>>  updated approach, this opt-out of that wake-up means that user
>>>>  has not provided eventfd in the feature SET ioctl. Correct ?  
>>>
>>> Yes, I'm imagining that if the user hasn't provided a one-shot wake-up
>>> eventfd, that's the opt-out for being notified of device wakes.  For
>>> example, pm-resume would have something like:
>>>
>>> 	
>>> 	if (vdev->pm_wake_eventfd) {
>>> 		eventfd_signal(vdev->pm_wake_eventfd, 1);
>>> 		vdev->pm_wake_eventfd = NULL;
>>> 		pm_runtime_get_noresume(dev);
>>> 	}
>>>
>>> (eventfd pseudo handling substantially simplified)
>>>
>>> So w/o a wake-up eventfd, the user would need to call the pm feature
>>> exit ioctl to elevate the pm reference to prevent it going back to low
>>> power.  The pm feature exit ioctl would be optional if a wake eventfd is
>>> provided, so some piece of the eventfd context would need to remain to
>>> determine whether a pm-get is necessary.
>>>   
>>>>>> + *
>>>>>> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
>>>>>> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
>>>>>> + *
>>>>>> + * - If the device is in a normal power state currently, then
>>>>>> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
>>>>>> + *   power re-entry is disabled by default.  If the device is in the low power
>>>>>> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
>>>>>> + *   according to the current transition.    
>>>>>
>>>>> Very confusing semantics.
>>>>>
>>>>> What if the feature SET ioctl took an eventfd and that eventfd was one
>>>>> time use.  Calling the ioctl would setup the eventfd to notify the user
>>>>> on wakeup and call pm-put.  Any access to the device via host, ioctl,
>>>>> or region would be wrapped in pm-get/put and the pm-resume handler
>>>>> would perform the matching pm-get to balance the feature SET and signal
>>>>> the eventfd.     
>>>>
>>>>  This seems a better option. It will help in making the ioctl simpler
>>>>  and we don’t have to add a separate index for PME which I added in
>>>>  patch 6. 
>>>>  
>>>>> If the user opts-out by not providing a wakeup eventfd,
>>>>> then the pm-resume handler does not perform a pm-get. Possibly we
>>>>> could even allow mmap access if a wake-up eventfd is provided.    
>>>>
>>>>  Sorry, I am not clear on this mmap part. We currently invalidate
>>>>  the mappings before going into runtime suspend. Now, if the user
>>>>  tries to do mmap, do we need some extra handling in the fault
>>>>  handler? I need your help in understanding this part.  
>>>
>>> The option that I'm thinking about is if the mmap fault handler is
>>> wrapped in a pm-get/put then we could actually populate the mmap.  In
>>> the case where the pm-get triggers the wake-eventfd in pm-resume, the
>>> device doesn't return to low power when the mmap fault handler calls
>>> pm-put.  This possibly allows that we could actually invalidate mmaps on
>>> pm-suspend rather than in the pm feature enter ioctl, essentially the
>>> same as we're doing for intx.  I wonder though if this allows the
>>> possibility that we just bounce between mmap fault and pm-suspend.  So
>>> long as some work can be done, for instance the pm-suspend occurs
>>> asynchronously to the pm-put, this might be ok.
>>>   
>>
>>  We can do this. But in the normal use case, the situation should
>>  never arise where the user accesses any mmapped region after having
>>  already put the device into D3 (D3hot or D3cold). This can only
>>  happen if there is some bug in the guest driver or the user is
>>  performing a wrong sequence. Do we need to add handling to
>>  officially support this part?
> 
> We cannot rely on userspace drivers to be bug free or non-malicious,
> but if we want to impose that an mmap access while low power is
> enabled always triggers a fault, that's ok.
> 
>>  pm-get can take more than a second to resume some devices; will
>>  doing this in the fault handler be safe?
>>
>>  Also, we will add this support only when a wake-eventfd is provided,
>>  so in the case w/o a wake-eventfd, mmap access will still generate a
>>  fault. So we will have different behaviors. Will that be acceptable?
> 
> Let's keep it simple, generate a fault for all cases.
> 

 Thanks, Alex, for the confirmation.

>>>>> The
>>>>> feature GET ioctl would be used to exit low power behavior and would be
>>>>> a no-op if the wakeup eventfd had already been signaled.  Thanks,
>>>>>    
>>>>  
>>>>  I will use the GET ioctl for low power exit instead of returning the
>>>>  current status.  
>>>
>>> Note that Yishai is proposing a device DMA dirty logging feature where
>>> the stop and start are exposed via SET on separate features, rather
>>> than SET/GET.  We should probably maintain some consistency between
>>> these use cases.  Possibly we might even want two separate pm enter
>>> ioctls, one with the wake eventfd and one without.  I think this is the
>>> sort of thing Jason is describing for future expansion of the dirty
>>> tracking uAPI.  Thanks,
>>>
>>> Alex
>>>   
>>
>>  Okay. So, we need to add 3 device features in total.
>>
>>  VFIO_DEVICE_FEATURE_PM_ENTRY
>>  VFIO_DEVICE_FEATURE_PM_ENTRY_WITH_WAKEUP
>>  VFIO_DEVICE_FEATURE_PM_EXIT
>>
>>  And only the second one needs a structure, which will have only one
>>  field for the eventfd, and we need to return an error if the wakeup
>>  eventfd is not provided for the second feature?
> 
> Yes, we'd use eventfd_ctx and fail on a bad fileget.
> 
>>  Do we need to support GET operation also for these ?
>>  We can skip GET operation since that won’t be very useful.
> 
> What would they do?  Thanks,
> 
> Alex
> 

 If we implement the GET operation, it can return the current status.
 For example, VFIO_DEVICE_FEATURE_PM_ENTRY could return whether the
 user has previously put the device into low power. But this
 information is not very useful as such, and it requires adding a
 structure where this information would be filled in. Also, the GET
 would again cause the device to wake up. So, for these device
 features, we can support only the SET operation.

 I checked Yishai's DMA logging patches; there, start and stop seem to
 support only the SET operation, and a separate feature supports only
 the GET operation.

 Regards,
 Abhishek

Thread overview: 26+ messages
2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
2022-07-06 15:39   ` Alex Williamson
2022-07-08  9:21     ` Abhishek Sahu
2022-07-08 15:45       ` Alex Williamson
2022-07-11  9:18         ` Abhishek Sahu
2022-07-11 12:57           ` Alex Williamson
2022-07-01 11:08 ` [PATCH v4 2/6] vfio: Add a new device feature for the power management Abhishek Sahu
2022-07-06 15:39   ` Alex Williamson
2022-07-08  9:39     ` Abhishek Sahu
2022-07-08 16:36       ` Alex Williamson
2022-07-11  9:43         ` Abhishek Sahu
2022-07-11 13:04           ` Alex Williamson
2022-07-11 17:30             ` Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson
2022-07-08  9:43     ` Abhishek Sahu
2022-07-08 16:49       ` Alex Williamson
2022-07-11  9:50         ` Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson
2022-07-01 11:08 ` [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson
2022-07-01 11:08 ` [PATCH v4 6/6] vfio/pci: Add support for virtual PME Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson
2022-07-08  9:45     ` Abhishek Sahu
