linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: <abhsahu@nvidia.com>
To: <kvm@vger.kernel.org>,
	Alex Williamson <alex.williamson@redhat.com>,
	Cornelia Huck <cohuck@redhat.com>
Cc: Max Gurtovoy <mgurtovoy@nvidia.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Zhen Lei <thunder.leizhen@huawei.com>,
	Jason Gunthorpe <jgg@nvidia.com>, <linux-kernel@vger.kernel.org>,
	Abhishek Sahu <abhsahu@nvidia.com>
Subject: [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state
Date: Mon, 15 Nov 2021 19:06:40 +0530	[thread overview]
Message-ID: <20211115133640.2231-4-abhsahu@nvidia.com> (raw)
In-Reply-To: <20211115133640.2231-1-abhsahu@nvidia.com>

From: Abhishek Sahu <abhsahu@nvidia.com>

Currently, if the runtime power management is enabled for vfio-pci
device in guest OS, then guest OS will do the register write for
PCI_PM_CTRL register. This write request will be handled in
vfio_pm_config_write() where it will do the actual register write of
PCI_PM_CTRL register. With this, the maximum D3hot state can be
achieved for low power. If we can use the runtime PM framework, then
we can achieve the D3cold state which will help in saving
maximum power.

This patch uses runtime PM framework whenever vfio-pci device will
be put in the low power state.

1. If runtime PM is enabled, then instead of directly writing
   PCI_PM_CTRL register, decrement the device usage counter whenever
   the power state is non-D0. The kernel runtime PM framework will
   itself put the PCI device in low power state when device usage
   counter will become zero. Similarly, when the power state will be
   again changed back to D0, then increment the device usage counter
   and the kernel runtime PM framework will itself bring the PCI device
   out of low power state.

2. The guest OS will read the PCI_PM_CTRL register back to
   confirm the current power state so virtual register bits can be
   used. For this, before decrementing the usage count, read the actual
   PCI_PM_CTRL register, update the power state related bits, and then
   update the vconfig bits corresponding to PCI_PM_CTRL offset. For
   PCI_PM_CTRL register read, return the virtual value if runtime PM is
   requested. This vconfig bits will be cleared when the power state
   will be changed back to D0.

3. For the guest OS, the PCI power state will be still D3hot
   even if put the actual PCI device into D3cold state. In the D3hot
   state, the config space can still be read. So, if there is any request
   from guest OS to read the config space, then we need to move the actual
   PCI device again back to D0. For this, increment the device usage
   count before reading/writing the config space and then decrement it
   again after reading/writing the config space. There can be
   back-to-back config register read/write request, but since the auto
   suspend methods are being used here so only first access will
   wake up the PCI device and for the remaining access, the device will
   already be active.

4. This above-mentioned wake up is not needed if the register
   read/write is done completely with virtual bits. For handling this
   condition, the actual resume of device will only being done in
   vfio_user_config_read()/vfio_user_config_write(). All the config
   register access will come vfio_pci_config_rw(). So, in this
   function, use the pm_runtime_get_noresume() so that only usage count
   will be incremented without resuming the device. Inside,
   vfio_user_config_read()/vfio_user_config_write(), use the routines
   with pm_runtime_put_noidle() so that the PCI device won’t be
   suspended in the lower level functions. Again in the top level
   vfio_pci_config_rw(), use the pm_runtime_put_autosuspend(). Now the
   auto suspend timer will be started and the device will be suspended
   again. If the device is already runtime suspended, then this routine
   will return early.

5. In the host side D3cold will only be used if the platform has
   support for this, otherwise some other state will be used. The
   config space can be read if the device is not in D3cold state. So in
   this case, we can skip the resuming of PCI device. The wrapper
   function vfio_pci_config_pm_runtime_get() takes care of this
   condition and invoke the pm_runtime_resume() only if current power
   state is D3cold.

6. For vfio_pci_config_pm_runtime_get()/vfio_
   pci_config_pm_runtime_put(), the reference code is taken from
   pci_config_pm_runtime_get()/pci_config_pm_runtime_put() and then it
   is modified according to vfio-pci driver requirement.

7. vfio_pci_set_power_state() will be unused after moving to runtime
   PM, so this function can be removed along with other related
   functions and structure fields.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 178 ++++++++++++++++++++++++++---
 drivers/vfio/pci/vfio_pci_core.c   |  64 ++---------
 include/linux/vfio_pci_core.h      |   3 +-
 3 files changed, 174 insertions(+), 71 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index fb3a503a5b99..5576eb4308b4 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -25,6 +25,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/slab.h>
+#include <linux/pm_runtime.h>
 
 #include <linux/vfio_pci_core.h>
 
@@ -119,12 +120,51 @@ struct perm_bits {
 #define	NO_WRITE	0
 #define	ALL_WRITE	0xFFFFFFFFU
 
-static int vfio_user_config_read(struct pci_dev *pdev, int offset,
+static void vfio_pci_config_pm_runtime_get(struct pci_dev *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct device *parent = dev->parent;
+
+	if (parent)
+		pm_runtime_get_sync(parent);
+
+	pm_runtime_get_noresume(dev);
+	/*
+	 * pdev->current_state is set to PCI_D3cold during suspending,
+	 * so wait until suspending completes
+	 */
+	pm_runtime_barrier(dev);
+	/*
+	 * Only need to resume devices in D3cold, because config
+	 * registers are still accessible for devices suspended but
+	 * not in D3cold.
+	 */
+	if (pdev->current_state == PCI_D3cold)
+		pm_runtime_resume(dev);
+}
+
+static void vfio_pci_config_pm_runtime_put(struct pci_dev *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct device *parent = dev->parent;
+
+	pm_runtime_mark_last_busy(dev);
+	pm_runtime_put_noidle(dev);
+
+	if (parent)
+		pm_runtime_put_noidle(parent);
+}
+
+static int vfio_user_config_read(struct vfio_pci_core_device *vdev, int offset,
 				 __le32 *val, int count)
 {
+	struct pci_dev *pdev = vdev->pdev;
 	int ret = -EINVAL;
 	u32 tmp_val = 0;
 
+	if (vdev->pm_runtime_suspended)
+		vfio_pci_config_pm_runtime_get(pdev);
+
 	switch (count) {
 	case 1:
 	{
@@ -147,15 +187,22 @@ static int vfio_user_config_read(struct pci_dev *pdev, int offset,
 
 	*val = cpu_to_le32(tmp_val);
 
+	if (vdev->pm_runtime_suspended)
+		vfio_pci_config_pm_runtime_put(pdev);
+
 	return ret;
 }
 
-static int vfio_user_config_write(struct pci_dev *pdev, int offset,
+static int vfio_user_config_write(struct vfio_pci_core_device *vdev, int offset,
 				  __le32 val, int count)
 {
+	struct pci_dev *pdev = vdev->pdev;
 	int ret = -EINVAL;
 	u32 tmp_val = le32_to_cpu(val);
 
+	if (vdev->pm_runtime_suspended)
+		vfio_pci_config_pm_runtime_get(pdev);
+
 	switch (count) {
 	case 1:
 		ret = pci_user_write_config_byte(pdev, offset, tmp_val);
@@ -168,6 +215,9 @@ static int vfio_user_config_write(struct pci_dev *pdev, int offset,
 		break;
 	}
 
+	if (vdev->pm_runtime_suspended)
+		vfio_pci_config_pm_runtime_put(pdev);
+
 	return ret;
 }
 
@@ -183,11 +233,10 @@ static int vfio_default_config_read(struct vfio_pci_core_device *vdev, int pos,
 
 	/* Any non-virtualized bits? */
 	if (cpu_to_le32(~0U >> (32 - (count * 8))) != virt) {
-		struct pci_dev *pdev = vdev->pdev;
 		__le32 phys_val = 0;
 		int ret;
 
-		ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+		ret = vfio_user_config_read(vdev, pos, &phys_val, count);
 		if (ret)
 			return ret;
 
@@ -224,18 +273,17 @@ static int vfio_default_config_write(struct vfio_pci_core_device *vdev, int pos,
 
 	/* Non-virtualzed and writable bits go to hardware */
 	if (write & ~virt) {
-		struct pci_dev *pdev = vdev->pdev;
 		__le32 phys_val = 0;
 		int ret;
 
-		ret = vfio_user_config_read(pdev, pos, &phys_val, count);
+		ret = vfio_user_config_read(vdev, pos, &phys_val, count);
 		if (ret)
 			return ret;
 
 		phys_val &= ~(write & ~virt);
 		phys_val |= (val & (write & ~virt));
 
-		ret = vfio_user_config_write(pdev, pos, phys_val, count);
+		ret = vfio_user_config_write(vdev, pos, phys_val, count);
 		if (ret)
 			return ret;
 	}
@@ -250,7 +298,7 @@ static int vfio_direct_config_read(struct vfio_pci_core_device *vdev, int pos,
 {
 	int ret;
 
-	ret = vfio_user_config_read(vdev->pdev, pos, val, count);
+	ret = vfio_user_config_read(vdev, pos, val, count);
 	if (ret)
 		return ret;
 
@@ -275,7 +323,7 @@ static int vfio_raw_config_write(struct vfio_pci_core_device *vdev, int pos,
 {
 	int ret;
 
-	ret = vfio_user_config_write(vdev->pdev, pos, val, count);
+	ret = vfio_user_config_write(vdev, pos, val, count);
 	if (ret)
 		return ret;
 
@@ -288,7 +336,7 @@ static int vfio_raw_config_read(struct vfio_pci_core_device *vdev, int pos,
 {
 	int ret;
 
-	ret = vfio_user_config_read(vdev->pdev, pos, val, count);
+	ret = vfio_user_config_read(vdev, pos, val, count);
 	if (ret)
 		return ret;
 
@@ -692,13 +740,86 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
 	return 0;
 }
 
+static int vfio_perform_runtime_pm(struct vfio_pci_core_device *vdev, int pos,
+				   int count, struct perm_bits *perm,
+				   int offset, __le32 val, pci_power_t state)
+{
+	/*
+	 * If runtime PM is enabled, then instead of directly writing
+	 * PCI_PM_CTRL register, decrement the device usage counter whenever
+	 * the power state is non-D0. The kernel runtime PM framework will
+	 * itself put the PCI device in the low power state when device usage
+	 * counter will become zero. The guest OS will read the PCI_PM_CTRL
+	 * register back to confirm the current power state so virtual
+	 * register bits can be used. For this, read the actual PCI_PM_CTRL
+	 * register, update the power state related bits and then update the
+	 * vconfig bits corresponding to PCI_PM_CTRL offset. If the
+	 * pm_runtime_suspended status is set, then return the virtual
+	 * register value for PCI_PM_CTRL register read. All the bits
+	 * in PCI_PM_CTRL are being returned from virtual config, so that
+	 * this register read will not wake up the PCI device from suspended
+	 * state.
+	 *
+	 * Once power state will be changed back to D0, then clear the power
+	 * state related bits in vconfig. After this, increment the device
+	 * usage counter which will internally wake up the PCI device from
+	 * suspended state.
+	 */
+	if (state != PCI_D0 && !vdev->pm_runtime_suspended) {
+		__le32 virt_val = 0;
+
+		count = vfio_default_config_write(vdev, pos, count, perm,
+						  offset, val);
+		if (count < 0)
+			return count;
+
+		vfio_default_config_read(vdev, pos, 4, perm, offset, &virt_val);
+		virt_val &= ~cpu_to_le32(PCI_PM_CTRL_STATE_MASK);
+		virt_val |= (val & cpu_to_le32(PCI_PM_CTRL_STATE_MASK));
+		memcpy(vdev->vconfig + pos, &virt_val, 4);
+		vdev->pm_runtime_suspended = true;
+		pm_runtime_mark_last_busy(&vdev->pdev->dev);
+		pm_runtime_put_autosuspend(&vdev->pdev->dev);
+		return count;
+	}
+
+	if (vdev->pm_runtime_suspended && state == PCI_D0) {
+		vdev->pm_runtime_suspended = false;
+		*(__le16 *)&vdev->vconfig[pos] &=
+			~cpu_to_le16(PCI_PM_CTRL_STATE_MASK);
+		pm_runtime_get_sync(&vdev->pdev->dev);
+	}
+
+	return vfio_default_config_write(vdev, pos, count, perm, offset, val);
+}
+
+static int vfio_pm_config_read(struct vfio_pci_core_device *vdev, int pos,
+			       int count, struct perm_bits *perm,
+			       int offset, __le32 *val)
+{
+	/*
+	 * If pm_runtime_suspended status is set, then return the virtual
+	 * register value for PCI_PM_CTRL register read.
+	 */
+	if (vdev->pm_runtime_suspended &&
+	    offset >= PCI_PM_CTRL && offset < (PCI_PM_CTRL + 4)) {
+		memcpy(val, vdev->vconfig + pos, count);
+		return count;
+	}
+
+	return vfio_default_config_read(vdev, pos, count, perm, offset, val);
+}
+
 static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
 				int count, struct perm_bits *perm,
 				int offset, __le32 val)
 {
-	count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
-	if (count < 0)
-		return count;
+	if (offset != PCI_PM_CTRL) {
+		count = vfio_default_config_write(vdev, pos, count, perm,
+						  offset, val);
+		if (count < 0)
+			return count;
+	}
 
 	if (offset == PCI_PM_CTRL) {
 		pci_power_t state;
@@ -718,7 +839,8 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
 			break;
 		}
 
-		vfio_pci_set_power_state(vdev, state);
+		count = vfio_perform_runtime_pm(vdev, pos, count, perm,
+						offset, val, state);
 	}
 
 	return count;
@@ -731,6 +853,7 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
 		return -ENOMEM;
 
 	perm->writefn = vfio_pm_config_write;
+	perm->readfn = vfio_pm_config_read;
 
 	/*
 	 * We always virtualize the next field so we can remove
@@ -1921,13 +2044,31 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 	size_t done = 0;
 	int ret = 0;
 	loff_t pos = *ppos;
+	bool runtime_put_required = false;
 
 	pos &= VFIO_PCI_OFFSET_MASK;
 
+	/*
+	 * For virtualized bits read/write, the device should not be resumed.
+	 * Increment the device usage count alone so that the device won't be
+	 * runtime suspended during config read/write.
+	 */
+	if (vdev->pm_runtime_suspended) {
+		pm_runtime_get_noresume(&vdev->pdev->dev);
+		runtime_put_required = true;
+	}
+
 	while (count) {
 		ret = vfio_config_do_rw(vdev, buf, count, &pos, iswrite);
-		if (ret < 0)
+		if (ret < 0) {
+			/*
+			 * Decrement the device usage counter corresponding to
+			 * previous pm_runtime_get_noresume().
+			 */
+			if (runtime_put_required)
+				pm_runtime_put_autosuspend(&vdev->pdev->dev);
 			return ret;
+		}
 
 		count -= ret;
 		done += ret;
@@ -1937,5 +2078,12 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
 
 	*ppos += done;
 
+	/*
+	 * Decrement the device usage counter corresponding to previous
+	 * pm_runtime_get_noresume().
+	 */
+	if (runtime_put_required)
+		pm_runtime_put_autosuspend(&vdev->pdev->dev);
+
 	return done;
 }
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 511a52e92b32..79fa86914b6c 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -187,57 +187,6 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
 	return false;
 }
 
-static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	u16 pmcsr;
-
-	if (!pdev->pm_cap)
-		return;
-
-	pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
-
-	vdev->needs_pm_restore = !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET);
-}
-
-/*
- * pci_set_power_state() wrapper handling devices which perform a soft reset on
- * D3->D0 transition.  Save state prior to D0/1/2->D3, stash it on the vdev,
- * restore when returned to D0.  Saved separately from pci_saved_state for use
- * by PM capability emulation and separately from pci_dev internal saved state
- * to avoid it being overwritten and consumed around other resets.
- */
-int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
-{
-	struct pci_dev *pdev = vdev->pdev;
-	bool needs_restore = false, needs_save = false;
-	int ret;
-
-	if (vdev->needs_pm_restore) {
-		if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
-			pci_save_state(pdev);
-			needs_save = true;
-		}
-
-		if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
-			needs_restore = true;
-	}
-
-	ret = pci_set_power_state(pdev, state);
-
-	if (!ret) {
-		/* D3 might be unsupported via quirk, skip unless in D3 */
-		if (needs_save && pdev->current_state >= PCI_D3hot) {
-			vdev->pm_save = pci_store_saved_state(pdev);
-		} else if (needs_restore) {
-			pci_load_and_free_saved_state(pdev, &vdev->pm_save);
-			pci_restore_state(pdev);
-		}
-	}
-
-	return ret;
-}
-
 int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -323,6 +272,16 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	/* For needs_reset */
 	lockdep_assert_held(&vdev->vdev.dev_set->lock);
 
+	/*
+	 * The vfio device user can close the device after putting the device
+	 * into runtime suspended state so wake up the device first in
+	 * this case.
+	 */
+	if (vdev->pm_runtime_suspended) {
+		vdev->pm_runtime_suspended = false;
+		pm_runtime_get_sync(&pdev->dev);
+	}
+
 	/* Stop the device from further DMA */
 	pci_clear_master(pdev);
 
@@ -1809,7 +1768,6 @@ void vfio_pci_core_uninit_device(struct vfio_pci_core_device *vdev)
 	mutex_destroy(&vdev->vma_lock);
 	vfio_uninit_group_dev(&vdev->vdev);
 	kfree(vdev->region);
-	kfree(vdev->pm_save);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_uninit_device);
 
@@ -1855,8 +1813,6 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
 	if (ret)
 		goto out_vf;
 
-	vfio_pci_probe_power_state(vdev);
-
 	/*
 	 * pci-core sets the device power state to an unknown value at
 	 * bootup and after being removed from a driver.  The only
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index aafe09c9fa64..2b1a556ce73f 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -123,9 +123,8 @@ struct vfio_pci_core_device {
 	bool			has_vga;
 	bool			needs_reset;
 	bool			nointx;
-	bool			needs_pm_restore;
+	bool                    pm_runtime_suspended;
 	struct pci_saved_state	*pci_saved_state;
-	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
 	struct eventfd_ctx	*err_trigger;
 	struct eventfd_ctx	*req_trigger;
-- 
2.17.1


  parent reply	other threads:[~2021-11-15 13:37 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-15 13:36 [RFC 0/3] vfio/pci: Enable runtime power management support abhsahu
2021-11-15 13:36 ` [RFC 1/3] vfio/pci: register vfio-pci driver with runtime PM framework abhsahu
2021-11-17 17:52   ` Alex Williamson
2021-11-18 15:37     ` Abhishek Sahu
2021-11-15 13:36 ` [RFC 2/3] vfio/pci: virtualize PME related registers bits and initialize to zero abhsahu
2021-11-17 17:53   ` Alex Williamson
2021-11-18 15:24     ` Abhishek Sahu
2021-11-15 13:36 ` abhsahu [this message]
2021-11-17 17:53   ` [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state Alex Williamson
2021-11-18 15:21     ` Abhishek Sahu
2021-11-18 21:09       ` Alex Williamson
2021-11-19 15:45         ` Alex Williamson
2021-11-23  7:27           ` Abhishek Sahu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211115133640.2231-4-abhsahu@nvidia.com \
    --to=abhsahu@nvidia.com \
    --cc=alex.williamson@redhat.com \
    --cc=cohuck@redhat.com \
    --cc=jgg@nvidia.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgurtovoy@nvidia.com \
    --cc=thunder.leizhen@huawei.com \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).