linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Abhishek Sahu <abhsahu@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>,
	Cornelia Huck <cohuck@redhat.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>,
	Kevin Tian <kevin.tian@intel.com>,
	"Rafael J . Wysocki" <rafael@kernel.org>
Cc: Max Gurtovoy <mgurtovoy@nvidia.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	<linux-pm@vger.kernel.org>, <linux-pci@vger.kernel.org>,
	Abhishek Sahu <abhsahu@nvidia.com>
Subject: [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver
Date: Fri, 1 Jul 2022 16:38:13 +0530	[thread overview]
Message-ID: <20220701110814.7310-6-abhsahu@nvidia.com> (raw)
In-Reply-To: <20220701110814.7310-1-abhsahu@nvidia.com>

Some devices (Like NVIDIA VGA or 3D controller) require driver
involvement each time before going into D3cold. In the regular flow,
the guest driver do all the required steps inside the guest OS and
then the IOCTL will be called for D3cold entry by the hypervisor. Now,
if there is any activity on the host side (For example user has run
lspci, dump the config space through sysfs, etc.), then the runtime PM
framework will resume the device first, perform the operation and then
suspend the device again. This second time suspend will be without
guest driver involvement. This patch adds the support to prevent
second-time runtime suspend if there is any wake-up. This prevention
is either based on the predefined vendor/class id list or the user can
specify the flag (VFIO_PM_LOW_POWER_REENTERY_DISABLE) during entry for
the same.

'pm_runtime_reentry_allowed' flag tracks if this re-entry is allowed.
It will be set during the entry time.

'pm_runtime_resumed' flag tracks if there is any wake-up before the
guest performs the wake-up. If re-entry is not allowed, then during
runtime resume, the runtime PM count will be incremented, and this
flag will be set. This flag will be checked during guest D3cold exit
and then skip the runtime PM-related handling if this flag is set.

During guest low power exit time, all vdev power-related flags are
accessed under 'memory_lock' and usage count will be incremented. The
resume will be triggered after releasing the lock since the runtime
resume callback again requires the lock. pm_runtime_get_noresume()/
pm_runtime_resume() have been used instead of
pm_runtime_resume_and_get() to handle the following scenario during
the race condition.

 a. The guest triggered the low power exit.
 b. The guest thread got the lock and cleared the vdev related
    flags and released the locks.
 c. Before pm_runtime_resume_and_get(), the host lspci thread got
    scheduled and triggered the runtime resume.
 d. Now, all the vdev related flags are cleared so there won't be
    any extra handling inside the runtime resume.
 e. The runtime PM put the device again into the suspended state.
 f. The guest triggered pm_runtime_resume_and_get() got called.

So, at step (e), the suspend is happening without the guest driver
involvement. Now, by using pm_runtime_get_noresume() before releasing
'memory_lock', the runtime PM framework can't suspend the device due
to incremented usage count.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 87 ++++++++++++++++++++++++++++++--
 include/linux/vfio_pci_core.h    |  2 +
 2 files changed, 84 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 8c17ca41d156..1ddaaa6ccef5 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -191,6 +191,20 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
 	return false;
 }
 
+static bool vfio_pci_low_power_reentry_allowed(struct pci_dev *pdev)
+{
+	/*
+	 * The NVIDIA display class requires driver involvement for every
+	 * D3cold entry. The audio and other classes can go into D3cold
+	 * without driver involvement.
+	 */
+	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
+	    ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY))
+		return false;
+
+	return true;
+}
+
 static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -295,6 +309,27 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
 	if (vdev->pm_intx_masked)
 		vfio_pci_intx_unmask(vdev);
 
+	down_write(&vdev->memory_lock);
+
+	/*
+	 * The runtime resume callback will be called for one of the following
+	 * two cases:
+	 *
+	 * - If the user has called IOCTL explicitly to move the device out of
+	 *   the low power state or closed the device.
+	 * - If there is device access on the host side.
+	 *
+	 * For the second case, check if re-entry to the low power state is
+	 * allowed. If not, then increment the usage count so that runtime PM
+	 * framework won't suspend the device and set the 'pm_runtime_resumed'
+	 * flag.
+	 */
+	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
+		pm_runtime_get_noresume(dev);
+		vdev->pm_runtime_resumed = true;
+	}
+	up_write(&vdev->memory_lock);
+
 	return 0;
 }
 #endif /* CONFIG_PM */
@@ -415,9 +450,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	 */
 	down_write(&vdev->memory_lock);
 	if (vdev->pm_runtime_engaged) {
+		if (!vdev->pm_runtime_resumed) {
+			pm_runtime_get_noresume(&pdev->dev);
+			do_resume = true;
+		}
+		vdev->pm_runtime_resumed = false;
 		vdev->pm_runtime_engaged = false;
-		pm_runtime_get_noresume(&pdev->dev);
-		do_resume = true;
 	}
 	up_write(&vdev->memory_lock);
 
@@ -1227,12 +1265,17 @@ static int vfio_pci_pm_validate_flags(u32 flags)
 	if (!flags)
 		return -EINVAL;
 	/* Only valid flags should be set */
-	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
+	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT |
+		      VFIO_PM_LOW_POWER_REENTERY_DISABLE))
 		return -EINVAL;
 	/* Both enter and exit should not be set */
 	if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
 	    (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
 		return -EINVAL;
+	/* re-entry disable can only be set with enter */
+	if ((flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE) &&
+	    !(flags & VFIO_PM_LOW_POWER_ENTER))
+		return -EINVAL;
 
 	return 0;
 }
@@ -1255,10 +1298,17 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
 
 	if (flags & VFIO_DEVICE_FEATURE_GET) {
 		down_read(&vdev->memory_lock);
-		if (vdev->pm_runtime_engaged)
+		if (vdev->pm_runtime_engaged) {
 			vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
-		else
+			if (!vdev->pm_runtime_reentry_allowed)
+				vfio_pm.flags |=
+					VFIO_PM_LOW_POWER_REENTERY_DISABLE;
+		} else {
 			vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
+			if (!vfio_pci_low_power_reentry_allowed(pdev))
+				vfio_pm.flags |=
+					VFIO_PM_LOW_POWER_REENTERY_DISABLE;
+		}
 		up_read(&vdev->memory_lock);
 
 		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
@@ -1286,6 +1336,19 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
 		}
 
 		vdev->pm_runtime_engaged = true;
+		vdev->pm_runtime_resumed = false;
+
+		/*
+		 * If there is any access when the device is in the runtime
+		 * suspended state, then the device will be resumed first
+		 * before access and then the device will be suspended again.
+		 * Check if this second time suspend is allowed and track the
+		 * same in 'pm_runtime_reentry_allowed' flag.
+		 */
+		vdev->pm_runtime_reentry_allowed =
+			vfio_pci_low_power_reentry_allowed(pdev) &&
+			!(vfio_pm.flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE);
+
 		up_write(&vdev->memory_lock);
 		pm_runtime_put(&pdev->dev);
 	} else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
@@ -1296,6 +1359,20 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
 		}
 
 		vdev->pm_runtime_engaged = false;
+		if (vdev->pm_runtime_resumed) {
+			vdev->pm_runtime_resumed = false;
+			up_write(&vdev->memory_lock);
+			return 0;
+		}
+
+		/*
+		 * The 'memory_lock' will be acquired again inside the runtime
+		 * resume callback. So, increment the usage count inside the
+		 * lock and call pm_runtime_resume() after releasing the lock.
+		 * If there is any race condition between the wake-up generated
+		 * at the host and the current path. Then the incremented usage
+		 * count will prevent the device to go into the suspended state.
+		 */
 		pm_runtime_get_noresume(&pdev->dev);
 		up_write(&vdev->memory_lock);
 		ret = pm_runtime_resume(&pdev->dev);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index bf4823b008f9..18cc83b767b8 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -126,6 +126,8 @@ struct vfio_pci_core_device {
 	bool			needs_pm_restore;
 	bool			pm_intx_masked;
 	bool			pm_runtime_engaged;
+	bool			pm_runtime_resumed;
+	bool			pm_runtime_reentry_allowed;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
-- 
2.17.1


  parent reply	other threads:[~2022-07-01 11:09 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
2022-07-06 15:39   ` Alex Williamson
2022-07-08  9:21     ` Abhishek Sahu
2022-07-08 15:45       ` Alex Williamson
2022-07-11  9:18         ` Abhishek Sahu
2022-07-11 12:57           ` Alex Williamson
2022-07-01 11:08 ` [PATCH v4 2/6] vfio: Add a new device feature for the power management Abhishek Sahu
2022-07-06 15:39   ` Alex Williamson
2022-07-08  9:39     ` Abhishek Sahu
2022-07-08 16:36       ` Alex Williamson
2022-07-11  9:43         ` Abhishek Sahu
2022-07-11 13:04           ` Alex Williamson
2022-07-11 17:30             ` Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson
2022-07-08  9:43     ` Abhishek Sahu
2022-07-08 16:49       ` Alex Williamson
2022-07-11  9:50         ` Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson
2022-07-01 11:08 ` Abhishek Sahu [this message]
2022-07-06 15:40   ` [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver Alex Williamson
2022-07-01 11:08 ` [PATCH v4 6/6] vfio/pci: Add support for virtual PME Abhishek Sahu
2022-07-06 15:40   ` Alex Williamson
2022-07-08  9:45     ` Abhishek Sahu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220701110814.7310-6-abhsahu@nvidia.com \
    --to=abhsahu@nvidia.com \
    --cc=alex.williamson@redhat.com \
    --cc=bhelgaas@google.com \
    --cc=cohuck@redhat.com \
    --cc=jgg@nvidia.com \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=mgurtovoy@nvidia.com \
    --cc=rafael@kernel.org \
    --cc=shameerali.kolothum.thodi@huawei.com \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).