[PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver

From: Abhishek Sahu <abhsahu@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>,
	Cornelia Huck <cohuck@redhat.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>,
	Kevin Tian <kevin.tian@intel.com>,
	"Rafael J . Wysocki" <rafael@kernel.org>
Cc: Max Gurtovoy <mgurtovoy@nvidia.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	<linux-pm@vger.kernel.org>, <linux-pci@vger.kernel.org>,
	Abhishek Sahu <abhsahu@nvidia.com>
Subject: [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver
Date: Fri, 1 Jul 2022 16:38:13 +0530	[thread overview]
Message-ID: <20220701110814.7310-6-abhsahu@nvidia.com> (raw)
In-Reply-To: <20220701110814.7310-1-abhsahu@nvidia.com>

Some devices (Like NVIDIA VGA or 3D controller) require driver
involvement each time before going into D3cold. In the regular flow,
the guest driver do all the required steps inside the guest OS and
then the IOCTL will be called for D3cold entry by the hypervisor. Now,
if there is any activity on the host side (For example user has run
lspci, dump the config space through sysfs, etc.), then the runtime PM
framework will resume the device first, perform the operation and then
suspend the device again. This second time suspend will be without
guest driver involvement. This patch adds the support to prevent
second-time runtime suspend if there is any wake-up. This prevention
is either based on the predefined vendor/class id list or the user can
specify the flag (VFIO_PM_LOW_POWER_REENTERY_DISABLE) during entry for
the same.

'pm_runtime_reentry_allowed' flag tracks if this re-entry is allowed.
It will be set during the entry time.

'pm_runtime_resumed' flag tracks if there is any wake-up before the
guest performs the wake-up. If re-entry is not allowed, then during
runtime resume, the runtime PM count will be incremented, and this
flag will be set. This flag will be checked during guest D3cold exit
and then skip the runtime PM-related handling if this flag is set.

During guest low power exit time, all vdev power-related flags are
accessed under 'memory_lock' and usage count will be incremented. The
resume will be triggered after releasing the lock since the runtime
resume callback again requires the lock. pm_runtime_get_noresume()/
pm_runtime_resume() have been used instead of
pm_runtime_resume_and_get() to handle the following scenario during
the race condition.

 a. The guest triggered the low power exit.
 b. The guest thread got the lock and cleared the vdev related
    flags and released the locks.
 c. Before pm_runtime_resume_and_get(), the host lspci thread got
    scheduled and triggered the runtime resume.
 d. Now, all the vdev related flags are cleared so there won't be
    any extra handling inside the runtime resume.
 e. The runtime PM put the device again into the suspended state.
 f. The guest triggered pm_runtime_resume_and_get() got called.

So, at step (e), the suspend is happening without the guest driver
involvement. Now, by using pm_runtime_get_noresume() before releasing
'memory_lock', the runtime PM framework can't suspend the device due
to incremented usage count.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_core.c | 87 ++++++++++++++++++++++++++++++--
 include/linux/vfio_pci_core.h    |  2 +
 2 files changed, 84 insertions(+), 5 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 8c17ca41d156..1ddaaa6ccef5 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -191,6 +191,20 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
 	return false;
 }
 
+static bool vfio_pci_low_power_reentry_allowed(struct pci_dev *pdev)
+{
+	/*
+	 * The NVIDIA display class requires driver involvement for every
+	 * D3cold entry. The audio and other classes can go into D3cold
+	 * without driver involvement.
+	 */
+	if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
+	    ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY))
+		return false;
+
+	return true;
+}
+
 static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
 {
 	struct pci_dev *pdev = vdev->pdev;
@@ -295,6 +309,27 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
 	if (vdev->pm_intx_masked)
 		vfio_pci_intx_unmask(vdev);
 
+	down_write(&vdev->memory_lock);
+
+	/*
+	 * The runtime resume callback will be called for one of the following
+	 * two cases:
+	 *
+	 * - If the user has called IOCTL explicitly to move the device out of
+	 *   the low power state or closed the device.
+	 * - If there is device access on the host side.
+	 *
+	 * For the second case, check if re-entry to the low power state is
+	 * allowed. If not, then increment the usage count so that runtime PM
+	 * framework won't suspend the device and set the 'pm_runtime_resumed'
+	 * flag.
+	 */
+	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
+		pm_runtime_get_noresume(dev);
+		vdev->pm_runtime_resumed = true;
+	}
+	up_write(&vdev->memory_lock);
+
 	return 0;
 }
 #endif /* CONFIG_PM */
@@ -415,9 +450,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
 	 */
 	down_write(&vdev->memory_lock);
 	if (vdev->pm_runtime_engaged) {
+		if (!vdev->pm_runtime_resumed) {
+			pm_runtime_get_noresume(&pdev->dev);
+			do_resume = true;
+		}
+		vdev->pm_runtime_resumed = false;
 		vdev->pm_runtime_engaged = false;
-		pm_runtime_get_noresume(&pdev->dev);
-		do_resume = true;
 	}
 	up_write(&vdev->memory_lock);
 
@@ -1227,12 +1265,17 @@ static int vfio_pci_pm_validate_flags(u32 flags)
 	if (!flags)
 		return -EINVAL;
 	/* Only valid flags should be set */
-	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
+	if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT |
+		      VFIO_PM_LOW_POWER_REENTERY_DISABLE))
 		return -EINVAL;
 	/* Both enter and exit should not be set */
 	if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
 	    (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
 		return -EINVAL;
+	/* re-entry disable can only be set with enter */
+	if ((flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE) &&
+	    !(flags & VFIO_PM_LOW_POWER_ENTER))
+		return -EINVAL;
 
 	return 0;
 }
@@ -1255,10 +1298,17 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
 
 	if (flags & VFIO_DEVICE_FEATURE_GET) {
 		down_read(&vdev->memory_lock);
-		if (vdev->pm_runtime_engaged)
+		if (vdev->pm_runtime_engaged) {
 			vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
-		else
+			if (!vdev->pm_runtime_reentry_allowed)
+				vfio_pm.flags |=
+					VFIO_PM_LOW_POWER_REENTERY_DISABLE;
+		} else {
 			vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
+			if (!vfio_pci_low_power_reentry_allowed(pdev))
+				vfio_pm.flags |=
+					VFIO_PM_LOW_POWER_REENTERY_DISABLE;
+		}
 		up_read(&vdev->memory_lock);
 
 		if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
@@ -1286,6 +1336,19 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
 		}
 
 		vdev->pm_runtime_engaged = true;
+		vdev->pm_runtime_resumed = false;
+
+		/*
+		 * If there is any access when the device is in the runtime
+		 * suspended state, then the device will be resumed first
+		 * before access and then the device will be suspended again.
+		 * Check if this second time suspend is allowed and track the
+		 * same in 'pm_runtime_reentry_allowed' flag.
+		 */
+		vdev->pm_runtime_reentry_allowed =
+			vfio_pci_low_power_reentry_allowed(pdev) &&
+			!(vfio_pm.flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE);
+
 		up_write(&vdev->memory_lock);
 		pm_runtime_put(&pdev->dev);
 	} else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
@@ -1296,6 +1359,20 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
 		}
 
 		vdev->pm_runtime_engaged = false;
+		if (vdev->pm_runtime_resumed) {
+			vdev->pm_runtime_resumed = false;
+			up_write(&vdev->memory_lock);
+			return 0;
+		}
+
+		/*
+		 * The 'memory_lock' will be acquired again inside the runtime
+		 * resume callback. So, increment the usage count inside the
+		 * lock and call pm_runtime_resume() after releasing the lock.
+		 * If there is any race condition between the wake-up generated
+		 * at the host and the current path. Then the incremented usage
+		 * count will prevent the device to go into the suspended state.
+		 */
 		pm_runtime_get_noresume(&pdev->dev);
 		up_write(&vdev->memory_lock);
 		ret = pm_runtime_resume(&pdev->dev);
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index bf4823b008f9..18cc83b767b8 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -126,6 +126,8 @@ struct vfio_pci_core_device {
 	bool			needs_pm_restore;
 	bool			pm_intx_masked;
 	bool			pm_runtime_engaged;
+	bool			pm_runtime_resumed;
+	bool			pm_runtime_reentry_allowed;
 	struct pci_saved_state	*pci_saved_state;
 	struct pci_saved_state	*pm_save;
 	int			ioeventfds_nr;
-- 
2.17.1