[PATCH v4 6/6] vfio/pci: Add support for virtual PME

From: Abhishek Sahu <abhsahu@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>,
	Cornelia Huck <cohuck@redhat.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>,
	Kevin Tian <kevin.tian@intel.com>,
	"Rafael J . Wysocki" <rafael@kernel.org>
Cc: Max Gurtovoy <mgurtovoy@nvidia.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
	<linux-pm@vger.kernel.org>, <linux-pci@vger.kernel.org>,
	Abhishek Sahu <abhsahu@nvidia.com>
Subject: [PATCH v4 6/6] vfio/pci: Add support for virtual PME
Date: Fri, 1 Jul 2022 16:38:14 +0530	[thread overview]
Message-ID: <20220701110814.7310-7-abhsahu@nvidia.com> (raw)
In-Reply-To: <20220701110814.7310-1-abhsahu@nvidia.com>

If the PCI device is in low power state and the device requires
wake-up, then it can generate PME (Power Management Events). Mostly
these PME events will be propagated to the root port and then the
root port will generate the system interrupt. Then the OS should
identify the device which generated the PME and should resume
the device.

We can implement a similar virtual PME framework where if the device
already went into the runtime suspended state and then there is any
wake-up on the host side, then it will send the virtual PME
notification to the guest. This virtual PME will be helpful for the cases
where the device will not be suspended again if there is any wake-up
triggered by the host. Following is the overall approach regarding
the virtual PME.

1. Add one more event like VFIO_PCI_ERR_IRQ_INDEX named
   VFIO_PCI_PME_IRQ_INDEX and do the required code changes to get/set
   this new IRQ.

2. From the guest side, the guest needs to enable eventfd for the
   virtual PME notification.

3. In the vfio-pci driver, the PME support bits are currently
   virtualized and set to 0. We can set PME capability support for all
   the power states. This PME capability support is independent of the
   physical PME support.

4. The PME enable (PME_En bit in Power Management Control/Status
   Register) and PME status (PME_Status bit in Power Management
   Control/Status Register) are also virtualized currently.
   The write support for PME_En bit can be enabled.

5. The PME_Status bit is a write-1-clear bit where the write with
   zero value will have no effect and write with 1 value will clear the
   bit. The write for this bit will be trapped inside
   vfio_pm_config_write() similar to PCI_PM_CTRL write for PM_STATES.

6. When the host gets a request for resuming the device other than from
   low power exit feature IOCTL, then PME_Status bit will be set.
   According to [PCIe v5 7.5.2.2],
     "PME_Status - This bit is Set when the Function would normally
      generate a PME signal. The value of this bit is not affected by
      the value of the PME_En bit."

   So even if PME_En bit is not set, we can set PME_Status bit.

7. If the guest has enabled PME_En and registered for PME events
   through eventfd, then the usage count will be incremented to prevent
   the device to go into the suspended state and notify the guest through
   eventfd trigger.

The virtual PME can help in handling physical PME also. When
physical PME comes, then also the runtime resume will be called. If
the guest has registered for virtual PME, then it will be sent in this
case also.

* Implementation for handling the virtual PME on the hypervisor:

If we take the implementation in Linux OS, then during runtime suspend
time, then it calls __pci_enable_wake(). It internally enables PME
through pci_pme_active() and also enables the ACPI side wake-up
through platform_pci_set_wakeup(). To handle the PME, the hypervisor has
the following two options:

1. Create a virtual root port for the VFIO device and trigger
   interrupt when the PME comes. It will call pcie_pme_irq() which will
   resume the device.

2. Create a virtual ACPI _PRW resource and associate it with the device
   itself. In _PRW, any GPE (General Purpose Event) can be assigned for
   the wake-up. When PME comes, then GPE can be triggered by the
   hypervisor. GPE interrupt will call pci_acpi_wake_dev() function
   internally and it will resume the device.

Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
---
 drivers/vfio/pci/vfio_pci_config.c | 39 +++++++++++++++++++++------
 drivers/vfio/pci/vfio_pci_core.c   | 43 ++++++++++++++++++++++++------
 drivers/vfio/pci/vfio_pci_intrs.c  | 18 +++++++++++++
 include/linux/vfio_pci_core.h      |  2 ++
 include/uapi/linux/vfio.h          |  1 +
 5 files changed, 87 insertions(+), 16 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index 21a4743d011f..a06375a03758 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -719,6 +719,20 @@ static int vfio_pm_config_write(struct vfio_pci_core_device *vdev, int pos,
 	if (count < 0)
 		return count;
 
+	/*
+	 * PME_STATUS is write-1-clear bit. If PME_STATUS is 1, then clear the
+	 * bit in vconfig. The PME_STATUS is in the upper byte of the control
+	 * register and user can do single byte write also.
+	 */
+	if (offset <= PCI_PM_CTRL + 1 && offset + count > PCI_PM_CTRL + 1) {
+		if (le32_to_cpu(val) &
+		    (PCI_PM_CTRL_PME_STATUS >> (offset - PCI_PM_CTRL) * 8)) {
+			__le16 *ctrl = (__le16 *)&vdev->vconfig
+					[vdev->pm_cap_offset + PCI_PM_CTRL];
+			*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
+		}
+	}
+
 	if (offset == PCI_PM_CTRL) {
 		pci_power_t state;
 
@@ -771,14 +785,16 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
 	 * the user change power state, but we trap and initiate the
 	 * change ourselves, so the state bits are read-only.
 	 *
-	 * The guest can't process PME from D3cold so virtualize PME_Status
-	 * and PME_En bits. The vconfig bits will be cleared during device
-	 * capability initialization.
+	 * The guest can't process physical PME from D3cold so virtualize
+	 * PME_Status and PME_En bits. These bits will be used for the
+	 * virtual PME between host and guest. The vconfig bits will be
+	 * updated during device capability initialization. PME_Status is
+	 * write-1-clear bit, so it is read-only. We trap and update the
+	 * vconfig bit manually during write.
 	 */
 	p_setd(perm, PCI_PM_CTRL,
 	       PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS,
-	       ~(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS |
-		 PCI_PM_CTRL_STATE_MASK));
+	       ~(PCI_PM_CTRL_STATE_MASK | PCI_PM_CTRL_PME_STATUS));
 
 	return 0;
 }
@@ -1454,8 +1470,13 @@ static void vfio_update_pm_vconfig_bytes(struct vfio_pci_core_device *vdev,
 	__le16 *pmc = (__le16 *)&vdev->vconfig[offset + PCI_PM_PMC];
 	__le16 *ctrl = (__le16 *)&vdev->vconfig[offset + PCI_PM_CTRL];
 
-	/* Clear vconfig PME_Support, PME_Status, and PME_En bits */
-	*pmc &= ~cpu_to_le16(PCI_PM_CAP_PME_MASK);
+	/*
+	 * Set the vconfig PME_Support bits. The PME_Status is being used for
+	 * virtual PME support and is not dependent upon the physical
+	 * PME support.
+	 */
+	*pmc |= cpu_to_le16(PCI_PM_CAP_PME_MASK);
+	/* Clear vconfig PME_Support and PME_En bits */
 	*ctrl &= ~cpu_to_le16(PCI_PM_CTRL_PME_ENABLE | PCI_PM_CTRL_PME_STATUS);
 }
 
@@ -1582,8 +1603,10 @@ static int vfio_cap_init(struct vfio_pci_core_device *vdev)
 		if (ret)
 			return ret;
 
-		if (cap == PCI_CAP_ID_PM)
+		if (cap == PCI_CAP_ID_PM) {
+			vdev->pm_cap_offset = pos;
 			vfio_update_pm_vconfig_bytes(vdev, pos);
+		}
 
 		prev = &vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
 		pos = next;
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 1ddaaa6ccef5..6c1225bc2aeb 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -319,14 +319,35 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
 	 *   the low power state or closed the device.
 	 * - If there is device access on the host side.
 	 *
-	 * For the second case, check if re-entry to the low power state is
-	 * allowed. If not, then increment the usage count so that runtime PM
-	 * framework won't suspend the device and set the 'pm_runtime_resumed'
-	 * flag.
+	 * For the second case:
+	 * - The virtual PME_STATUS bit will be set. If PME_ENABLE bit is set
+	 *   and user has registered for virtual PME events, then send the PME
+	 *   virtual PME event.
+	 * - Check if re-entry to the low power state is not allowed.
+	 *
+	 * For the above conditions, increment the usage count so that
+	 * runtime PM framework won't suspend the device and set the
+	 * 'pm_runtime_resumed' flag.
 	 */
-	if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
-		pm_runtime_get_noresume(dev);
-		vdev->pm_runtime_resumed = true;
+	if (vdev->pm_runtime_engaged) {
+		bool pme_triggered = false;
+		__le16 *ctrl = (__le16 *)&vdev->vconfig
+				[vdev->pm_cap_offset + PCI_PM_CTRL];
+
+		*ctrl |= cpu_to_le16(PCI_PM_CTRL_PME_STATUS);
+		if (le16_to_cpu(*ctrl) & PCI_PM_CTRL_PME_ENABLE) {
+			mutex_lock(&vdev->igate);
+			if (vdev->pme_trigger) {
+				pme_triggered = true;
+				eventfd_signal(vdev->pme_trigger, 1);
+			}
+			mutex_unlock(&vdev->igate);
+		}
+
+		if (!vdev->pm_runtime_reentry_allowed || pme_triggered) {
+			pm_runtime_get_noresume(dev);
+			vdev->pm_runtime_resumed = true;
+		}
 	}
 	up_write(&vdev->memory_lock);
 
@@ -586,6 +607,10 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
 		eventfd_ctx_put(vdev->req_trigger);
 		vdev->req_trigger = NULL;
 	}
+	if (vdev->pme_trigger) {
+		eventfd_ctx_put(vdev->pme_trigger);
+		vdev->pme_trigger = NULL;
+	}
 	mutex_unlock(&vdev->igate);
 }
 EXPORT_SYMBOL_GPL(vfio_pci_core_close_device);
@@ -639,7 +664,8 @@ static int vfio_pci_get_irq_count(struct vfio_pci_core_device *vdev, int irq_typ
 	} else if (irq_type == VFIO_PCI_ERR_IRQ_INDEX) {
 		if (pci_is_pcie(vdev->pdev))
 			return 1;
-	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX) {
+	} else if (irq_type == VFIO_PCI_REQ_IRQ_INDEX ||
+		   irq_type == VFIO_PCI_PME_IRQ_INDEX) {
 		return 1;
 	}
 
@@ -985,6 +1011,7 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
 		switch (info.index) {
 		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSIX_IRQ_INDEX:
 		case VFIO_PCI_REQ_IRQ_INDEX:
+		case VFIO_PCI_PME_IRQ_INDEX:
 			break;
 		case VFIO_PCI_ERR_IRQ_INDEX:
 			if (pci_is_pcie(vdev->pdev))
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 1a37db99df48..db4180687a74 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -639,6 +639,17 @@ static int vfio_pci_set_req_trigger(struct vfio_pci_core_device *vdev,
 					       count, flags, data);
 }
 
+static int vfio_pci_set_pme_trigger(struct vfio_pci_core_device *vdev,
+				    unsigned index, unsigned start,
+				    unsigned count, uint32_t flags, void *data)
+{
+	if (index != VFIO_PCI_PME_IRQ_INDEX || start != 0 || count > 1)
+		return -EINVAL;
+
+	return vfio_pci_set_ctx_trigger_single(&vdev->pme_trigger,
+					       count, flags, data);
+}
+
 int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
 			    unsigned index, unsigned start, unsigned count,
 			    void *data)
@@ -688,6 +699,13 @@ int vfio_pci_set_irqs_ioctl(struct vfio_pci_core_device *vdev, uint32_t flags,
 			break;
 		}
 		break;
+	case VFIO_PCI_PME_IRQ_INDEX:
+		switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+		case VFIO_IRQ_SET_ACTION_TRIGGER:
+			func = vfio_pci_set_pme_trigger;
+			break;
+		}
+		break;
 	}
 
 	if (!func)
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 18cc83b767b8..ee2646d820c2 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -102,6 +102,7 @@ struct vfio_pci_core_device {
 	bool			bar_mmap_supported[PCI_STD_NUM_BARS];
 	u8			*pci_config_map;
 	u8			*vconfig;
+	u8			pm_cap_offset;
 	struct perm_bits	*msi_perm;
 	spinlock_t		irqlock;
 	struct mutex		igate;
@@ -133,6 +134,7 @@ struct vfio_pci_core_device {
 	int			ioeventfds_nr;
 	struct eventfd_ctx	*err_trigger;
 	struct eventfd_ctx	*req_trigger;
+	struct eventfd_ctx	*pme_trigger;
 	struct list_head	dummy_resources_list;
 	struct mutex		ioeventfds_lock;
 	struct list_head	ioeventfds_list;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 7e00de5c21ea..08170950d655 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -621,6 +621,7 @@ enum {
 	VFIO_PCI_MSIX_IRQ_INDEX,
 	VFIO_PCI_ERR_IRQ_INDEX,
 	VFIO_PCI_REQ_IRQ_INDEX,
+	VFIO_PCI_PME_IRQ_INDEX,
 	VFIO_PCI_NUM_IRQS
 };
 
-- 
2.17.1