From: Alex Williamson <alex.williamson@redhat.com>
To: Abhishek Sahu <abhsahu@nvidia.com>
Cc: Cornelia Huck <cohuck@redhat.com>,
Yishai Hadas <yishaih@nvidia.com>,
Jason Gunthorpe <jgg@nvidia.com>,
Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>,
Kevin Tian <kevin.tian@intel.com>,
"Rafael J . Wysocki" <rafael@kernel.org>,
Max Gurtovoy <mgurtovoy@nvidia.com>,
Bjorn Helgaas <bhelgaas@google.com>,
<linux-kernel@vger.kernel.org>, <kvm@vger.kernel.org>,
<linux-pm@vger.kernel.org>, <linux-pci@vger.kernel.org>
Subject: Re: [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver
Date: Wed, 6 Jul 2022 09:40:21 -0600 [thread overview]
Message-ID: <20220706094021.39b0ef73.alex.williamson@redhat.com> (raw)
In-Reply-To: <20220701110814.7310-6-abhsahu@nvidia.com>
On Fri, 1 Jul 2022 16:38:13 +0530
Abhishek Sahu <abhsahu@nvidia.com> wrote:
> Some devices (Like NVIDIA VGA or 3D controller) require driver
> involvement each time before going into D3cold. In the regular flow,
> the guest driver do all the required steps inside the guest OS and
> then the IOCTL will be called for D3cold entry by the hypervisor. Now,
> if there is any activity on the host side (For example user has run
> lspci, dump the config space through sysfs, etc.), then the runtime PM
> framework will resume the device first, perform the operation and then
> suspend the device again. This second time suspend will be without
> guest driver involvement. This patch adds the support to prevent
> second-time runtime suspend if there is any wake-up. This prevention
> is either based on the predefined vendor/class id list or the user can
> specify the flag (VFIO_PM_LOW_POWER_REENTERY_DISABLE) during entry for
> the same.
>
> 'pm_runtime_reentry_allowed' flag tracks if this re-entry is allowed.
> It will be set during the entry time.
>
> 'pm_runtime_resumed' flag tracks if there is any wake-up before the
> guest performs the wake-up. If re-entry is not allowed, then during
> runtime resume, the runtime PM count will be incremented, and this
> flag will be set. This flag will be checked during guest D3cold exit
> and then skip the runtime PM-related handling if this flag is set.
>
> During guest low power exit time, all vdev power-related flags are
> accessed under 'memory_lock' and usage count will be incremented. The
> resume will be triggered after releasing the lock since the runtime
> resume callback again requires the lock. pm_runtime_get_noresume()/
> pm_runtime_resume() have been used instead of
> pm_runtime_resume_and_get() to handle the following scenario during
> the race condition.
>
> a. The guest triggered the low power exit.
> b. The guest thread got the lock and cleared the vdev related
> flags and released the locks.
> c. Before pm_runtime_resume_and_get(), the host lspci thread got
> scheduled and triggered the runtime resume.
> d. Now, all the vdev related flags are cleared so there won't be
> any extra handling inside the runtime resume.
> e. The runtime PM put the device again into the suspended state.
> f. The guest triggered pm_runtime_resume_and_get() got called.
>
> So, at step (e), the suspend is happening without the guest driver
> involvement. Now, by using pm_runtime_get_noresume() before releasing
> 'memory_lock', the runtime PM framework can't suspend the device due
> to incremented usage count.
Nak, this policy should be implemented in userspace. Thanks,
Alex
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
> drivers/vfio/pci/vfio_pci_core.c | 87 ++++++++++++++++++++++++++++++--
> include/linux/vfio_pci_core.h | 2 +
> 2 files changed, 84 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 8c17ca41d156..1ddaaa6ccef5 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -191,6 +191,20 @@ static bool vfio_pci_nointx(struct pci_dev *pdev)
> return false;
> }
>
> +static bool vfio_pci_low_power_reentry_allowed(struct pci_dev *pdev)
> +{
> + /*
> + * The NVIDIA display class requires driver involvement for every
> + * D3cold entry. The audio and other classes can go into D3cold
> + * without driver involvement.
> + */
> + if (pdev->vendor == PCI_VENDOR_ID_NVIDIA &&
> + ((pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY))
> + return false;
> +
> + return true;
> +}
> +
> static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
> {
> struct pci_dev *pdev = vdev->pdev;
> @@ -295,6 +309,27 @@ static int vfio_pci_core_runtime_resume(struct device *dev)
> if (vdev->pm_intx_masked)
> vfio_pci_intx_unmask(vdev);
>
> + down_write(&vdev->memory_lock);
> +
> + /*
> + * The runtime resume callback will be called for one of the following
> + * two cases:
> + *
> + * - If the user has called IOCTL explicitly to move the device out of
> + * the low power state or closed the device.
> + * - If there is device access on the host side.
> + *
> + * For the second case, check if re-entry to the low power state is
> + * allowed. If not, then increment the usage count so that runtime PM
> + * framework won't suspend the device and set the 'pm_runtime_resumed'
> + * flag.
> + */
> + if (vdev->pm_runtime_engaged && !vdev->pm_runtime_reentry_allowed) {
> + pm_runtime_get_noresume(dev);
> + vdev->pm_runtime_resumed = true;
> + }
> + up_write(&vdev->memory_lock);
> +
> return 0;
> }
> #endif /* CONFIG_PM */
> @@ -415,9 +450,12 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
> */
> down_write(&vdev->memory_lock);
> if (vdev->pm_runtime_engaged) {
> + if (!vdev->pm_runtime_resumed) {
> + pm_runtime_get_noresume(&pdev->dev);
> + do_resume = true;
> + }
> + vdev->pm_runtime_resumed = false;
> vdev->pm_runtime_engaged = false;
> - pm_runtime_get_noresume(&pdev->dev);
> - do_resume = true;
> }
> up_write(&vdev->memory_lock);
>
> @@ -1227,12 +1265,17 @@ static int vfio_pci_pm_validate_flags(u32 flags)
> if (!flags)
> return -EINVAL;
> /* Only valid flags should be set */
> - if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> + if (flags & ~(VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT |
> + VFIO_PM_LOW_POWER_REENTERY_DISABLE))
> return -EINVAL;
> /* Both enter and exit should not be set */
> if ((flags & (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT)) ==
> (VFIO_PM_LOW_POWER_ENTER | VFIO_PM_LOW_POWER_EXIT))
> return -EINVAL;
> + /* re-entry disable can only be set with enter */
> + if ((flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE) &&
> + !(flags & VFIO_PM_LOW_POWER_ENTER))
> + return -EINVAL;
>
> return 0;
> }
> @@ -1255,10 +1298,17 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
>
> if (flags & VFIO_DEVICE_FEATURE_GET) {
> down_read(&vdev->memory_lock);
> - if (vdev->pm_runtime_engaged)
> + if (vdev->pm_runtime_engaged) {
> vfio_pm.flags = VFIO_PM_LOW_POWER_ENTER;
> - else
> + if (!vdev->pm_runtime_reentry_allowed)
> + vfio_pm.flags |=
> + VFIO_PM_LOW_POWER_REENTERY_DISABLE;
> + } else {
> vfio_pm.flags = VFIO_PM_LOW_POWER_EXIT;
> + if (!vfio_pci_low_power_reentry_allowed(pdev))
> + vfio_pm.flags |=
> + VFIO_PM_LOW_POWER_REENTERY_DISABLE;
> + }
> up_read(&vdev->memory_lock);
>
> if (copy_to_user(arg, &vfio_pm, sizeof(vfio_pm)))
> @@ -1286,6 +1336,19 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
> }
>
> vdev->pm_runtime_engaged = true;
> + vdev->pm_runtime_resumed = false;
> +
> + /*
> + * If there is any access when the device is in the runtime
> + * suspended state, then the device will be resumed first
> + * before access and then the device will be suspended again.
> + * Check if this second time suspend is allowed and track the
> + * same in 'pm_runtime_reentry_allowed' flag.
> + */
> + vdev->pm_runtime_reentry_allowed =
> + vfio_pci_low_power_reentry_allowed(pdev) &&
> + !(vfio_pm.flags & VFIO_PM_LOW_POWER_REENTERY_DISABLE);
> +
> up_write(&vdev->memory_lock);
> pm_runtime_put(&pdev->dev);
> } else if (vfio_pm.flags & VFIO_PM_LOW_POWER_EXIT) {
> @@ -1296,6 +1359,20 @@ static int vfio_pci_core_feature_pm(struct vfio_device *device, u32 flags,
> }
>
> vdev->pm_runtime_engaged = false;
> + if (vdev->pm_runtime_resumed) {
> + vdev->pm_runtime_resumed = false;
> + up_write(&vdev->memory_lock);
> + return 0;
> + }
> +
> + /*
> + * The 'memory_lock' will be acquired again inside the runtime
> + * resume callback. So, increment the usage count inside the
> + * lock and call pm_runtime_resume() after releasing the lock.
> + * If there is any race condition between the wake-up generated
> + * at the host and the current path. Then the incremented usage
> + * count will prevent the device to go into the suspended state.
> + */
> pm_runtime_get_noresume(&pdev->dev);
> up_write(&vdev->memory_lock);
> ret = pm_runtime_resume(&pdev->dev);
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index bf4823b008f9..18cc83b767b8 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -126,6 +126,8 @@ struct vfio_pci_core_device {
> bool needs_pm_restore;
> bool pm_intx_masked;
> bool pm_runtime_engaged;
> + bool pm_runtime_resumed;
> + bool pm_runtime_reentry_allowed;
> struct pci_saved_state *pci_saved_state;
> struct pci_saved_state *pm_save;
> int ioeventfds_nr;
next prev parent reply other threads:[~2022-07-06 15:48 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-07-01 11:08 [PATCH v4 0/6] vfio/pci: power management changes Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 1/6] vfio/pci: Mask INTx during runtime suspend Abhishek Sahu
2022-07-06 15:39 ` Alex Williamson
2022-07-08 9:21 ` Abhishek Sahu
2022-07-08 15:45 ` Alex Williamson
2022-07-11 9:18 ` Abhishek Sahu
2022-07-11 12:57 ` Alex Williamson
2022-07-01 11:08 ` [PATCH v4 2/6] vfio: Add a new device feature for the power management Abhishek Sahu
2022-07-06 15:39 ` Alex Williamson
2022-07-08 9:39 ` Abhishek Sahu
2022-07-08 16:36 ` Alex Williamson
2022-07-11 9:43 ` Abhishek Sahu
2022-07-11 13:04 ` Alex Williamson
2022-07-11 17:30 ` Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 3/6] vfio: Increment the runtime PM usage count during IOCTL call Abhishek Sahu
2022-07-06 15:40 ` Alex Williamson
2022-07-08 9:43 ` Abhishek Sahu
2022-07-08 16:49 ` Alex Williamson
2022-07-11 9:50 ` Abhishek Sahu
2022-07-01 11:08 ` [PATCH v4 4/6] vfio/pci: Add the support for PCI D3cold state Abhishek Sahu
2022-07-06 15:40 ` Alex Williamson
2022-07-01 11:08 ` [PATCH v4 5/6] vfio/pci: Prevent low power re-entry without guest driver Abhishek Sahu
2022-07-06 15:40 ` Alex Williamson [this message]
2022-07-01 11:08 ` [PATCH v4 6/6] vfio/pci: Add support for virtual PME Abhishek Sahu
2022-07-06 15:40 ` Alex Williamson
2022-07-08 9:45 ` Abhishek Sahu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220706094021.39b0ef73.alex.williamson@redhat.com \
--to=alex.williamson@redhat.com \
--cc=abhsahu@nvidia.com \
--cc=bhelgaas@google.com \
--cc=cohuck@redhat.com \
--cc=jgg@nvidia.com \
--cc=kevin.tian@intel.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=linux-pm@vger.kernel.org \
--cc=mgurtovoy@nvidia.com \
--cc=rafael@kernel.org \
--cc=shameerali.kolothum.thodi@huawei.com \
--cc=yishaih@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).