kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Alex Williamson <alex.williamson@redhat.com>
To: <abhsahu@nvidia.com>
Cc: <kvm@vger.kernel.org>, Cornelia Huck <cohuck@redhat.com>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Zhen Lei <thunder.leizhen@huawei.com>,
	Jason Gunthorpe <jgg@nvidia.com>, <linux-kernel@vger.kernel.org>
Subject: Re: [RFC 1/3] vfio/pci: register vfio-pci driver with runtime PM framework
Date: Wed, 17 Nov 2021 10:52:54 -0700	[thread overview]
Message-ID: <20211117105254.49227dc1.alex.williamson@redhat.com> (raw)
In-Reply-To: <20211115133640.2231-2-abhsahu@nvidia.com>

On Mon, 15 Nov 2021 19:06:38 +0530
<abhsahu@nvidia.com> wrote:

> From: Abhishek Sahu <abhsahu@nvidia.com>
> 
> Currently, there is very limited power management support
> available in upstream vfio-pci driver. If there is no user of vfio-pci
> device, then the PCI device will be moved into D3Hot state by writing
> directly into PCI PM registers. This D3Hot state help in saving some
> amount of power but we can achieve zero power consumption if we go
> into D3cold state. The D3cold state cannot be possible with Native PCI
> PM. It requires interaction with platform firmware which is
> system-specific. To go into low power states (including D3cold), the
> runtime PM framework can be used which internally interacts with PCI
> and platform firmware and puts the device into the lowest possible
> D-States.
> 
> This patch registers vfio-pci driver with the runtime PM framework.
> 
> 1. The PCI core framework takes care of most of the runtime PM
>    related things. For enabling the runtime PM, the PCI driver needs to
>    decrement the usage count and needs to register the runtime
>    suspend/resume routines. For vfio-pci based driver, these routines can
>    be stubbed since the vfio-pci driver is not doing the PCI device
>    initialization. All the config state saving, and PCI power management
>    related things will be done by PCI core framework itself inside its
>    runtime suspend/resume callbacks.
> 
> 2. To prevent frequent runtime, suspend/resume, it uses autosuspend
>    support with a default delay of 1 second.
> 
> 3. Inside pci_reset_bus(), all the devices in bus/slot will be moved
>    out of D0 state. This state change to D0 can happen directly without
>    going through the runtime PM framework. So if runtime PM is enabled,
>    then pm_runtime_resume() makes the runtime state active. Since the PCI
>    device power state is already D0, so it should return early when it
>    tries to change the state with  pci_set_power_state(). Then
>    pm_request_autosuspend() can be used which will internally check for
>    device usage count and will move the device again into low power
>    state.
> 
> 4. Inside vfio_pci_core_disable(), the device usage count always needs
>    to decremented which was incremented vfio_pci_core_enable() with
>    pm_runtime_get_sync().
> 
> 5. Since the runtime PM framework will provide the same functionality,
>    so directly writing into PCI PM config register can be replaced with
>    use of runtime PM routines. Also, the use of runtime PM can help us in
>    more power saving.
> 
>    In the systems which do not support D3Cold,
> 
>    With the existing implementation:
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3hot
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D0
> 
>    With runtime PM:
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3hot
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D3hot
> 
>    So, with runtime PM, the upstream bridge or root port will also go
>    into lower power state which is not possible with existing
>    implementation.
> 
>    In the systems which support D3Cold,
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3hot
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D0
> 
>    With runtime PM:
> 
>    // PCI device
>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>    D3cold
>    // upstream bridge
>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>    D3cold
> 
>    So, with runtime PM, both the PCI device and upstream bridge will
>    go into D3cold state.
> 
> 6. If 'disable_idle_d3' module parameter is set, then also the runtime
>    PM will be enabled, but in this case, the usage count should not be
>    decremented.
> 
> 7. vfio_pci_dev_set_try_reset() return value is unused now, so this
>    function return type can be changed to void.
> 
> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
> ---
>  drivers/vfio/pci/vfio_pci.c      |   3 +
>  drivers/vfio/pci/vfio_pci_core.c | 104 +++++++++++++++++++++++--------
>  include/linux/vfio_pci_core.h    |   4 ++
>  3 files changed, 84 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index a5ce92beb655..c8695baf3b54 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -193,6 +193,9 @@ static struct pci_driver vfio_pci_driver = {
>  	.remove			= vfio_pci_remove,
>  	.sriov_configure	= vfio_pci_sriov_configure,
>  	.err_handler		= &vfio_pci_core_err_handlers,
> +#if defined(CONFIG_PM)
> +	.driver.pm              = &vfio_pci_core_pm_ops,
> +#endif
>  };
>  
>  static void __init vfio_pci_fill_ids(void)
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index f948e6cd2993..511a52e92b32 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -152,7 +152,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>  }
>  
>  struct vfio_pci_group_info;
> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>  				      struct vfio_pci_group_info *groups);
>  
> @@ -245,7 +245,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>  	u16 cmd;
>  	u8 msix_pos;
>  
> -	vfio_pci_set_power_state(vdev, PCI_D0);
> +	if (!disable_idle_d3)
> +		pm_runtime_get_sync(&pdev->dev);

I'm not a pm_runtime expert, but why are we not using
pm_runtime_resume_and_get() here and aborting the function on error?

I hope some folks more familiar with pm_runtime can also review usage
across this series.

>  
>  	/* Don't allow our initial saved state to include busmaster */
>  	pci_clear_master(pdev);
> @@ -405,8 +406,17 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>  out:
>  	pci_disable_device(pdev);
>  
> -	if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
> -		vfio_pci_set_power_state(vdev, PCI_D3hot);
> +	vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
> +
> +	/*
> +	 * The device usage count always needs to decremented which was
> +	 * incremented in vfio_pci_core_enable() with
> +	 * pm_runtime_get_sync().
> +	 */

*to be

Maybe:

	/*
	 * Put the pm-runtime usage counter acquired during enable; mark
	 * last use time to delay autosuspend.
	 */

> +	if (!disable_idle_d3) {
> +		pm_runtime_mark_last_busy(&pdev->dev);
> +		pm_runtime_put_autosuspend(&pdev->dev);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
>  
> @@ -1847,19 +1857,23 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>  
>  	vfio_pci_probe_power_state(vdev);
>  
> -	if (!disable_idle_d3) {
> -		/*
> -		 * pci-core sets the device power state to an unknown value at
> -		 * bootup and after being removed from a driver.  The only
> -		 * transition it allows from this unknown state is to D0, which
> -		 * typically happens when a driver calls pci_enable_device().
> -		 * We're not ready to enable the device yet, but we do want to
> -		 * be able to get to D3.  Therefore first do a D0 transition
> -		 * before going to D3.
> -		 */
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> -		vfio_pci_set_power_state(vdev, PCI_D3hot);
> -	}
> +	/*
> +	 * pci-core sets the device power state to an unknown value at
> +	 * bootup and after being removed from a driver.  The only
> +	 * transition it allows from this unknown state is to D0, which
> +	 * typically happens when a driver calls pci_enable_device().
> +	 * We're not ready to enable the device yet, but we do want to
> +	 * be able to get to D3.  Therefore first do a D0 transition
> +	 * before enabling runtime PM.
> +	 */
> +	pci_set_power_state(pdev, PCI_D0);
> +	pm_runtime_set_autosuspend_delay(&pdev->dev, 1000);

Let's #define this 1000 at the top of the file with some rationale how
we arrived at this heuristic (ie. avoid magic numbers in code).  Thanks,

Alex

> +	pm_runtime_use_autosuspend(&pdev->dev);
> +	pm_runtime_mark_last_busy(&pdev->dev);
> +	pm_runtime_allow(&pdev->dev);
> +
> +	if (!disable_idle_d3)
> +		pm_runtime_put_autosuspend(&pdev->dev);
>  
>  	ret = vfio_register_group_dev(&vdev->vdev);
>  	if (ret)
> @@ -1868,7 +1882,10 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>  
>  out_power:
>  	if (!disable_idle_d3)
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> +		pm_runtime_get_noresume(&pdev->dev);
> +
> +	pm_runtime_dont_use_autosuspend(&pdev->dev);
> +	pm_runtime_forbid(&pdev->dev);
>  out_vf:
>  	vfio_pci_vf_uninit(vdev);
>  	return ret;
> @@ -1887,7 +1904,10 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>  	vfio_pci_vga_uninit(vdev);
>  
>  	if (!disable_idle_d3)
> -		vfio_pci_set_power_state(vdev, PCI_D0);
> +		pm_runtime_get_noresume(&pdev->dev);
> +
> +	pm_runtime_dont_use_autosuspend(&pdev->dev);
> +	pm_runtime_forbid(&pdev->dev);
>  }
>  EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
>  
> @@ -2093,33 +2113,63 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
>   *  - At least one of the affected devices is marked dirty via
>   *    needs_reset (such as by lack of FLR support)
>   * Then attempt to perform that bus or slot reset.
> - * Returns true if the dev_set was reset.
>   */
> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>  {
>  	struct vfio_pci_core_device *cur;
>  	struct pci_dev *pdev;
>  	int ret;
>  
>  	if (!vfio_pci_dev_set_needs_reset(dev_set))
> -		return false;
> +		return;
>  
>  	pdev = vfio_pci_dev_set_resettable(dev_set);
>  	if (!pdev)
> -		return false;
> +		return;
>  
>  	ret = pci_reset_bus(pdev);
>  	if (ret)
> -		return false;
> +		return;
>  
>  	list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
>  		cur->needs_reset = false;
> -		if (!disable_idle_d3)
> -			vfio_pci_set_power_state(cur, PCI_D3hot);
> +		if (!disable_idle_d3) {
> +			/*
> +			 * Inside pci_reset_bus(), all the devices in bus/slot
> +			 * will be moved out of D0 state. This state change to
> +			 * D0 can happen directly without going through the
> +			 * runtime PM framework. pm_runtime_resume() will
> +			 * help make the runtime state as active and then
> +			 * pm_request_autosuspend() can be used which will
> +			 * internally check for device usage count and will
> +			 * move the device again into the low power state.
> +			 */
> +			pm_runtime_resume(&pdev->dev);
> +			pm_runtime_mark_last_busy(&pdev->dev);
> +			pm_request_autosuspend(&pdev->dev);
> +		}
>  	}
> -	return true;
>  }
>  
> +#ifdef CONFIG_PM
> +static int vfio_pci_core_runtime_suspend(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +static int vfio_pci_core_runtime_resume(struct device *dev)
> +{
> +	return 0;
> +}
> +
> +const struct dev_pm_ops vfio_pci_core_pm_ops = {
> +	SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
> +			   vfio_pci_core_runtime_resume,
> +			   NULL)
> +};
> +EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops);
> +#endif
> +
>  void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
>  			      bool is_disable_idle_d3)
>  {
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index ef9a44b6cf5d..aafe09c9fa64 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -231,6 +231,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
>  void vfio_pci_core_disable(struct vfio_pci_core_device *vdev);
>  void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev);
>  
> +#ifdef CONFIG_PM
> +extern const struct dev_pm_ops vfio_pci_core_pm_ops;
> +#endif
> +
>  static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>  {
>  	return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;


  reply	other threads:[~2021-11-17 17:53 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-15 13:36 [RFC 0/3] vfio/pci: Enable runtime power management support abhsahu
2021-11-15 13:36 ` [RFC 1/3] vfio/pci: register vfio-pci driver with runtime PM framework abhsahu
2021-11-17 17:52   ` Alex Williamson [this message]
2021-11-18 15:37     ` Abhishek Sahu
2021-11-15 13:36 ` [RFC 2/3] vfio/pci: virtualize PME related registers bits and initialize to zero abhsahu
2021-11-17 17:53   ` Alex Williamson
2021-11-18 15:24     ` Abhishek Sahu
2021-11-15 13:36 ` [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state abhsahu
2021-11-17 17:53   ` Alex Williamson
2021-11-18 15:21     ` Abhishek Sahu
2021-11-18 21:09       ` Alex Williamson
2021-11-19 15:45         ` Alex Williamson
2021-11-23  7:27           ` Abhishek Sahu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211117105254.49227dc1.alex.williamson@redhat.com \
    --to=alex.williamson@redhat.com \
    --cc=abhsahu@nvidia.com \
    --cc=cohuck@redhat.com \
    --cc=jgg@nvidia.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgurtovoy@nvidia.com \
    --cc=thunder.leizhen@huawei.com \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).