kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Abhishek Sahu <abhsahu@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: kvm@vger.kernel.org, Cornelia Huck <cohuck@redhat.com>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Zhen Lei <thunder.leizhen@huawei.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC 1/3] vfio/pci: register vfio-pci driver with runtime PM framework
Date: Thu, 18 Nov 2021 21:07:12 +0530	[thread overview]
Message-ID: <73d3025f-e720-8062-7b6c-1f2a4c8cb1a6@nvidia.com> (raw)
In-Reply-To: <20211117105254.49227dc1.alex.williamson@redhat.com>

On 11/17/2021 11:22 PM, Alex Williamson wrote:
> On Mon, 15 Nov 2021 19:06:38 +0530
> <abhsahu@nvidia.com> wrote:
> 
>> From: Abhishek Sahu <abhsahu@nvidia.com>
>>
>> Currently, there is very limited power management support
>> available in upstream vfio-pci driver. If there is no user of vfio-pci
>> device, then the PCI device will be moved into D3Hot state by writing
>> directly into PCI PM registers. This D3Hot state help in saving some
>> amount of power but we can achieve zero power consumption if we go
>> into D3cold state. The D3cold state cannot be possible with Native PCI
>> PM. It requires interaction with platform firmware which is
>> system-specific. To go into low power states (including D3cold), the
>> runtime PM framework can be used which internally interacts with PCI
>> and platform firmware and puts the device into the lowest possible
>> D-States.
>>
>> This patch registers vfio-pci driver with the runtime PM framework.
>>
>> 1. The PCI core framework takes care of most of the runtime PM
>>    related things. For enabling the runtime PM, the PCI driver needs to
>>    decrement the usage count and needs to register the runtime
>>    suspend/resume routines. For vfio-pci based driver, these routines can
>>    be stubbed since the vfio-pci driver is not doing the PCI device
>>    initialization. All the config state saving, and PCI power management
>>    related things will be done by PCI core framework itself inside its
>>    runtime suspend/resume callbacks.
>>
>> 2. To prevent frequent runtime, suspend/resume, it uses autosuspend
>>    support with a default delay of 1 second.
>>
>> 3. Inside pci_reset_bus(), all the devices in bus/slot will be moved
>>    out of D0 state. This state change to D0 can happen directly without
>>    going through the runtime PM framework. So if runtime PM is enabled,
>>    then pm_runtime_resume() makes the runtime state active. Since the PCI
>>    device power state is already D0, so it should return early when it
>>    tries to change the state with  pci_set_power_state(). Then
>>    pm_request_autosuspend() can be used which will internally check for
>>    device usage count and will move the device again into low power
>>    state.
>>
>> 4. Inside vfio_pci_core_disable(), the device usage count always needs
>>    to decremented which was incremented vfio_pci_core_enable() with
>>    pm_runtime_get_sync().
>>
>> 5. Since the runtime PM framework will provide the same functionality,
>>    so directly writing into PCI PM config register can be replaced with
>>    use of runtime PM routines. Also, the use of runtime PM can help us in
>>    more power saving.
>>
>>    In the systems which do not support D3Cold,
>>
>>    With the existing implementation:
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3hot
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D0
>>
>>    With runtime PM:
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3hot
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D3hot
>>
>>    So, with runtime PM, the upstream bridge or root port will also go
>>    into lower power state which is not possible with existing
>>    implementation.
>>
>>    In the systems which support D3Cold,
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3hot
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D0
>>
>>    With runtime PM:
>>
>>    // PCI device
>>    # cat /sys/bus/pci/devices/0000\:01\:00.0/power_state
>>    D3cold
>>    // upstream bridge
>>    # cat /sys/bus/pci/devices/0000\:00\:01.0/power_state
>>    D3cold
>>
>>    So, with runtime PM, both the PCI device and upstream bridge will
>>    go into D3cold state.
>>
>> 6. If 'disable_idle_d3' module parameter is set, then also the runtime
>>    PM will be enabled, but in this case, the usage count should not be
>>    decremented.
>>
>> 7. vfio_pci_dev_set_try_reset() return value is unused now, so this
>>    function return type can be changed to void.
>>
>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci.c      |   3 +
>>  drivers/vfio/pci/vfio_pci_core.c | 104 +++++++++++++++++++++++--------
>>  include/linux/vfio_pci_core.h    |   4 ++
>>  3 files changed, 84 insertions(+), 27 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
>> index a5ce92beb655..c8695baf3b54 100644
>> --- a/drivers/vfio/pci/vfio_pci.c
>> +++ b/drivers/vfio/pci/vfio_pci.c
>> @@ -193,6 +193,9 @@ static struct pci_driver vfio_pci_driver = {
>>       .remove                 = vfio_pci_remove,
>>       .sriov_configure        = vfio_pci_sriov_configure,
>>       .err_handler            = &vfio_pci_core_err_handlers,
>> +#if defined(CONFIG_PM)
>> +     .driver.pm              = &vfio_pci_core_pm_ops,
>> +#endif
>>  };
>>
>>  static void __init vfio_pci_fill_ids(void)
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index f948e6cd2993..511a52e92b32 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -152,7 +152,7 @@ static void vfio_pci_probe_mmaps(struct vfio_pci_core_device *vdev)
>>  }
>>
>>  struct vfio_pci_group_info;
>> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set);
>>  static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>                                     struct vfio_pci_group_info *groups);
>>
>> @@ -245,7 +245,8 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev)
>>       u16 cmd;
>>       u8 msix_pos;
>>
>> -     vfio_pci_set_power_state(vdev, PCI_D0);
>> +     if (!disable_idle_d3)
>> +             pm_runtime_get_sync(&pdev->dev);
> 
> I'm not a pm_runtime expert, but why are we not using
> pm_runtime_resume_and_get() here and aborting the function on error?
> 

 Thanks for pointing this.
 We should use pm_runtime_resume_and_get() and will abort the function
 in case of error. I will check other places also and see if
 we can use similar API.

> I hope some folks more familiar with pm_runtime can also review usage
> across this series.
> 

 Yes. The runtime PM API usage requires through review.

>>
>>       /* Don't allow our initial saved state to include busmaster */
>>       pci_clear_master(pdev);
>> @@ -405,8 +406,17 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>  out:
>>       pci_disable_device(pdev);
>>
>> -     if (!vfio_pci_dev_set_try_reset(vdev->vdev.dev_set) && !disable_idle_d3)
>> -             vfio_pci_set_power_state(vdev, PCI_D3hot);
>> +     vfio_pci_dev_set_try_reset(vdev->vdev.dev_set);
>> +
>> +     /*
>> +      * The device usage count always needs to decremented which was
>> +      * incremented in vfio_pci_core_enable() with
>> +      * pm_runtime_get_sync().
>> +      */
> 
> *to be
> 
> Maybe:
> 
>         /*
>          * Put the pm-runtime usage counter acquired during enable; mark
>          * last use time to delay autosuspend.
>          */
> 

 I will fix this.

>> +     if (!disable_idle_d3) {
>> +             pm_runtime_mark_last_busy(&pdev->dev);
>> +             pm_runtime_put_autosuspend(&pdev->dev);
>> +     }
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_pci_core_disable);
>>
>> @@ -1847,19 +1857,23 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>
>>       vfio_pci_probe_power_state(vdev);
>>
>> -     if (!disable_idle_d3) {
>> -             /*
>> -              * pci-core sets the device power state to an unknown value at
>> -              * bootup and after being removed from a driver.  The only
>> -              * transition it allows from this unknown state is to D0, which
>> -              * typically happens when a driver calls pci_enable_device().
>> -              * We're not ready to enable the device yet, but we do want to
>> -              * be able to get to D3.  Therefore first do a D0 transition
>> -              * before going to D3.
>> -              */
>> -             vfio_pci_set_power_state(vdev, PCI_D0);
>> -             vfio_pci_set_power_state(vdev, PCI_D3hot);
>> -     }
>> +     /*
>> +      * pci-core sets the device power state to an unknown value at
>> +      * bootup and after being removed from a driver.  The only
>> +      * transition it allows from this unknown state is to D0, which
>> +      * typically happens when a driver calls pci_enable_device().
>> +      * We're not ready to enable the device yet, but we do want to
>> +      * be able to get to D3.  Therefore first do a D0 transition
>> +      * before enabling runtime PM.
>> +      */
>> +     pci_set_power_state(pdev, PCI_D0);
>> +     pm_runtime_set_autosuspend_delay(&pdev->dev, 1000);
> 
> Let's #define this 1000 at the top of the file with some rationale how
> we arrived at this heuristic (ie. avoid magic numbers in code).  Thanks,
> 
> Alex
> 

 Sure.This autosuspend delay was mainly for back-to-back config read/write
 case. I will move at the top with proper reason.

 Thanks,
 Abhishek 
 
>> +     pm_runtime_use_autosuspend(&pdev->dev);
>> +     pm_runtime_mark_last_busy(&pdev->dev);
>> +     pm_runtime_allow(&pdev->dev);
>> +
>> +     if (!disable_idle_d3)
>> +             pm_runtime_put_autosuspend(&pdev->dev);
>>
>>       ret = vfio_register_group_dev(&vdev->vdev);
>>       if (ret)
>> @@ -1868,7 +1882,10 @@ int vfio_pci_core_register_device(struct vfio_pci_core_device *vdev)
>>
>>  out_power:
>>       if (!disable_idle_d3)
>> -             vfio_pci_set_power_state(vdev, PCI_D0);
>> +             pm_runtime_get_noresume(&pdev->dev);
>> +
>> +     pm_runtime_dont_use_autosuspend(&pdev->dev);
>> +     pm_runtime_forbid(&pdev->dev);
>>  out_vf:
>>       vfio_pci_vf_uninit(vdev);
>>       return ret;
>> @@ -1887,7 +1904,10 @@ void vfio_pci_core_unregister_device(struct vfio_pci_core_device *vdev)
>>       vfio_pci_vga_uninit(vdev);
>>
>>       if (!disable_idle_d3)
>> -             vfio_pci_set_power_state(vdev, PCI_D0);
>> +             pm_runtime_get_noresume(&pdev->dev);
>> +
>> +     pm_runtime_dont_use_autosuspend(&pdev->dev);
>> +     pm_runtime_forbid(&pdev->dev);
>>  }
>>  EXPORT_SYMBOL_GPL(vfio_pci_core_unregister_device);
>>
>> @@ -2093,33 +2113,63 @@ static bool vfio_pci_dev_set_needs_reset(struct vfio_device_set *dev_set)
>>   *  - At least one of the affected devices is marked dirty via
>>   *    needs_reset (such as by lack of FLR support)
>>   * Then attempt to perform that bus or slot reset.
>> - * Returns true if the dev_set was reset.
>>   */
>> -static bool vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>> +static void vfio_pci_dev_set_try_reset(struct vfio_device_set *dev_set)
>>  {
>>       struct vfio_pci_core_device *cur;
>>       struct pci_dev *pdev;
>>       int ret;
>>
>>       if (!vfio_pci_dev_set_needs_reset(dev_set))
>> -             return false;
>> +             return;
>>
>>       pdev = vfio_pci_dev_set_resettable(dev_set);
>>       if (!pdev)
>> -             return false;
>> +             return;
>>
>>       ret = pci_reset_bus(pdev);
>>       if (ret)
>> -             return false;
>> +             return;
>>
>>       list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
>>               cur->needs_reset = false;
>> -             if (!disable_idle_d3)
>> -                     vfio_pci_set_power_state(cur, PCI_D3hot);
>> +             if (!disable_idle_d3) {
>> +                     /*
>> +                      * Inside pci_reset_bus(), all the devices in bus/slot
>> +                      * will be moved out of D0 state. This state change to
>> +                      * D0 can happen directly without going through the
>> +                      * runtime PM framework. pm_runtime_resume() will
>> +                      * help make the runtime state as active and then
>> +                      * pm_request_autosuspend() can be used which will
>> +                      * internally check for device usage count and will
>> +                      * move the device again into the low power state.
>> +                      */
>> +                     pm_runtime_resume(&pdev->dev);
>> +                     pm_runtime_mark_last_busy(&pdev->dev);
>> +                     pm_request_autosuspend(&pdev->dev);
>> +             }
>>       }
>> -     return true;
>>  }
>>
>> +#ifdef CONFIG_PM
>> +static int vfio_pci_core_runtime_suspend(struct device *dev)
>> +{
>> +     return 0;
>> +}
>> +
>> +static int vfio_pci_core_runtime_resume(struct device *dev)
>> +{
>> +     return 0;
>> +}
>> +
>> +const struct dev_pm_ops vfio_pci_core_pm_ops = {
>> +     SET_RUNTIME_PM_OPS(vfio_pci_core_runtime_suspend,
>> +                        vfio_pci_core_runtime_resume,
>> +                        NULL)
>> +};
>> +EXPORT_SYMBOL_GPL(vfio_pci_core_pm_ops);
>> +#endif
>> +
>>  void vfio_pci_core_set_params(bool is_nointxmask, bool is_disable_vga,
>>                             bool is_disable_idle_d3)
>>  {
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index ef9a44b6cf5d..aafe09c9fa64 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -231,6 +231,10 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev);
>>  void vfio_pci_core_disable(struct vfio_pci_core_device *vdev);
>>  void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev);
>>
>> +#ifdef CONFIG_PM
>> +extern const struct dev_pm_ops vfio_pci_core_pm_ops;
>> +#endif
>> +
>>  static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
>>  {
>>       return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
> 


  reply	other threads:[~2021-11-18 15:37 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-15 13:36 [RFC 0/3] vfio/pci: Enable runtime power management support abhsahu
2021-11-15 13:36 ` [RFC 1/3] vfio/pci: register vfio-pci driver with runtime PM framework abhsahu
2021-11-17 17:52   ` Alex Williamson
2021-11-18 15:37     ` Abhishek Sahu [this message]
2021-11-15 13:36 ` [RFC 2/3] vfio/pci: virtualize PME related registers bits and initialize to zero abhsahu
2021-11-17 17:53   ` Alex Williamson
2021-11-18 15:24     ` Abhishek Sahu
2021-11-15 13:36 ` [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state abhsahu
2021-11-17 17:53   ` Alex Williamson
2021-11-18 15:21     ` Abhishek Sahu
2021-11-18 21:09       ` Alex Williamson
2021-11-19 15:45         ` Alex Williamson
2021-11-23  7:27           ` Abhishek Sahu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=73d3025f-e720-8062-7b6c-1f2a4c8cb1a6@nvidia.com \
    --to=abhsahu@nvidia.com \
    --cc=alex.williamson@redhat.com \
    --cc=cohuck@redhat.com \
    --cc=jgg@nvidia.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgurtovoy@nvidia.com \
    --cc=thunder.leizhen@huawei.com \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).