From: Abhishek Sahu <abhsahu@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: kvm@vger.kernel.org, Cornelia Huck <cohuck@redhat.com>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Zhen Lei <thunder.leizhen@huawei.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state
Date: Thu, 18 Nov 2021 20:51:41 +0530	[thread overview]
Message-ID: <8a49aa97-5de9-9cc8-d45f-e96456d66603@nvidia.com> (raw)
In-Reply-To: <20211117105323.2866b739.alex.williamson@redhat.com>

On 11/17/2021 11:23 PM, Alex Williamson wrote:

> On Mon, 15 Nov 2021 19:06:40 +0530
> <abhsahu@nvidia.com> wrote:
> 
>> From: Abhishek Sahu <abhsahu@nvidia.com>
>>
>> Currently, if runtime power management is enabled for a vfio-pci
>> device in the guest OS, the guest OS writes the PCI_PM_CTRL
>> register. This write request is handled in vfio_pm_config_write(),
>> which performs the actual write of the PCI_PM_CTRL register. With
>> this, D3hot is the lowest power state that can be reached. If we
>> use the runtime PM framework instead, we can reach the D3cold
>> state, which saves the most power.
>>
>> This patch uses the runtime PM framework whenever a vfio-pci
>> device is put into a low power state.
>>
>> 1. If runtime PM is enabled, then instead of directly writing the
>>    PCI_PM_CTRL register, decrement the device usage counter whenever
>>    the requested power state is non-D0. The kernel runtime PM
>>    framework will itself put the PCI device into a low power state
>>    when the usage counter reaches zero. Similarly, when the power
>>    state is changed back to D0, increment the usage counter and the
>>    runtime PM framework will itself bring the PCI device out of the
>>    low power state.
>>
>> 2. The guest OS reads the PCI_PM_CTRL register back to confirm the
>>    current power state, so virtual register bits can be used. For
>>    this, before decrementing the usage count, read the actual
>>    PCI_PM_CTRL register, update the power state related bits, and
>>    then update the vconfig bits corresponding to the PCI_PM_CTRL
>>    offset. For PCI_PM_CTRL register reads, return the virtual value
>>    if runtime PM has been requested. These vconfig bits are cleared
>>    when the power state is changed back to D0.
>>
>> 3. From the guest OS point of view, the PCI power state is still
>>    D3hot even if the actual PCI device has been put into D3cold. In
>>    the D3hot state, the config space can still be read. So, if there
>>    is any request from the guest OS to read the config space, the
>>    actual PCI device must first be moved back to D0. For this,
>>    increment the device usage count before reading/writing the
>>    config space and decrement it again afterwards. There can be
>>    back-to-back config register read/write requests, but since the
>>    autosuspend methods are used here, only the first access wakes up
>>    the PCI device; for the remaining accesses, the device is already
>>    active.
>>
>> 4. The above-mentioned wake-up is not needed if the register
>>    read/write is handled entirely with virtual bits. To handle this
>>    condition, the actual resume of the device is done only in
>>    vfio_user_config_read()/vfio_user_config_write(). All config
>>    register accesses go through vfio_pci_config_rw(). So, in this
>>    function, use pm_runtime_get_noresume() so that only the usage
>>    count is incremented without resuming the device. Inside
>>    vfio_user_config_read()/vfio_user_config_write(), use the
>>    routines with pm_runtime_put_noidle() so that the PCI device
>>    won't be suspended in the lower level functions. Again in the
>>    top level vfio_pci_config_rw(), use pm_runtime_put_autosuspend().
>>    The autosuspend timer is then started and the device will be
>>    suspended again. If the device is already runtime suspended,
>>    this routine returns early.
>>
>> 5. On the host side, D3cold will only be used if the platform
>>    supports it; otherwise some other state will be used. The config
>>    space can still be read as long as the device is not in D3cold,
>>    so in that case we can skip resuming the PCI device. The wrapper
>>    function vfio_pci_config_pm_runtime_get() takes care of this
>>    condition and invokes pm_runtime_resume() only if the current
>>    power state is D3cold.
>>
>> 6. For vfio_pci_config_pm_runtime_get()/
>>    vfio_pci_config_pm_runtime_put(), the reference code is taken
>>    from pci_config_pm_runtime_get()/pci_config_pm_runtime_put() and
>>    modified according to the vfio-pci driver's requirements.
>>
>> 7. vfio_pci_set_power_state() becomes unused after moving to
>>    runtime PM, so this function can be removed along with other
>>    related functions and structure fields.
> 
> 

 Thanks, Alex, for reviewing this series and providing your inputs.
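
 For reference, points (5) and (6) above amount to something like the
 following (a simplified sketch modeled on pci_config_pm_runtime_get()/
 pci_config_pm_runtime_put() in drivers/pci/pci-sysfs.c; the parent
 device handling of the originals is omitted here):

   #include <linux/pci.h>
   #include <linux/pm_runtime.h>

   static void vfio_pci_config_pm_runtime_get(struct pci_dev *pdev)
   {
           /* Take a reference without resuming the device. */
           pm_runtime_get_noresume(&pdev->dev);

           /* Wait for any suspend already in progress to complete. */
           pm_runtime_barrier(&pdev->dev);

           /*
            * Config space is still accessible in D3hot, so a full
            * resume is needed only when the device went to D3cold.
            */
           if (pdev->current_state == PCI_D3cold)
                   pm_runtime_resume(&pdev->dev);
   }

   static void vfio_pci_config_pm_runtime_put(struct pci_dev *pdev)
   {
           /* Drop the reference and restart the autosuspend timer. */
           pm_runtime_mark_last_busy(&pdev->dev);
           pm_runtime_put_autosuspend(&pdev->dev);
   }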
 
> If we're transitioning a device to D3cold rather than D3hot as
> requested by userspace, isn't that a user visible change? 

  For most drivers in the Linux kernel, the D3hot vs. D3cold decision
  is made in the PCI core layer. There, pci_target_state() determines
  which D3 state to choose: it checks platform_pci_power_manageable()
  and then calls platform_pci_choose_state() to find the target state.
  In a VM, the platform_pci_power_manageable() check fails if the
  guest is a Linux OS, so D3hot is used.
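
  Roughly, the relevant logic is (heavily simplified from
  pci_target_state() in drivers/pci/pci.c; the real function also
  handles wakeup configuration and a few more corner cases):

    static pci_power_t pci_target_state(struct pci_dev *dev, bool wakeup)
    {
            /*
             * If the platform (e.g. ACPI) can power-manage the
             * device, it may choose D3cold. In a VM this check
             * typically fails for an assigned device.
             */
            if (platform_pci_power_manageable(dev))
                    return platform_pci_choose_state(dev);

            /* No platform support: D3hot is the deepest state. */
            return dev->pm_cap ? PCI_D3hot : PCI_D0;
    }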
 
  But there are a few drivers which do not use the generic PCI
  power management routines during runtime suspend/system suspend
  and instead set the PCI power state directly to D3hot.
  Also, the guest can be a non-Linux OS, and in that case it is
  difficult to know the behavior. So this change may impact
  those cases.

> For instance, a device may report NoSoftRst- indicating that the device
> does not do a soft reset on D3hot->D0 transition.  If we're instead
> putting the device in D3cold, then a transition back to D0 has very
> much undergone a reset.  On one hand we should at least virtualize the
> NoSoftRst bit to allow the guest to restore the device, but I wonder if
> that's really safe.  Is a better option to prevent entering D3cold if
> the device isn't natively reporting NoSoftRst-?
> 

 You mean to say NoSoftRst+ instead of NoSoftRst-, as visible in
 the lspci output? For the NoSoftRst- case, the device does perform a
 soft reset on the D3hot->D0 transition. But won't this case be
 handled internally in drivers/pci/pci-driver.c? For both system
 suspend and runtime suspend, we check the pci_dev->state_saved flag
 and do pci_save_state() irrespective of the NoSoftRst bit. For the
 NoSoftRst- case, pci_restore_bars() is called from
 pci_raw_set_power_state(), which reinitializes the device for the
 D3hot/D3cold->D0 transition. Once the device has been initialized on
 the host side, it should then work in the guest without being
 reinitialized again. I am not sure if my understanding is correct.
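
 As I read pci_raw_set_power_state() in drivers/pci/pci.c, the
 relevant part is roughly (a simplified fragment, not a complete
 function):

   u16 pmcsr;
   bool need_restore;

   pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);

   /*
    * Returning from D3hot without No_Soft_Reset means the device
    * has undergone an internal reset, so its BARs must be written
    * back after the transition to D0.
    */
   need_restore = (state == PCI_D0 &&
                   (pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D3hot &&
                   !(pmcsr & PCI_PM_CTRL_NO_SOFT_RESET));

   /* ... the new state is written and the transition delay observed ... */

   if (need_restore)
           pci_restore_bars(dev);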

> We're also essentially making a policy decision on behalf of userspace
> that favors power saving over latency.  Is that universally the correct
> trade-off? 

 For most drivers, D3hot should not be favored over D3cold for
 latency reasons. On the Linux kernel side, I see that the PCI
 framework tries to use the D3cold state if the platform and device
 support it. But you are correct that covertly replacing D3hot with
 D3cold may be a concern for some drivers.

> I can imagine this could be desirable for many use cases,
> but if we're going to covertly replace D3hot with D3cold, it seems like
> there should be an opt-in.  Is co-opting the PM capability for this
> even really acceptable or should there be a device ioctl to request
> D3cold and plumbing through QEMU such that a VM guest can make informed
> choices regarding device power management?
> 

 Adding an ioctl is also an option, but in that case this support
 needs to be added in all hypervisors, and the user must pass this
 information explicitly for each device. Another option could be a
 module parameter to explicitly enable D3cold support. If the module
 parameter is not set, then we can call pci_d3cold_disable(), and in
 that case runtime PM will not use the D3cold state.
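
 A minimal sketch of that opt-in (hypothetical; the parameter name
 and the helper are only illustrative, not part of this series):

   #include <linux/module.h>
   #include <linux/pci.h>

   static bool enable_d3cold;
   module_param(enable_d3cold, bool, 0444);
   MODULE_PARM_DESC(enable_d3cold,
                    "Allow runtime PM to use D3cold for vfio-pci devices");

   /* Called from the driver's probe path (placement illustrative). */
   static void vfio_pci_apply_d3cold_policy(struct pci_dev *pdev)
   {
           /* Without the opt-in, cap runtime PM at D3hot. */
           if (!enable_d3cold)
                   pci_d3cold_disable(pdev);
   }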

 Also, I was considering whether we can pass this information through
 some virtualized register bit which would be defined only for
 passing information between guest and host. On the guest side, if
 the target state is decided with pci_target_state(), then D3cold vs.
 D3hot should not matter to the driver running in the guest; it
 depends upon platform support, and we can set this virtualized bit
 to 1. But if the driver either sets the D3hot state explicitly, or
 has called pci_d3cold_disable() or a similar API available in the
 guest OS, then we set this bit to 0, and in that case the D3cold
 state can be disabled on the host side. But I don't know if it is
 possible to use some non-PCI-defined virtualized register bit.

 I am not sure what the best option is for making the choice
 regarding D3cold, but if we can have some option by which this can
 be done without user involvement, it will benefit a lot of cases.
 Currently, D3cold is supported on only very few desktops/servers,
 but in the future we will see it on most platforms.

> Also if the device is not responsive to config space due to the user
> placing it in D3 now, I'd expect there are other ioctl paths that need
> to be blocked, maybe even MMIO paths that might be a gap for existing
> D3hot support.  Thanks,
> 
> Alex
> 

 I was under the assumption that most of the ioctl code is called by
 the hypervisor before the guest OS boots, and during that time the
 device will always be in D0. But if we have paths where an ioctl can
 be called while the device has been suspended by the guest OS, then
 can we use the runtime get/put APIs there as well?
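
 Something like the following is what I have in mind for those paths
 (a hypothetical wrapper; the function names, including
 vfio_pci_do_ioctl(), are illustrative only):

   #include <linux/pci.h>
   #include <linux/pm_runtime.h>

   static long vfio_pci_example_ioctl(struct pci_dev *pdev,
                                      unsigned int cmd, unsigned long arg)
   {
           long ret;

           /* Resume the device if the guest has suspended it. */
           ret = pm_runtime_resume_and_get(&pdev->dev);
           if (ret < 0)
                   return ret;

           ret = vfio_pci_do_ioctl(pdev, cmd, arg); /* the actual work */

           /* Let the device autosuspend again after the grace period. */
           pm_runtime_mark_last_busy(&pdev->dev);
           pm_runtime_put_autosuspend(&pdev->dev);
           return ret;
   }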

 Thanks,
 Abhishek  

