Re: [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state

From: Alex Williamson <alex.williamson@redhat.com>
To: Abhishek Sahu <abhsahu@nvidia.com>
Cc: kvm@vger.kernel.org, Cornelia Huck <cohuck@redhat.com>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Zhen Lei <thunder.leizhen@huawei.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC 3/3] vfio/pci: use runtime PM for vfio-device into low power state
Date: Fri, 19 Nov 2021 08:45:48 -0700	[thread overview]
Message-ID: <20211119084548.2042d763.alex.williamson@redhat.com> (raw)
In-Reply-To: <20211118140913.180bf94f.alex.williamson@redhat.com>

On Thu, 18 Nov 2021 14:09:13 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Thu, 18 Nov 2021 20:51:41 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> > On 11/17/2021 11:23 PM, Alex Williamson wrote:
> >  Thanks Alex for checking this series and providing your inputs. 
> >    
> > > If we're transitioning a device to D3cold rather than D3hot as
> > > requested by userspace, isn't that a user visible change?     
> > 
> >   For most of the driver, in linux kernel, the D3hot vs D3cold
> >   state will be decided at PCI core layer. In the PCI core layer,
> >   pci_target_state() determines which D3 state to choose. It checks
> >   for platform_pci_power_manageable() and then it calls
> >   platform_pci_choose_state() to find the target state.
> >   In VM, the platform_pci_power_manageable() check will fail if the
> >   guest is linux OS. So, it uses, D3hot state.  
> 
> Right, but my statement is really more that the device PM registers
> cannot be used to put the device into D3cold, so the write of the PM
> register that we're trapping was the user/guest's intention to put the
> device into D3hot.  We therefore need to be careful about differences
> in the resulting device state when it comes out of D3cold vs D3hot.
> 
> >   But there are few drivers which does not use the PCI framework
> >   generic power related routines during runtime suspend/system suspend
> >   and set the PCI power state directly with D3hot.  
> 
> Current vfio-pci being one of those ;)
> 
> >   Also, the guest can be non-Linux OS also and, in that case,
> >   it will be difficult to know the behavior. So, it may impact
> >   these cases.  
> 
> That's what I'm worried about.
> 
> > > For instance, a device may report NoSoftRst- indicating that the device
> > > does not do a soft reset on D3hot->D0 transition.  If we're instead
> > > putting the device in D3cold, then a transition back to D0 has very
> > > much undergone a reset.  On one hand we should at least virtualize the
> > > NoSoftRst bit to allow the guest to restore the device, but I wonder if
> > > that's really safe.  Is a better option to prevent entering D3cold if
> > > the device isn't natively reporting NoSoftRst-?
> > >     
> > 
> >  You mean to say NoSoftRst+ instead of NoSoftRst- as visible in  
> 
> Oops yes.  The concern is if the user/guest is not expecting a soft
> reset when using D3hot, but we transparently promote D3hot to D3cold
> which will always implies a device reset.
> 
> >  the lspci output. For NoSoftRst- case, we do a soft reset on
> >  D3hot->D0 transition. But, will this case not be handled internally
> >  in drivers/pci/pci-driver.c ? For both system suspend and runtime suspend,
> >  we check for pci_dev->state_saved flag and do pci_save_state()
> >  irrespective of NoSoftRst bit. For NoSoftRst- case, pci_restore_bars()
> >  will be called in pci_raw_set_power_state() which will reinitialize device
> >  for D3hot/D3cold-> D0 case. Once the device is initialized in the host,
> >  then for guest, it should work without re-initializing again in the
> >  guest side. I am not sure, if my understanding is correct.  
> 
> The soft reset is not limited to the state that the PCI subsystem can
> save and restore.  Device specific state that the user/guest may
> legitimately expect to be retained may be reset as well.
> 
> [PCIe v5 5.3.1.4]
> 	Functional context is required to be maintained by Functions in
> 	the D3 hot state if the No_Soft_Reset field in the PMCSR is Set.
> 
> Unfortunately I don't see a specific definition of "functional
> context", but I interpret that to include device specific state.  For
> example, if a GPU contains specific frame buffer data and reports
> NoSoftRst+, wouldn't it be reasonable to expect that framebuffer data
> to be retained on D3hot->D0 transition?
>  
> > > We're also essentially making a policy decision on behalf of
> > > userspace that favors power saving over latency.  Is that
> > > universally the correct trade-off?     
> > 
> >  For most drivers, the D3hot vs D3cold should not be favored due
> >  to latency reasons. In the linux kernel side, I am seeing, the
> >  PCI framework try to use D3cold state if platform and device
> >  supports that. But its correct that covertly replacing D3hot with
> >  D3cold may be concern for some drivers.
> >   
> > > I can imagine this could be desirable for many use cases,
> > > but if we're going to covertly replace D3hot with D3cold, it seems
> > > like there should be an opt-in.  Is co-opting the PM capability for
> > > this even really acceptable or should there be a device ioctl to
> > > request D3cold and plumbing through QEMU such that a VM guest can
> > > make informed choices regarding device power management?
> > >     
> > 
> >  Making IOCTL is also an option but that case, this support needs to
> >  be added in all hypervisors and user must pass this information
> >  explicitly for each device. Another option could be to use
> >  module parameter to explicitly enable D3cold support. If module
> >  parameter is not set, then we can call pci_d3cold_disable() and
> >  in that case, runtime PM should not use D3cold state.
> > 
> >  Also, I was checking we can pass this information though some
> >  virtualized register bit which will be only defined for passing
> >  the information between guest and host. In the guest side if the
> >  target state is being decided with pci_target_state(), then
> >  the D3cold vs D3hot should not matter for the driver running
> >  in the guest side and in that case, it depends upon platform support.
> >  We can set this virtualize bit to 1. But, if driver is either
> >  setting D3hot state explicitly or has called pci_d3cold_disable() or
> >  similar API available in the guest OS, then set this bit to 0 and
> >  in that case, the D3cold state can be disabled in the host side.
> >  But don't know if is possible to use some non PCI defined
> >  virtualized register bit.   
> 
> If you're suggesting a device config space register, that's troublesome
> because we can't guarantee that simply because a range of config space
> isn't within a capability that it doesn't have some device specific
> purpose.  However, we could certainly implement virtual registers in
> the hypervisor that implement the ACPI support that an OS would use on
> bare metal to implement D3cold.  Those could trigger this ioctl through
> the vfio device.
> 
> >  I am not sure what should be best option to make choice
> >  regarding d3cold but if we can have some option by which this
> >  can be done without involvement of user, then it will benefit
> >  for lot of cases. Currently, the D3cold is supported only in
> >  very few desktops/servers but in future, we will see on
> >  most of the platforms.    
> 
> I tend to see it as an interesting hack to promote D3hot to D3cold, and
> potentially very useful.  However, we're also introducing potentially
> unexpected device behavior, so I think it would probably need to be an
> opt-in.  Possibly if the device reports NoSoftRst- we could use it by
> default, but even virtualizing the NoSoftRst suggests that there's an
> expectation that the guest driver has that support available.
>  
> > > Also if the device is not responsive to config space due to the user
> > > placing it in D3 now, I'd expect there are other ioctl paths that
> > > need to be blocked, maybe even MMIO paths that might be a gap for
> > > existing D3hot support.  Thanks,  
> > 
> >  I was in assumption that most of IOCTL code will be called by the
> >  hypervisor before guest OS boot and during that time, the device
> >  will be always in D0. But, if we have paths where IOCTL can be
> >  called when the device has been suspended by guest OS, then can we
> >  use runtime_get/put API’s there also ?  
> 
> It's more a matter of preventing user actions that can cause harm
> rather than expecting certain operations only in specific states.  We
> could chose to either resume the device for those operations or fail
> the operation.  We should probably also leverage the memory-disable
> support to fault mmap access to MMIO when the device is in D3* as well.

It also occurred to me last night that a guest triggering D3hot via the
PM registers must be a synchronous power state change, we can't use
auto-suspend.  This is necessary for nested assignment where the guest
might use a D3hot->D0 power state transition with NoSoftRst- devices in
order to perform a reset of the device.  With auto-suspend, the guest
would return the device to D0 before the physical device ever timed out
to enter a D3 state.  Thanks,

Alex