Re: [PATCH v4 2/6] vfio: Add a new device feature for the power management

From: Abhishek Sahu <abhsahu@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>,
	Kevin Tian <kevin.tian@intel.com>,
	"Rafael J . Wysocki" <rafael@kernel.org>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	linux-pm@vger.kernel.org, linux-pci@vger.kernel.org
Subject: Re: [PATCH v4 2/6] vfio: Add a new device feature for the power management
Date: Mon, 11 Jul 2022 23:00:10 +0530	[thread overview]
Message-ID: <99be22e1-db69-1ecf-a739-e4f603c96f6a@nvidia.com> (raw)
In-Reply-To: <20220711070446.2bd6ac80.alex.williamson@redhat.com>

On 7/11/2022 6:34 PM, Alex Williamson wrote:
> On Mon, 11 Jul 2022 15:13:13 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> On 7/8/2022 10:06 PM, Alex Williamson wrote:
>>> On Fri, 8 Jul 2022 15:09:22 +0530
>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>   
>>>> On 7/6/2022 9:09 PM, Alex Williamson wrote:  
>>>>> On Fri, 1 Jul 2022 16:38:10 +0530
>>>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>>>     
>>>>>> This patch adds the new feature VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
>>>>>> for the power management in the header file. The implementation for the
>>>>>> same will be added in the subsequent patches.
>>>>>>
>>>>>> With the standard registers, all power states cannot be achieved. The
>>>>>> platform-based power management needs to be involved to go into the
>>>>>> lowest power state. For all the platform-based power management, this
>>>>>> device feature can be used.
>>>>>>
>>>>>> This device feature uses flags to specify the different operations. In
>>>>>> the future, if any more power management functionality is needed then
>>>>>> a new flag can be added to it. It supports both GET and SET operations.
>>>>>>
>>>>>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>>>>>> ---
>>>>>>  include/uapi/linux/vfio.h | 55 +++++++++++++++++++++++++++++++++++++++
>>>>>>  1 file changed, 55 insertions(+)
>>>>>>
>>>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>>> index 733a1cddde30..7e00de5c21ea 100644
>>>>>> --- a/include/uapi/linux/vfio.h
>>>>>> +++ b/include/uapi/linux/vfio.h
>>>>>> @@ -986,6 +986,61 @@ enum vfio_device_mig_state {
>>>>>>  	VFIO_DEVICE_STATE_RUNNING_P2P = 5,
>>>>>>  };
>>>>>>  
>>>>>> +/*
>>>>>> + * Perform power management-related operations for the VFIO device.
>>>>>> + *
>>>>>> + * The low power feature uses platform-based power management to move the
>>>>>> + * device into the low power state.  This low power state is device-specific.
>>>>>> + *
>>>>>> + * This device feature uses flags to specify the different operations.
>>>>>> + * It supports both the GET and SET operations.
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_ENTER flag moves the VFIO device into the low power
>>>>>> + *   state with platform-based power management.  This low power state will be
>>>>>> + *   internal to the VFIO driver and the user will not come to know which power
>>>>>> + *   state is chosen.  Once the user has moved the VFIO device into the low
>>>>>> + *   power state, then the user should not do any device access without moving
>>>>>> + *   the device out of the low power state.    
>>>>>
>>>>> Except we're wrapping device accesses to make this possible.  This
>>>>> should probably describe how any discrete access will wake the device
>>>>> but ongoing access through mmaps will generate user faults.
>>>>>     
>>>>
>>>>  Sure. I will add that details also.
>>>>  
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_EXIT flag moves the VFIO device out of the low power
>>>>>> + *    state.  This flag should only be set if the user has previously put the
>>>>>> + *    device into low power state with the VFIO_PM_LOW_POWER_ENTER flag.    
>>>>>
>>>>> Indenting.
>>>>>     
>>>>  
>>>>  I will fix this.
>>>>  
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_ENTER and VFIO_PM_LOW_POWER_EXIT are mutually exclusive.
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_REENTERY_DISABLE flag is only valid with
>>>>>> + *   VFIO_PM_LOW_POWER_ENTER.  If there is any access for the VFIO device on
>>>>>> + *   the host side, then the device will be moved out of the low power state
>>>>>> + *   without the user's guest driver involvement.  Some devices require the
>>>>>> + *   user's guest driver involvement for each low-power entry.  If this flag is
>>>>>> + *   set, then the re-entry to the low power state will be disabled, and the
>>>>>> + *   host kernel will not move the device again into the low power state.
>>>>>> + *   The VFIO driver internally maintains a list of devices for which low
>>>>>> + *   power re-entry is disabled by default and for those devices, the
>>>>>> + *   re-entry will be disabled even if the user has not set this flag
>>>>>> + *   explicitly.    
>>>>>
>>>>> Wrong polarity.  The kernel should not maintain the policy.  By default
>>>>> every wakeup, whether from host kernel accesses or via user accesses
>>>>> that do a pm-get should signal a wakeup to userspace.  Userspace needs
>>>>> to opt-out of that wakeup to let the kernel automatically re-enter low
>>>>> power and userspace needs to maintain the policy for which devices it
>>>>> wants that to occur.
>>>>>     
>>>>  
>>>>  Okay. So that means, in the kernel side, we don’t have to maintain
>>>>  the list which currently contains NVIDIA device ID. Also, in our
>>>>  updated approach, this opt-out of that wake-up means that user
>>>>  has not provided eventfd in the feature SET ioctl. Correct ?  
>>>
>>> Yes, I'm imagining that if the user hasn't provided a one-shot wake-up
>>> eventfd, that's the opt-out for being notified of device wakes.  For
>>> example, pm-resume would have something like:
>>>
>>> 	
>>> 	if (vdev->pm_wake_eventfd) {
>>> 		eventfd_signal(vdev->pm_wake_eventfd, 1);
>>> 		vdev->pm_wake_eventfd = NULL;
>>> 		pm_runtime_get_noresume(dev);
>>> 	}
>>>
>>> (eventfd pseudo handling substantially simplified)
>>>
>>> So w/o a wake-up eventfd, the user would need to call the pm feature
>>> exit ioctl to elevate the pm reference to prevent it going back to low
>>> power.  The pm feature exit ioctl would be optional if a wake eventfd is
>>> provided, so some piece of the eventfd context would need to remain to
>>> determine whether a pm-get is necessary.
>>>   
>>>>>> + *
>>>>>> + * For the IOCTL call with VFIO_DEVICE_FEATURE_GET:
>>>>>> + *
>>>>>> + * - VFIO_PM_LOW_POWER_ENTER will be set if the user has put the device into
>>>>>> + *   the low power state, otherwise, VFIO_PM_LOW_POWER_EXIT will be set.
>>>>>> + *
>>>>>> + * - If the device is in a normal power state currently, then
>>>>>> + *   VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set for the devices where low
>>>>>> + *   power re-entry is disabled by default.  If the device is in the low power
>>>>>> + *   state currently, then VFIO_PM_LOW_POWER_REENTERY_DISABLE will be set
>>>>>> + *   according to the current transition.    
>>>>>
>>>>> Very confusing semantics.
>>>>>
>>>>> What if the feature SET ioctl took an eventfd and that eventfd was one
>>>>> time use.  Calling the ioctl would setup the eventfd to notify the user
>>>>> on wakeup and call pm-put.  Any access to the device via host, ioctl,
>>>>> or region would be wrapped in pm-get/put and the pm-resume handler
>>>>> would perform the matching pm-get to balance the feature SET and signal
>>>>> the eventfd.     
>>>>
>>>>  This seems a better option. It will help in making the ioctl simpler
>>>>  and we don’t have to add a separate index for PME which I added in
>>>>  patch 6. 
>>>>  
>>>>> If the user opts-out by not providing a wakeup eventfd,
>>>>> then the pm-resume handler does not perform a pm-get. Possibly we
>>>>> could even allow mmap access if a wake-up eventfd is provided.    
>>>>
>>>>  Sorry. I am not clear on this mmap part. We currently invalidates
>>>>  mapping before going into runtime-suspend. Now, if use tries do
>>>>  mmap then do we need some extra handling in the fault handler ?
>>>>  Need your help in understanding this part.  
>>>
>>> The option that I'm thinking about is if the mmap fault handler is
>>> wrapped in a pm-get/put then we could actually populate the mmap.  In
>>> the case where the pm-get triggers the wake-eventfd in pm-resume, the
>>> device doesn't return to low power when the mmap fault handler calls
>>> pm-put.  This possibly allows that we could actually invalidate mmaps on
>>> pm-suspend rather than in the pm feature enter ioctl, essentially the
>>> same as we're doing for intx.  I wonder though if this allows the
>>> possibility that we just bounce between mmap fault and pm-suspend.  So
>>> long as some work can be done, for instance the pm-suspend occurs
>>> asynchronously to the pm-put, this might be ok.
>>>   
>>
>>  We can do this. But in the normal use case, the situation should
>>  never arise where user should access any mmaped region when user has
>>  already put the device into D3 (D3hot or D3cold). This can only happen
>>  if there is some bug in the guest driver or user is doing wrong
>>  sequence. Do we need to add handling to officially support this part ?
> 
> We cannot rely on userspace drivers to be bug free or non-malicious,
> but if we want to impose that an mmap access while low power is
> enabled always triggers a fault, that's ok.
> 
>>  pm-get can take more than a second for resume for some devices and
>>  will doing this in fault handler be safe ?
>>
>>  Also, we will add this support only when wake-eventfd is provided so
>>  still w/o wake-eventfd case, the mmap access will still generate fault.
>>  So, we will have different behavior. Will that be acceptable ?
> 
> Let's keep it simple, generate a fault for all cases.
> 

 Thanks Alex for confirmation.

>>>>> The
>>>>> feature GET ioctl would be used to exit low power behavior and would be
>>>>> a no-op if the wakeup eventfd had already been signaled.  Thanks,
>>>>>    
>>>>  
>>>>  I will use the GET ioctl for low power exit instead of returning the
>>>>  current status.  
>>>
>>> Note that Yishai is proposing a device DMA dirty logging feature where
>>> the stop and start are exposed via SET on separate features, rather
>>> than SET/GET.  We should probably maintain some consistency between
>>> these use cases.  Possibly we might even want two separate pm enter
>>> ioctls, one with the wake eventfd and one without.  I think this is the
>>> sort of thing Jason is describing for future expansion of the dirty
>>> tracking uAPI.  Thanks,
>>>
>>> Alex
>>>   
>>
>>  Okay. So, we need to add 3 device features in total.
>>
>>  VFIO_DEVICE_FEATURE_PM_ENTRY
>>  VFIO_DEVICE_FEATURE_PM_ENTRY_WITH_WAKEUP
>>  VFIO_DEVICE_FEATURE_PM_EXIT
>>
>>  And only the second one need structure which will have only one field
>>  for eventfd and we need to return error if wakeup-eventfd is not
>>  provided in the second feature ?
> 
> Yes, we'd use eventfd_ctx and fail on a bad fileget.
> 
>>  Do we need to support GET operation also for these ?
>>  We can skip GET operation since that won’t be very useful.
> 
> What would they do?  Thanks,
> 
> Alex
> 

 If we implement GET operation then it can return the
 current status. For example, for VFIO_DEVICE_FEATURE_PM_ENTRY
 can return the information whether user has put the device into
 low power previously. But this information is not much useful as such
 and it requires to add a structure where this information needs to
 be filled. Also, the GET will again cause the device wake-up.
 So, for these device features, we can support only SET operation.

 I checked the Yishai DMA logging patches and there start
 and stop seems to be supporting only SET operation and there is
 separate feature which supports only GET operation.

 Regards,
 Abhishek