From: Abhishek Sahu <abhsahu@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Cornelia Huck <cohuck@redhat.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>,
	Kevin Tian <kevin.tian@intel.com>,
	"Rafael J . Wysocki" <rafael@kernel.org>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
	linux-pm@vger.kernel.org, linux-pci@vger.kernel.org
Subject: Re: [PATCH v3 8/8] vfio/pci: Add the support for PCI D3cold state
Date: Mon, 30 May 2022 16:45:59 +0530	[thread overview]
Message-ID: <42518bd5-da8b-554f-2612-80278b527bf5@nvidia.com> (raw)
In-Reply-To: <68463d9b-98ee-b9ec-1a3e-1375e50a2ad2@nvidia.com>

On 5/10/2022 6:56 PM, Abhishek Sahu wrote:
> On 5/10/2022 3:18 AM, Alex Williamson wrote:
>> On Thu, 5 May 2022 17:46:20 +0530
>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>
>>> On 5/5/2022 1:15 AM, Alex Williamson wrote:
>>>> On Mon, 25 Apr 2022 14:56:15 +0530
>>>> Abhishek Sahu <abhsahu@nvidia.com> wrote:
>>>>

<snip>

>>>>> diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
>>>>> index af0ae80ef324..65b1bc9586ab 100644
>>>>> --- a/drivers/vfio/pci/vfio_pci_config.c
>>>>> +++ b/drivers/vfio/pci/vfio_pci_config.c
>>>>> @@ -25,6 +25,7 @@
>>>>>  #include <linux/uaccess.h>
>>>>>  #include <linux/vfio.h>
>>>>>  #include <linux/slab.h>
>>>>> +#include <linux/pm_runtime.h>
>>>>>  
>>>>>  #include <linux/vfio_pci_core.h>
>>>>>  
>>>>> @@ -1936,16 +1937,23 @@ static ssize_t vfio_config_do_rw(struct vfio_pci_core_device *vdev, char __user
>>>>>  ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf,
>>>>>  			   size_t count, loff_t *ppos, bool iswrite)
>>>>>  {
>>>>> +	struct device *dev = &vdev->pdev->dev;
>>>>>  	size_t done = 0;
>>>>>  	int ret = 0;
>>>>>  	loff_t pos = *ppos;
>>>>>  
>>>>>  	pos &= VFIO_PCI_OFFSET_MASK;
>>>>>  
>>>>> +	ret = pm_runtime_resume_and_get(dev);
>>>>> +	if (ret < 0)
>>>>> +		return ret;  
>>>>
>>>> Alternatively we could just check platform_pm_engaged here and return
>>>> -EINVAL, right?  Why is waking the device the better option?
>>>>   
>>>
>>>  This is mainly to prevent a race condition where a config space
>>>  access happens in parallel with an IOCTL access. So, let's
>>>  consider the following case.
>>>
>>>  1. A config space access happens and vfio_pci_config_rw() is called.
>>>  2. The IOCTL to move into the low power state is called.
>>>  3. The IOCTL moves the device into D3cold.
>>>  4. vfio_pci_config_rw() exits.
>>>
>>>  Now, if we just check platform_pm_engaged, the above sequence
>>>  won't be caught. I checked this parallel access by writing
>>>  a small program that opens 2 instances and then creates
>>>  2 threads, one for config space and one for the IOCTL.
>>>  In my case, I got the above sequence.
>>>
>>>  pm_runtime_resume_and_get() makes sure that the device
>>>  usage count stays incremented throughout the config space
>>>  access (or the IOCTL access in the previous patch), so the
>>>  runtime PM framework will not move the device into the
>>>  suspended state.
>>
>> I think we're inventing problems here.  If we define that config space
>> is not accessible while the device is in low power and the only way to
>> get the device out of low power is via ioctl, then we should be denying
>> access to the device while in low power.  If the user races exiting the
>> device from low power and a config space access, that's their problem.
>>
> 
>  But what about a malicious user who intentionally tries to create
>  this sequence? If the platform_pm_engaged check passes and the
>  user then puts the device into the low power state, there is a
>  chance that a config read happens while the device is in the low
>  power state.
> 

 Hi Alex,

 I need help in concluding the part below so that I can proceed
 further with my implementation.
 
>  Can we prevent this concurrent access somehow or make sure
>  that nothing else is running when the low power ioctl runs?
> 

 If I add the 'platform_pm_engaged' check alone and return early:
 
 vfio_pci_config_rw()
 {
 ...
     down_read(&vdev->memory_lock);
     if (vdev->platform_pm_engaged) {
         up_read(&vdev->memory_lock);
         return -EIO;
     }
 ...
 }
 
 Then, from the user side, if two threads are running, there is a
 chance that 'platform_pm_engaged' is false at the time of the check
 but becomes true before this function returns. If the runtime PM
 framework then puts the device into the D3cold state, a config
 read/write can happen internally while the device is in D3cold. I
 added prints locally at the entry and exit of this function: with 2
 threads created from user space, 'platform_pm_engaged' comes out
 false at entry but true at exit. This is similar to the memory
 access issue on disabled memory.
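
 Roughly, the problematic interleaving looks like this (a sketch of
 the sequence I observed with the prints, not an exact trace):

     Thread A (config access)        Thread B (low power ioctl)
     ------------------------        --------------------------
     down_read(&memory_lock)
     platform_pm_engaged == false
     up_read(&memory_lock)
                                     down_write(&memory_lock)
                                     platform_pm_engaged = true
                                     up_write(&memory_lock)
                                     ... device enters D3cold ...
     vfio_config_do_rw()
       (config access while the
        device is in D3cold)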
 
 So, we need to make sure that the VFIO_DEVICE_FEATURE_POWER_MANAGEMENT
 ioctl request is exclusive and that no other config or ioctl
 request runs in parallel with it.
 
 Could you or someone else please suggest a way to handle this case?
 
 From my side, I have the following solution to handle this, but I am
 not sure whether it will be acceptable and work for all cases.
 
 1. In a real use case, a config access or any other ioctl should not
    come along with the VFIO_DEVICE_FEATURE_POWER_MANAGEMENT ioctl
    request.
 
 2. Maintain an 'access_count' which will be incremented when we
    do any config space access or ioctl.
 
 3. At the beginning of a config space access or ioctl, we can do
    something like this:

         down_read(&vdev->memory_lock);
         atomic_inc(&vdev->access_count);
         if (vdev->platform_pm_engaged) {
                 atomic_dec(&vdev->access_count);
                 up_read(&vdev->memory_lock);
                 return -EIO;
         }
         up_read(&vdev->memory_lock);
 
     And before returning, we can decrement the 'access_count':
 
         down_read(&vdev->memory_lock);
         atomic_dec(&vdev->access_count);
         up_read(&vdev->memory_lock);

     The atomic_dec() is put under 'memory_lock' to maintain the
     lock ordering rules on architectures where atomic_t is
     internally implemented using locks.
 
 4. Inside vfio_pci_core_feature_pm(), we can do something like this:
         down_write(&vdev->memory_lock);
         if (atomic_read(&vdev->access_count) != 1) {
                 up_write(&vdev->memory_lock);
                 return -EBUSY;
         }
         vdev->platform_pm_engaged = true;
         up_write(&vdev->memory_lock);
 
 
 5. The idea here is to check the 'access_count' in
    vfio_pci_core_feature_pm(). If 'access_count' is greater than 1,
    that means some other ioctl or config space access is in
    progress, and we return early. Otherwise, we set
    'platform_pm_engaged' and release the lock.
 
 6. In the case of a race, if vfio_pci_core_feature_pm() gets the
    lock and finds 'access_count' equal to 1, then it sets
    'platform_pm_engaged'. The config space access or ioctl will then
    see 'platform_pm_engaged' as true and return early.
 
    If the config space access or ioctl happens first, then
    'platform_pm_engaged' will be false and the request will
    succeed. But 'access_count' will stay incremented until that
    access completes, so vfio_pci_core_feature_pm() will see
    'access_count' as 2 and return -EBUSY.
 
 7. For ioctl access, I need to add two callback functions (one
    for start and one for end) in struct vfio_device_ops and call
    them at the start and end of the ioctl from
    vfio_device_fops_unl_ioctl(), roughly as sketched below.
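
    A rough sketch of this (the access_start/access_end callback
    names are made up here just to illustrate the idea, and the
    existing ioctl dispatch is simplified):

         struct vfio_device_ops {
                 ...
                 int     (*access_start)(struct vfio_device *vdev);
                 void    (*access_end)(struct vfio_device *vdev);
         };

         static long vfio_device_fops_unl_ioctl(struct file *filep,
                                                unsigned int cmd,
                                                unsigned long arg)
         {
                 struct vfio_device *device = filep->private_data;
                 long ret;

                 /*
                  * Let the driver account for the access (e.g. take the
                  * 'access_count' from point 3) before dispatching; fail
                  * if the device is already in the low power state.
                  */
                 if (device->ops->access_start) {
                         ret = device->ops->access_start(device);
                         if (ret)
                                 return ret;
                 }

                 ret = device->ops->ioctl(device, cmd, arg);

                 /* Drop the 'access_count' taken at entry. */
                 if (device->ops->access_end)
                         device->ops->access_end(device);

                 return ret;
         }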
 
 Another option would be to add one more lock like 'memory_lock' and
 hold it throughout the config and ioctl access, but maintaining
 two locks won't be easy since 'memory_lock' is already being
 used inside the config and ioctl paths.

 Thanks,
 Abhishek

