linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Abhishek Sahu <abhsahu@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: kvm@vger.kernel.org, Cornelia Huck <cohuck@redhat.com>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	Zhen Lei <thunder.leizhen@huawei.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2 3/5] vfio/pci: fix memory leak during D3hot to D0 tranistion
Date: Mon, 31 Jan 2022 17:04:12 +0530	[thread overview]
Message-ID: <b0f25525-362a-c2dc-f255-22fa533fda26@nvidia.com> (raw)
In-Reply-To: <20220127170525.51043f23.alex.williamson@redhat.com>

On 1/28/2022 5:35 AM, Alex Williamson wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Mon, 24 Jan 2022 23:47:24 +0530
> Abhishek Sahu <abhsahu@nvidia.com> wrote:
> 
>> If needs_pm_restore is set (PCI device does not have support for no
>> soft reset), then the current PCI state will be saved during D0->D3hot
>> transition and same will be restored back during D3hot->D0 transition.
>> For saving the PCI state locally, pci_store_saved_state() is being
>> used and the pci_load_and_free_saved_state() will free the allocated
>> memory.
>>
>> But for reset related IOCTLs, vfio driver calls PCI reset related
>> API's which will internally change the PCI power state back to D0. So,
>> when the guest resumes, then it will get the current state as D0 and it
>> will skip the call to vfio_pci_set_power_state() for changing the
>> power state to D0 explicitly. In this case, the memory pointed by
>> pm_save will never be freed.
>>
>> Also, in malicious sequence, the state changing to D3hot followed by
>> VFIO_DEVICE_RESET/VFIO_DEVICE_PCI_HOT_RESET can be run in loop and
>> it can cause an OOM situation. This patch stores the power state locally
>> and uses the same for comparing the current power state. For the
>> places where D0 transition can happen, call vfio_pci_set_power_state()
>> to transition to D0 state. Since the vfio power state is still D3hot,
>> so this D0 transition will help in running the logic required
>> from D3hot->D0 transition. Also, to prevent any miss during
>> future development to detect this condition, this patch puts a
>> check and frees the memory after printing warning.
>>
>> This locally saved power state will help in subsequent patches
>> also.
> 
> Ideally let's put fixes patches at the start of the series, or better
> yet send them separately, and don't include changes that only make
> sense in the context of a subsequent patch.
> 
> Fixes: 51ef3a004b1e ("vfio/pci: Restore device state on PM transition")
> 

 Thanks Alex for reviewing this patch.
 I have added Fixes tag and sent this patch separately.

 Should I update this patch series or you are planning to review the
 other patches first of this patch series first. 

>> Signed-off-by: Abhishek Sahu <abhsahu@nvidia.com>
>> ---
>>  drivers/vfio/pci/vfio_pci_core.c | 53 ++++++++++++++++++++++++++++++--
>>  include/linux/vfio_pci_core.h    |  1 +
>>  2 files changed, 51 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
>> index c6e4fe9088c3..ee2fb8af57fa 100644
>> --- a/drivers/vfio/pci/vfio_pci_core.c
>> +++ b/drivers/vfio/pci/vfio_pci_core.c
>> @@ -206,6 +206,14 @@ static void vfio_pci_probe_power_state(struct vfio_pci_core_device *vdev)
>>   * restore when returned to D0.  Saved separately from pci_saved_state for use
>>   * by PM capability emulation and separately from pci_dev internal saved state
>>   * to avoid it being overwritten and consumed around other resets.
>> + *
>> + * There are few cases where the PCI power state can be changed to D0
>> + * without the involvement of this API. So, cache the power state locally
>> + * and call this API to update the D0 state. It will help in running the
>> + * logic that is needed for transitioning to the D0 state. For example,
>> + * if needs_pm_restore is set, then the PCI state will be saved locally.
>> + * The memory taken for saving this PCI state needs to be freed to
>> + * prevent memory leak.
>>   */
>>  int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t state)
>>  {
>> @@ -214,20 +222,34 @@ int vfio_pci_set_power_state(struct vfio_pci_core_device *vdev, pci_power_t stat
>>       int ret;
>>
>>       if (vdev->needs_pm_restore) {
>> -             if (pdev->current_state < PCI_D3hot && state >= PCI_D3hot) {
>> +             if (vdev->power_state < PCI_D3hot && state >= PCI_D3hot) {
>>                       pci_save_state(pdev);
>>                       needs_save = true;
>>               }
>>
>> -             if (pdev->current_state >= PCI_D3hot && state <= PCI_D0)
>> +             if (vdev->power_state >= PCI_D3hot && state <= PCI_D0)
>>                       needs_restore = true;
>>       }
>>
>>       ret = pci_set_power_state(pdev, state);
>>
>>       if (!ret) {
>> +             vdev->power_state = pdev->current_state;
>> +
>>               /* D3 might be unsupported via quirk, skip unless in D3 */
>> -             if (needs_save && pdev->current_state >= PCI_D3hot) {
>> +             if (needs_save && vdev->power_state >= PCI_D3hot) {
>> +                     /*
>> +                      * If somehow, the vfio driver was not able to free the
>> +                      * memory allocated in pm_save, then free the earlier
>> +                      * memory first before overwriting pm_save to prevent
>> +                      * memory leak.
>> +                      */
>> +                     if (vdev->pm_save) {
>> +                             pci_warn(pdev,
>> +                                      "Overwriting saved PCI state pointer so freeing the earlier memory\n");
>> +                             kfree(vdev->pm_save);
>> +                     }
> 
> The minimal fix for the described issue would simply be:
> 
>                         kfree(vdev->pm_save);
> 
> It seems like the only purpose of the warning is try to make sure we
> haven't missed any wake-up calls, where this would be a pretty small
> breadcrumb to actually debug such an issue.
> 

 I have removed the warning in the updated patch.

>> +
>>                       vdev->pm_save = pci_store_saved_state(pdev);
>>               } else if (needs_restore) {
>>                       pci_load_and_free_saved_state(pdev, &vdev->pm_save);
>> @@ -326,6 +348,14 @@ void vfio_pci_core_disable(struct vfio_pci_core_device *vdev)
>>       /* For needs_reset */
>>       lockdep_assert_held(&vdev->vdev.dev_set->lock);
>>
>> +     /*
>> +      * If disable has been called while the power state is other than D0,
>> +      * then set the power state in vfio driver to D0. It will help
>> +      * in running the logic needed for D0 power state. The subsequent
>> +      * runtime PM API's will put the device into the low power state again.
>> +      */
>> +     vfio_pci_set_power_state(vdev, PCI_D0);
>> +
> 
> I do think we have an issue here, but the reason is that pci_pm_reset()
> returns -EINVAL if we try to reset a device that isn't currently in D0.
> Therefore any path where we're triggering a function reset that could
> use a PM reset and we don't know if the device is in D0, should wake up
> the device before we try that reset.
> 
> We're about to load the initial state of the device that was saved when
> it was opened, so I don't think pdev->current_state vs
> vdev->power_state matters here, we only care that the device is in D0.
> 

 I have added this point in the commit message of the updated patch.

>>       /* Stop the device from further DMA */
>>       pci_clear_master(pdev);
>>
>> @@ -929,6 +959,15 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
>>
>>               vfio_pci_zap_and_down_write_memory_lock(vdev);
>>               ret = pci_try_reset_function(vdev->pdev);
>> +
>> +             /*
>> +              * If pci_try_reset_function() has been called while the power
>> +              * state is other than D0, then pci_try_reset_function() will
>> +              * internally set the device state to D0 without vfio driver
>> +              * interaction. Update the power state in vfio driver to perform
>> +              * the logic needed for D0 power state.
>> +              */
>> +             vfio_pci_set_power_state(vdev, PCI_D0);
> 
> For the case where pci_try_reset_function() might use a PM reset, we
> should set D0 before that call.  In doing so, the pdev->current_state
> should match the actual device power state, so we still don't need to
> stash power state on the vdev.
> 

 I have set D0 state before calling pci_try_reset_function() in
 the updated patch.

>>               up_write(&vdev->memory_lock);
>>
>>               return ret;
>> @@ -2071,6 +2110,14 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
>>
>>  err_undo:
>>       list_for_each_entry(cur, &dev_set->device_list, vdev.dev_set_list) {
>> +             /*
>> +              * If pci_reset_bus() has been called while the power
>> +              * state is other than D0, then pci_reset_bus() will
>> +              * internally set the device state to D0 without vfio driver
>> +              * interaction. Update the power state in vfio driver to perform
>> +              * the logic needed for D0 power state.
>> +              */
>> +             vfio_pci_set_power_state(cur, PCI_D0);
> 
> Here pci_reset_bus() will wakeup the device and I think the concern is
> that around that bus reset we'll save and restore the device state, but
> that's potentially bogus device state if waking it triggers a soft
> reset.  We could again wake devices before the reset to make the state
> correct, or we could test pm_save and perform the load and restore if
> it exists.  Either of those would avoid needing to cache the power
> state on the vdev.  Thanks,
> 

 I have made the changes to wake-up the devices.

 Thanks
 Abhishek

> Alex
> 
>>               if (cur == cur_mem)
>>                       is_mem = false;
>>               if (cur == cur_vma)
>> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
>> index aafe09c9fa64..05db838e72cc 100644
>> --- a/include/linux/vfio_pci_core.h
>> +++ b/include/linux/vfio_pci_core.h
>> @@ -124,6 +124,7 @@ struct vfio_pci_core_device {
>>       bool                    needs_reset;
>>       bool                    nointx;
>>       bool                    needs_pm_restore;
>> +     pci_power_t             power_state;
>>       struct pci_saved_state  *pci_saved_state;
>>       struct pci_saved_state  *pm_save;
>>       int                     ioeventfds_nr;
> 


  reply	other threads:[~2022-01-31 11:47 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-24 18:17 [RFC PATCH v2 0/5] vfio/pci: Enable runtime power management support Abhishek Sahu
2022-01-24 18:17 ` [RFC PATCH v2 1/5] vfio/pci: register vfio-pci driver with runtime PM framework Abhishek Sahu
2022-02-16 23:48   ` Alex Williamson
2022-02-21  6:35     ` Abhishek Sahu
2022-01-24 18:17 ` [RFC PATCH v2 2/5] vfio/pci: virtualize PME related registers bits and initialize to zero Abhishek Sahu
2022-01-24 18:17 ` [RFC PATCH v2 3/5] vfio/pci: fix memory leak during D3hot to D0 tranistion Abhishek Sahu
2022-01-28  0:05   ` Alex Williamson
2022-01-31 11:34     ` Abhishek Sahu [this message]
2022-01-31 15:33       ` Alex Williamson
2022-01-24 18:17 ` [RFC PATCH v2 4/5] vfio/pci: Invalidate mmaps and block the access in D3hot power state Abhishek Sahu
2022-02-17 23:14   ` Alex Williamson
2022-02-21  8:12     ` Abhishek Sahu
2022-01-24 18:17 ` [RFC PATCH v2 5/5] vfio/pci: add the support for PCI D3cold state Abhishek Sahu
2022-03-09 17:26   ` Alex Williamson
2022-03-11 15:45     ` Abhishek Sahu
2022-03-11 23:06       ` Alex Williamson
2022-03-16  5:41         ` Abhishek Sahu
2022-03-16 18:44           ` Alex Williamson
2022-03-24 14:27             ` Abhishek Sahu
2022-03-11 16:17     ` Jason Gunthorpe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b0f25525-362a-c2dc-f255-22fa533fda26@nvidia.com \
    --to=abhsahu@nvidia.com \
    --cc=alex.williamson@redhat.com \
    --cc=cohuck@redhat.com \
    --cc=jgg@nvidia.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgurtovoy@nvidia.com \
    --cc=thunder.leizhen@huawei.com \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).