All of lore.kernel.org
 help / color / mirror / Atom feed
From: si-wei liu <si-wei.liu@oracle.com>
To: Jason Wang <jasowang@redhat.com>, mst@redhat.com, lingshan.zhu@intel.com
Cc: joao.m.martins@oracle.com, boris.ostrovsky@oracle.com,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org
Subject: Re: [PATCH v3 2/2] vhost-vdpa: fix page pinning leakage in error path
Date: Tue, 13 Oct 2020 16:42:59 -0700	[thread overview]
Message-ID: <5F863B83.6030204@oracle.com> (raw)
In-Reply-To: <574a64e3-8873-0639-fe32-248cb99204bc@redhat.com>


On 10/9/2020 7:27 PM, Jason Wang wrote:
>
> On 2020/10/3 下午1:02, Si-Wei Liu wrote:
>> Pinned pages are not properly accounted particularly when
>> mapping error occurs on IOTLB update. Clean up dangling
>> pinned pages for the error path. As the inflight pinned
>> pages, specifically for memory region that strides across
>> multiple chunks, would need more than one free page for
>> book keeping and accounting. For simplicity, pin pages
>> for all memory in the IOVA range in one go rather than
>> have multiple pin_user_pages calls to make up the entire
>> region. This way it's easier to track and account the
>> pages already mapped, particularly for clean-up in the
>> error path.
>>
>> Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend")
>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> ---
>> Changes in v3:
>> - Factor out vhost_vdpa_map() change to a separate patch
>>
>> Changes in v2:
>> - Fix incorrect target SHA1 referenced
>>
>>   drivers/vhost/vdpa.c | 119 
>> ++++++++++++++++++++++++++++++---------------------
>>   1 file changed, 71 insertions(+), 48 deletions(-)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index 0f27919..dad41dae 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -595,21 +595,19 @@ static int 
>> vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>>       struct vhost_dev *dev = &v->vdev;
>>       struct vhost_iotlb *iotlb = dev->iotlb;
>>       struct page **page_list;
>> -    unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
>> +    struct vm_area_struct **vmas;
>>       unsigned int gup_flags = FOLL_LONGTERM;
>> -    unsigned long npages, cur_base, map_pfn, last_pfn = 0;
>> -    unsigned long locked, lock_limit, pinned, i;
>> +    unsigned long map_pfn, last_pfn = 0;
>> +    unsigned long npages, lock_limit;
>> +    unsigned long i, nmap = 0;
>>       u64 iova = msg->iova;
>> +    long pinned;
>>       int ret = 0;
>>         if (vhost_iotlb_itree_first(iotlb, msg->iova,
>>                       msg->iova + msg->size - 1))
>>           return -EEXIST;
>>   -    page_list = (struct page **) __get_free_page(GFP_KERNEL);
>> -    if (!page_list)
>> -        return -ENOMEM;
>> -
>>       if (msg->perm & VHOST_ACCESS_WO)
>>           gup_flags |= FOLL_WRITE;
>>   @@ -617,61 +615,86 @@ static int 
>> vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>>       if (!npages)
>>           return -EINVAL;
>>   +    page_list = kvmalloc_array(npages, sizeof(struct page *), 
>> GFP_KERNEL);
>> +    vmas = kvmalloc_array(npages, sizeof(struct vm_area_struct *),
>> +                  GFP_KERNEL);
>
>
> This will result high order memory allocation which was what the code 
> tried to avoid originally.
>
> Using an unlimited size will cause a lot of side effects consider VM 
> or userspace may try to pin several TB of memory.
Hmmm, that's a good point. Indeed, if the guest memory demand is huge or 
the host system is running short of free pages, kvmalloc will be 
problematic and less efficient than the __get_free_page implementation.

>
>
>> +    if (!page_list || !vmas) {
>> +        ret = -ENOMEM;
>> +        goto free;
>> +    }
>
>
> Any reason that you want to use vmas?
Without providing custom vmas, it's subject to high order allocation 
failure. While page_list and vmas can now fallback to virtual memory 
allocation if need be.

>
>
>> +
>>       mmap_read_lock(dev->mm);
>>   -    locked = atomic64_add_return(npages, &dev->mm->pinned_vm);
>>       lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> -
>> -    if (locked > lock_limit) {
>> +    if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
>>           ret = -ENOMEM;
>> -        goto out;
>> +        goto unlock;
>>       }
>>   -    cur_base = msg->uaddr & PAGE_MASK;
>> -    iova &= PAGE_MASK;
>> +    pinned = pin_user_pages(msg->uaddr & PAGE_MASK, npages, gup_flags,
>> +                page_list, vmas);
>> +    if (npages != pinned) {
>> +        if (pinned < 0) {
>> +            ret = pinned;
>> +        } else {
>> +            unpin_user_pages(page_list, pinned);
>> +            ret = -ENOMEM;
>> +        }
>> +        goto unlock;
>> +    }
>>   -    while (npages) {
>> -        pinned = min_t(unsigned long, npages, list_size);
>> -        ret = pin_user_pages(cur_base, pinned,
>> -                     gup_flags, page_list, NULL);
>> -        if (ret != pinned)
>> -            goto out;
>> -
>> -        if (!last_pfn)
>> -            map_pfn = page_to_pfn(page_list[0]);
>> -
>> -        for (i = 0; i < ret; i++) {
>> -            unsigned long this_pfn = page_to_pfn(page_list[i]);
>> -            u64 csize;
>> -
>> -            if (last_pfn && (this_pfn != last_pfn + 1)) {
>> -                /* Pin a contiguous chunk of memory */
>> -                csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
>> -                if (vhost_vdpa_map(v, iova, csize,
>> -                           map_pfn << PAGE_SHIFT,
>> -                           msg->perm))
>> -                    goto out;
>> -                map_pfn = this_pfn;
>> -                iova += csize;
>> +    iova &= PAGE_MASK;
>> +    map_pfn = page_to_pfn(page_list[0]);
>> +
>> +    /* One more iteration to avoid extra vdpa_map() call out of 
>> loop. */
>> +    for (i = 0; i <= npages; i++) {
>> +        unsigned long this_pfn;
>> +        u64 csize;
>> +
>> +        /* The last chunk may have no valid PFN next to it */
>> +        this_pfn = i < npages ? page_to_pfn(page_list[i]) : -1UL;
>> +
>> +        if (last_pfn && (this_pfn == -1UL ||
>> +                 this_pfn != last_pfn + 1)) {
>> +            /* Pin a contiguous chunk of memory */
>> +            csize = last_pfn - map_pfn + 1;
>> +            ret = vhost_vdpa_map(v, iova, csize << PAGE_SHIFT,
>> +                         map_pfn << PAGE_SHIFT,
>> +                         msg->perm);
>> +            if (ret) {
>> +                /*
>> +                 * Unpin the rest chunks of memory on the
>> +                 * flight with no corresponding vdpa_map()
>> +                 * calls having been made yet. On the other
>> +                 * hand, vdpa_unmap() in the failure path
>> +                 * is in charge of accounting the number of
>> +                 * pinned pages for its own.
>> +                 * This asymmetrical pattern of accounting
>> +                 * is for efficiency to pin all pages at
>> +                 * once, while there is no other callsite
>> +                 * of vdpa_map() than here above.
>> +                 */
>> +                unpin_user_pages(&page_list[nmap],
>> +                         npages - nmap);
>> +                goto out;
>>               }
>> -
>> -            last_pfn = this_pfn;
>> +            atomic64_add(csize, &dev->mm->pinned_vm);
>> +            nmap += csize;
>> +            iova += csize << PAGE_SHIFT;
>> +            map_pfn = this_pfn;
>>           }
>> -
>> -        cur_base += ret << PAGE_SHIFT;
>> -        npages -= ret;
>> +        last_pfn = this_pfn;
>>       }
>
>
> So what I suggest is to fix the pinning leakage first and do the 
> possible optimization on top (which is still questionable to me).
OK. Unfortunately, this was picked and got merged in upstream. So I will 
post a follow up patch set to 1) revert the commit to the original 
__get_free_page() implementation, and 2) fix the accounting and leakage 
on top. Will it be fine?


-Siwei
>
> Thanks
>
>
>>   -    /* Pin the rest chunk */
>> -    ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << 
>> PAGE_SHIFT,
>> -                 map_pfn << PAGE_SHIFT, msg->perm);
>> +    WARN_ON(nmap != npages);
>>   out:
>> -    if (ret) {
>> +    if (ret)
>>           vhost_vdpa_unmap(v, msg->iova, msg->size);
>> -        atomic64_sub(npages, &dev->mm->pinned_vm);
>> -    }
>> +unlock:
>>       mmap_read_unlock(dev->mm);
>> -    free_page((unsigned long)page_list);
>> +free:
>> +    kvfree(vmas);
>> +    kvfree(page_list);
>>       return ret;
>>   }
>


WARNING: multiple messages have this Message-ID (diff)
From: si-wei liu <si-wei.liu@oracle.com>
To: Jason Wang <jasowang@redhat.com>, mst@redhat.com, lingshan.zhu@intel.com
Cc: netdev@vger.kernel.org,
	virtualization@lists.linux-foundation.org,
	boris.ostrovsky@oracle.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 2/2] vhost-vdpa: fix page pinning leakage in error path
Date: Tue, 13 Oct 2020 16:42:59 -0700	[thread overview]
Message-ID: <5F863B83.6030204@oracle.com> (raw)
In-Reply-To: <574a64e3-8873-0639-fe32-248cb99204bc@redhat.com>


On 10/9/2020 7:27 PM, Jason Wang wrote:
>
> On 2020/10/3 下午1:02, Si-Wei Liu wrote:
>> Pinned pages are not properly accounted particularly when
>> mapping error occurs on IOTLB update. Clean up dangling
>> pinned pages for the error path. As the inflight pinned
>> pages, specifically for memory region that strides across
>> multiple chunks, would need more than one free page for
>> book keeping and accounting. For simplicity, pin pages
>> for all memory in the IOVA range in one go rather than
>> have multiple pin_user_pages calls to make up the entire
>> region. This way it's easier to track and account the
>> pages already mapped, particularly for clean-up in the
>> error path.
>>
>> Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend")
>> Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>
>> ---
>> Changes in v3:
>> - Factor out vhost_vdpa_map() change to a separate patch
>>
>> Changes in v2:
>> - Fix incorrect target SHA1 referenced
>>
>>   drivers/vhost/vdpa.c | 119 
>> ++++++++++++++++++++++++++++++---------------------
>>   1 file changed, 71 insertions(+), 48 deletions(-)
>>
>> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
>> index 0f27919..dad41dae 100644
>> --- a/drivers/vhost/vdpa.c
>> +++ b/drivers/vhost/vdpa.c
>> @@ -595,21 +595,19 @@ static int 
>> vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>>       struct vhost_dev *dev = &v->vdev;
>>       struct vhost_iotlb *iotlb = dev->iotlb;
>>       struct page **page_list;
>> -    unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
>> +    struct vm_area_struct **vmas;
>>       unsigned int gup_flags = FOLL_LONGTERM;
>> -    unsigned long npages, cur_base, map_pfn, last_pfn = 0;
>> -    unsigned long locked, lock_limit, pinned, i;
>> +    unsigned long map_pfn, last_pfn = 0;
>> +    unsigned long npages, lock_limit;
>> +    unsigned long i, nmap = 0;
>>       u64 iova = msg->iova;
>> +    long pinned;
>>       int ret = 0;
>>         if (vhost_iotlb_itree_first(iotlb, msg->iova,
>>                       msg->iova + msg->size - 1))
>>           return -EEXIST;
>>   -    page_list = (struct page **) __get_free_page(GFP_KERNEL);
>> -    if (!page_list)
>> -        return -ENOMEM;
>> -
>>       if (msg->perm & VHOST_ACCESS_WO)
>>           gup_flags |= FOLL_WRITE;
>>   @@ -617,61 +615,86 @@ static int 
>> vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>>       if (!npages)
>>           return -EINVAL;
>>   +    page_list = kvmalloc_array(npages, sizeof(struct page *), 
>> GFP_KERNEL);
>> +    vmas = kvmalloc_array(npages, sizeof(struct vm_area_struct *),
>> +                  GFP_KERNEL);
>
>
> This will result high order memory allocation which was what the code 
> tried to avoid originally.
>
> Using an unlimited size will cause a lot of side effects consider VM 
> or userspace may try to pin several TB of memory.
Hmmm, that's a good point. Indeed, if the guest memory demand is huge or 
the host system is running short of free pages, kvmalloc will be 
problematic and less efficient than the __get_free_page implementation.

>
>
>> +    if (!page_list || !vmas) {
>> +        ret = -ENOMEM;
>> +        goto free;
>> +    }
>
>
> Any reason that you want to use vmas?
Without providing custom vmas, it's subject to high order allocation 
failure. While page_list and vmas can now fallback to virtual memory 
allocation if need be.

>
>
>> +
>>       mmap_read_lock(dev->mm);
>>   -    locked = atomic64_add_return(npages, &dev->mm->pinned_vm);
>>       lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> -
>> -    if (locked > lock_limit) {
>> +    if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
>>           ret = -ENOMEM;
>> -        goto out;
>> +        goto unlock;
>>       }
>>   -    cur_base = msg->uaddr & PAGE_MASK;
>> -    iova &= PAGE_MASK;
>> +    pinned = pin_user_pages(msg->uaddr & PAGE_MASK, npages, gup_flags,
>> +                page_list, vmas);
>> +    if (npages != pinned) {
>> +        if (pinned < 0) {
>> +            ret = pinned;
>> +        } else {
>> +            unpin_user_pages(page_list, pinned);
>> +            ret = -ENOMEM;
>> +        }
>> +        goto unlock;
>> +    }
>>   -    while (npages) {
>> -        pinned = min_t(unsigned long, npages, list_size);
>> -        ret = pin_user_pages(cur_base, pinned,
>> -                     gup_flags, page_list, NULL);
>> -        if (ret != pinned)
>> -            goto out;
>> -
>> -        if (!last_pfn)
>> -            map_pfn = page_to_pfn(page_list[0]);
>> -
>> -        for (i = 0; i < ret; i++) {
>> -            unsigned long this_pfn = page_to_pfn(page_list[i]);
>> -            u64 csize;
>> -
>> -            if (last_pfn && (this_pfn != last_pfn + 1)) {
>> -                /* Pin a contiguous chunk of memory */
>> -                csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
>> -                if (vhost_vdpa_map(v, iova, csize,
>> -                           map_pfn << PAGE_SHIFT,
>> -                           msg->perm))
>> -                    goto out;
>> -                map_pfn = this_pfn;
>> -                iova += csize;
>> +    iova &= PAGE_MASK;
>> +    map_pfn = page_to_pfn(page_list[0]);
>> +
>> +    /* One more iteration to avoid extra vdpa_map() call out of 
>> loop. */
>> +    for (i = 0; i <= npages; i++) {
>> +        unsigned long this_pfn;
>> +        u64 csize;
>> +
>> +        /* The last chunk may have no valid PFN next to it */
>> +        this_pfn = i < npages ? page_to_pfn(page_list[i]) : -1UL;
>> +
>> +        if (last_pfn && (this_pfn == -1UL ||
>> +                 this_pfn != last_pfn + 1)) {
>> +            /* Pin a contiguous chunk of memory */
>> +            csize = last_pfn - map_pfn + 1;
>> +            ret = vhost_vdpa_map(v, iova, csize << PAGE_SHIFT,
>> +                         map_pfn << PAGE_SHIFT,
>> +                         msg->perm);
>> +            if (ret) {
>> +                /*
>> +                 * Unpin the rest chunks of memory on the
>> +                 * flight with no corresponding vdpa_map()
>> +                 * calls having been made yet. On the other
>> +                 * hand, vdpa_unmap() in the failure path
>> +                 * is in charge of accounting the number of
>> +                 * pinned pages for its own.
>> +                 * This asymmetrical pattern of accounting
>> +                 * is for efficiency to pin all pages at
>> +                 * once, while there is no other callsite
>> +                 * of vdpa_map() than here above.
>> +                 */
>> +                unpin_user_pages(&page_list[nmap],
>> +                         npages - nmap);
>> +                goto out;
>>               }
>> -
>> -            last_pfn = this_pfn;
>> +            atomic64_add(csize, &dev->mm->pinned_vm);
>> +            nmap += csize;
>> +            iova += csize << PAGE_SHIFT;
>> +            map_pfn = this_pfn;
>>           }
>> -
>> -        cur_base += ret << PAGE_SHIFT;
>> -        npages -= ret;
>> +        last_pfn = this_pfn;
>>       }
>
>
> So what I suggest is to fix the pinning leakage first and do the 
> possible optimization on top (which is still questionable to me).
OK. Unfortunately, this was picked and got merged in upstream. So I will 
post a follow up patch set to 1) revert the commit to the original 
__get_free_page() implementation, and 2) fix the accounting and leakage 
on top. Will it be fine?


-Siwei
>
> Thanks
>
>
>>   -    /* Pin the rest chunk */
>> -    ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) << 
>> PAGE_SHIFT,
>> -                 map_pfn << PAGE_SHIFT, msg->perm);
>> +    WARN_ON(nmap != npages);
>>   out:
>> -    if (ret) {
>> +    if (ret)
>>           vhost_vdpa_unmap(v, msg->iova, msg->size);
>> -        atomic64_sub(npages, &dev->mm->pinned_vm);
>> -    }
>> +unlock:
>>       mmap_read_unlock(dev->mm);
>> -    free_page((unsigned long)page_list);
>> +free:
>> +    kvfree(vmas);
>> +    kvfree(page_list);
>>       return ret;
>>   }
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

  reply	other threads:[~2020-10-13 23:43 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-03  5:02 [PATCH v3 0/2] vhost-vdpa mapping error path fixes Si-Wei Liu
2020-10-03  5:02 ` Si-Wei Liu
2020-10-03  5:02 ` [PATCH v3 1/2] vhost-vdpa: fix vhost_vdpa_map() on error condition Si-Wei Liu
2020-10-03  5:02   ` Si-Wei Liu
2020-10-10  1:48   ` Jason Wang
2020-10-10  1:48     ` Jason Wang
2020-10-11  6:45     ` Michael S. Tsirkin
2020-10-11  6:45       ` Michael S. Tsirkin
2020-10-03  5:02 ` [PATCH v3 2/2] vhost-vdpa: fix page pinning leakage in error path Si-Wei Liu
2020-10-03  5:02   ` Si-Wei Liu
2020-10-10  2:27   ` Jason Wang
2020-10-10  2:27     ` Jason Wang
2020-10-13 23:42     ` si-wei liu [this message]
2020-10-13 23:42       ` si-wei liu
2020-10-14  6:52       ` Michael S. Tsirkin
2020-10-14  6:52         ` Michael S. Tsirkin
2020-10-14 11:41         ` Jason Wang
2020-10-14 11:41           ` Jason Wang
2020-10-15  6:15       ` Jason Wang
2020-10-15  6:15         ` Jason Wang
2020-10-15 13:11         ` Michael S. Tsirkin
2020-10-15 13:11           ` Michael S. Tsirkin
2020-10-15 20:17           ` si-wei liu
2020-10-15 20:17             ` si-wei liu
2020-10-29 21:53             ` Michael S. Tsirkin
2020-10-29 21:53               ` Michael S. Tsirkin
2020-10-29 23:22               ` si-wei liu
2020-10-29 23:22                 ` si-wei liu
2020-11-03  5:34               ` si-wei liu
2020-11-03  5:34                 ` si-wei liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5F863B83.6030204@oracle.com \
    --to=si-wei.liu@oracle.com \
    --cc=boris.ostrovsky@oracle.com \
    --cc=jasowang@redhat.com \
    --cc=joao.m.martins@oracle.com \
    --cc=lingshan.zhu@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mst@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=virtualization@lists.linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.