From: Alex Williamson <alex.williamson@redhat.com>
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Zhengxiao.zx@Alibaba-inc.com, kevin.tian@intel.com,
yi.l.liu@intel.com, cjia@nvidia.com, kvm@vger.kernel.org,
eskultet@redhat.com, ziye.yang@intel.com, qemu-devel@nongnu.org,
cohuck@redhat.com, shuangtai.tst@alibaba-inc.com,
dgilbert@redhat.com, zhi.a.wang@intel.com, mlevitsk@redhat.com,
pasic@linux.ibm.com, aik@ozlabs.ru, eauger@redhat.com,
felipe@nutanix.com, jonathan.davies@nutanix.com,
yan.y.zhao@intel.com, changpeng.liu@intel.com, Ken.Xue@amd.com
Subject: Re: [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to for dirty pages tracking.
Date: Thu, 9 Jan 2020 07:53:04 -0700
Message-ID: <20200109075304.27c962dd@x1.home>
In-Reply-To: <f7db5eae-b650-6078-edb2-7fe20d71bd47@nvidia.com>
On Thu, 9 Jan 2020 18:59:40 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:
> On 1/9/2020 3:59 AM, Alex Williamson wrote:
> > On Thu, 9 Jan 2020 01:31:16 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >
> >> On 1/8/2020 3:32 AM, Alex Williamson wrote:
> >>> On Wed, 8 Jan 2020 01:37:03 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>
> >>
> >> <snip>
> >>
> >>>>>> +
> >>>>>> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
> >>>>>>
> >>>>>>  	if (do_accounting)
> >>>>>>  		vfio_lock_acct(dma, -unlocked, true);
> >>>>>> @@ -571,8 +606,12 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>>>>>
> >>>>>>  		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> >>>>>>  		if (vpfn) {
> >>>>>> -			phys_pfn[i] = vpfn->pfn;
> >>>>>> -			continue;
> >>>>>> +			if (vpfn->unpinned)
> >>>>>> +				vfio_remove_from_pfn_list(dma, vpfn);
> >>>>>
> >>>>> This seems inefficient, we have an allocated vpfn at the right places
> >>>>> in the list, wouldn't it be better to repin the page?
> >>>>>
> >>>>
> >>>> vfio_pin_page_external() takes care of pinning and accounting as well.
> >>>
> >>> Yes, but could we call vfio_pin_page_external() without {unlinking,
> >>> freeing} and {re-allocating, linking} on either side of it since it's
> >>> already in the list? That's the inefficient part. Maybe at least a
> >>> TODO comment?
> >>>
> >>
> >> Changing it as below:
> >>
> >>  		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> >>  		if (vpfn) {
> >> -			phys_pfn[i] = vpfn->pfn;
> >> -			continue;
> >> +			if (vpfn->ref_count > 1) {
> >> +				phys_pfn[i] = vpfn->pfn;
> >> +				continue;
> >> +			}
> >>  		}
> >>
> >>  		remote_vaddr = dma->vaddr + iova - dma->iova;
> >>  		ret = vfio_pin_page_external(dma, remote_vaddr,
> >>  					     &phys_pfn[i],
> >>  					     do_accounting);
> >>  		if (ret)
> >>  			goto pin_unwind;
> >> -
> >> -		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >> -		if (ret) {
> >> -			vfio_unpin_page_external(dma, iova, do_accounting);
> >> -			goto pin_unwind;
> >> -		}
> >> +		if (!vpfn) {
> >> +			ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >> +			if (ret) {
> >> +				vfio_unpin_page_external(dma, iova,
> >> +							 do_accounting,
> >> +							 false);
> >> +				goto pin_unwind;
> >> +			}
> >> +		} else
> >> +			vpfn->pfn = phys_pfn[i];
> >>  	}
> >>
> >>
> >>
> >>
> >>>>>> +			else {
> >>>>>> +				phys_pfn[i] = vpfn->pfn;
> >>>>>> +				continue;
> >>>>>> +			}
> >>>>>>  		}
> >>>>>>
> >>>>>>  		remote_vaddr = dma->vaddr + iova - dma->iova;
> >>>>>> @@ -583,7 +622,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>>>>>
> >>>>>>  		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >>>>>>  		if (ret) {
> >>>>>> -			vfio_unpin_page_external(dma, iova, do_accounting);
> >>>>>> +			vfio_unpin_page_external(dma, iova, do_accounting,
> >>>>>> +						 false);
> >>>>>>  			goto pin_unwind;
> >>>>>>  		}
> >>>>>>  	}
> >>
> >> <snip>
> >>
> >>>>
> >>>>>> +	if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> >>>>>> +		iommu->dirty_page_tracking = true;
> >>>>>> +		return 0;
> >>>>>> +	} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> >>>>>> +		iommu->dirty_page_tracking = false;
> >>>>>> +
> >>>>>> +		mutex_lock(&iommu->lock);
> >>>>>> +		vfio_remove_unpinned_from_dma_list(iommu);
> >>>>>> +		mutex_unlock(&iommu->lock);
> >>>>>> +		return 0;
> >>>>>> +
> >>>>>> +	} else if (range.flags &
> >>>>>> +			VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> >>>>>> +		uint64_t iommu_pgmask;
> >>>>>> +		unsigned long pgshift = __ffs(range.pgsize);
> >>>>>> +		unsigned long *bitmap;
> >>>>>> +		long bsize;
> >>>>>> +
> >>>>>> +		iommu_pgmask =
> >>>>>> +			((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >>>>>> +
> >>>>>> +		if (((range.pgsize - 1) & iommu_pgmask) !=
> >>>>>> +				(range.pgsize - 1))
> >>>>>> +			return -EINVAL;
> >>>>>> +
> >>>>>> +		if (range.iova & iommu_pgmask)
> >>>>>> +			return -EINVAL;
> >>>>>> +		if (!range.size || range.size > SIZE_MAX)
> >>>>>> +			return -EINVAL;
> >>>>>> +		if (range.iova + range.size < range.iova)
> >>>>>> +			return -EINVAL;
> >>>>>> +
> >>>>>> +		bsize = verify_bitmap_size(range.size >> pgshift,
> >>>>>> +					   range.bitmap_size);
> >>>>>> +		if (bsize < 0)
> >>>>>> +			return ret;
> >>>>>> +
> >>>>>> +		bitmap = kmalloc(bsize, GFP_KERNEL);
> >>>>>
> >>>>> I think I remember mentioning previously that we cannot allocate an
> >>>>> arbitrary buffer on behalf of the user, it's far too easy for them to
> >>>>> kill the kernel that way. kmalloc is also limited in what it can
> >>>>> alloc.
> >>>>
> >>>> That's the reason I added verify_bitmap_size(), so that the size is verified
> >>>
> >>> That's only a consistency test, it only verifies that the user claims
> >>> to provide a bitmap sized sufficiently for the range they're trying to
> >>> request. range.size is limited to SIZE_MAX, so up to 2^64 bytes;
> >>> divided by a 4KB page size that's 2^52 bits, and at 8 bits per byte
> >>> that's 2^49 bytes (512TB) of bitmap that we'd try to kmalloc.
> >>> kmalloc is good for a couple MB AIUI.
> >>> Meanwhile the user doesn't actually need to allocate that bitmap in
> >>> order to crash the kernel.
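Alex's worst-case arithmetic above can be checked mechanically. A small user-space sketch (plain C, not kernel code; `bitmap_bytes_log2` is an illustrative helper, not anything from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case dirty-bitmap size for a range of 2^size_log2 bytes with
 * pages of 2^page_shift bytes: one bit per page, 8 bits per byte.
 * Returned as a log2 so the arithmetic stays exact. */
static inline uint64_t bitmap_bytes_log2(unsigned size_log2,
					 unsigned page_shift)
{
	return (uint64_t)size_log2 - page_shift - 3;
}
```

With size_log2 = 64 (SIZE_MAX) and page_shift = 12 (4KB pages), this gives 2^49 bytes, i.e. 512TB, which is what makes the kmalloc untenable.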
> >>>
> >>>>> Can't we use the user buffer directly or only work on a part of
> >>>>> it at a time?
> >>>>>
> >>>>
> >>>> without copy_from_user(), how?
> >>>
> >>> For starters, what's the benefit of copying the bitmap from the user
> >>> in the first place? We presume the data is zero'd and if it's not,
> >>> that's the user's bug to sort out (we just need to define the API to
> >>> specify that). Therefore the copy_from_user() is unnecessary anyway and
> >>> we really just need to copy_to_user() for any places we're setting
> >>> bits. We could just walk through the range with an unsigned long
> >>> bitmap buffer, writing it out to userspace any time we reach the end
> >>> with bits set, zeroing it and shifting it as a window to the user
> >>> buffer. We could improve batching by allocating a larger buffer in the
> >>> kernel, with a kernel defined maximum size and performing the same
> >>> windowing scheme.
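The windowing scheme described above might be sketched as follows, as a user-space model rather than kernel code: `report_dirty`, `is_page_dirty`, and `WINDOW_BYTES` are illustrative names, memcpy() stands in for copy_to_user(), and is_page_dirty() stands in for the vpfn walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WINDOW_BYTES 8	/* one unsigned long's worth per flush */

static bool is_page_dirty(uint64_t pfn)
{
	return pfn % 3 == 0;	/* arbitrary stand-in for the vpfn lookup */
}

/* Walk the range with a small bitmap window, flushing it to the "user"
 * buffer whenever the window fills (or the range ends), instead of
 * allocating one huge bitmap covering the whole range. */
static void report_dirty(uint8_t *user_buf, uint64_t npages)
{
	uint8_t window[WINDOW_BYTES];
	uint64_t flushed = 0;	/* bytes already written to user_buf */

	memset(window, 0, sizeof(window));
	for (uint64_t pfn = 0; pfn < npages; pfn++) {
		uint64_t bit = pfn - flushed * 8;

		if (is_page_dirty(pfn))
			window[bit / 8] |= 1 << (bit % 8);
		if (bit == WINDOW_BYTES * 8 - 1 || pfn == npages - 1) {
			/* models copy_to_user() of the filled window */
			memcpy(user_buf + flushed, window, bit / 8 + 1);
			flushed += bit / 8 + 1;
			memset(window, 0, sizeof(window));
		}
	}
}
```

A larger WINDOW_BYTES trades kernel memory for fewer copy-out calls, which is exactly the batching question discussed below.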
> >>>
> >>
> >> Ok removing copy_from_user().
> >> But AFAIK, calling copy_to_user() multiple times is not efficient in
> >> terms of performance.
> >
> > Right, but even with a modestly sized internal buffer for batching we
> > can cover quite a large address space. 128MB for a 4KB buffer, 32GB
> > with 1MB buffer. __put_user() is more lightweight than copy_to_user(),
> > I wonder where the inflection point is in batching the latter versus
> > more iterations of the former.
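Those coverage figures follow directly: a buffer of B bytes holds B*8 page bits, so one flush covers B*8 pages. A quick check (user-space C; `coverage` is an illustrative helper):

```c
#include <assert.h>
#include <stdint.h>

/* Address range covered by one flush of a bitmap buffer of buf_bytes,
 * at one bit per page of (1 << page_shift) bytes. */
static uint64_t coverage(uint64_t buf_bytes, unsigned page_shift)
{
	return buf_bytes * 8 << page_shift;
}
```

A 4KB buffer at 4KB pages covers 128MB per flush; a 1MB buffer covers 32GB, matching the figures above.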
> >
> >> I checked the code in virt/kvm/kvm_main.c: __kvm_set_memory_region(),
> >> where dirty_bitmap is allocated, has generic sanity checks, a
> >> user-space address check, a memory-overflow check and a
> >> KVM_MEM_MAX_NR_PAGES limit, as below. I'll add an access_ok() check;
> >> I already added the overflow check.
> >>
> >> 	/* General sanity checks */
> >> 	if (mem->memory_size & (PAGE_SIZE - 1))
> >> 		goto out;
> >>
> >> 	if (!access_ok((void __user *)(unsigned long)mem->userspace_addr,
> >> 			mem->memory_size))
> >> 		goto out;
> >>
> >> 	if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
> >> 		goto out;
> >>
> >> 	if (npages > KVM_MEM_MAX_NR_PAGES)
> >> 		goto out;
> >>
> >>
> >> Where KVM_MEM_MAX_NR_PAGES is:
> >>
> >> /*
> >> * Some of the bitops functions do not support too long bitmaps.
> >> * This number must be determined not to exceed such limits.
> >> */
> >> #define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
> >>
> >> But we can't use the KVM-specific KVM_MEM_MAX_NR_PAGES check in the
> >> vfio module. Should we define a similar limit in vfio instead of
> >> SIZE_MAX?
> >
> > If we have ranges that are up to 2^31 pages, that's still 2^28 bytes.
> > Does it seem reasonable to have a kernel interface that potentially
> > allocates 256MB of kernel space with kmalloc accessible to users? That
> > still seems like a DoS attack vector to me, especially since the user
> > doesn't need to be able to map that much memory (8TB) to access it.
> >
> > I notice that KVM allocates the bitmap (kvzalloc) relative to the actual
> > size of the memory slot when dirty logging is enabled, maybe that's the
> > right approach rather than walking vpfn lists and maintaining unpinned
> > vpfns for the purposes of tracking. For example, when dirty logging is
> > enabled, walk all vfio_dmas and allocate a dirty bitmap anywhere the
> > vpfn list is not empty and walk the vpfn list to set dirty bits in the
> > bitmap.
>
> The bitmap would be allocated per vfio_dma, not per
> VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP request, right?
Per vfio_dma when dirty logging is enabled, i.e. between
VFIO_IOMMU_DIRTY_PAGES_FLAG_START and VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP.
> > When new pages are pinned, allocate a bitmap if not already
> > present and set the dirty bit. When unpinned, update the vpfn list but
> > leave the dirty bit set. When the dirty bitmap is read, copy out the
> > current bitmap to the user, memset it to zero, then re-walk the vpfn
> > list to set currently dirty pages.
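The pin/unpin/read protocol proposed above might be modeled in user space like this; `dma_model`, `pin`, `unpin`, and `read_dirty` are illustrative names, not from the patch, and the pinned[] array stands in for the vpfn list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NPAGES 16

struct dma_model {
	bool pinned[NPAGES];		/* stand-in for the vpfn list */
	uint8_t dirty[(NPAGES + 7) / 8];
};

static void pin(struct dma_model *d, unsigned pg)
{
	d->pinned[pg] = true;
	d->dirty[pg / 8] |= 1 << (pg % 8);	/* pinning dirties the page */
}

static void unpin(struct dma_model *d, unsigned pg)
{
	d->pinned[pg] = false;	/* dirty bit deliberately left set */
}

/* Report the bitmap to the "user", clear it, then re-walk the pinned
 * set so still-pinned pages remain dirty for the next read. */
static void read_dirty(struct dma_model *d, uint8_t *out)
{
	memcpy(out, d->dirty, sizeof(d->dirty));
	memset(d->dirty, 0, sizeof(d->dirty));
	for (unsigned pg = 0; pg < NPAGES; pg++)
		if (d->pinned[pg])
			d->dirty[pg / 8] |= 1 << (pg % 8);
}
```

In this model a page that was pinned and then unpinned is reported dirty exactly once, while a page that stays pinned is reported on every read, which is the "remember"/"forget" behavior described above.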
>
> Why is the re-walk required again? Pinning/unpinning and reporting
> dirty pages are done holding iommu->lock, so there shouldn't be a race
> condition.
In order to "remember" that a page was dirtied, I proposed above that
we don't change the bitmap when a page is unpinned. We can "forget"
that a page was dirtied if it's no longer pinned and we've told the
user about it. Therefore we need to purge the not-currently-pinned
pages from the bitmap and rebuild it.
> > A vfio_dma without a dirty bitmap
> > would consider the entire range dirty.
>
> That would depend on the (!iommu->pinned_page_dirty_scope &&
> dma->iommu_mapped) condition to mark the entire range dirty.
I assumed we wouldn't bother to maintain a bitmap unless these
conditions are already met.
> Here, even if the vpfn list is empty, memory for dirty_bitmap needs to
> be allocated, all bits memset to 1, and then copy_to_user() called.
I was assuming we might use a __put_user() loop to fill such ranges
rather than waste memory tracking fully populated bitmaps.
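That fill loop might look like the following user-space model; `fill_all_dirty` is an illustrative name, and the direct stores into the buffer stand in for __put_user() on the user bitmap, so no kernel-side bitmap is allocated at all for a fully dirty range.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Mark every page of an npages-long range dirty directly in the user's
 * bitmap: whole words first, then leftover bytes, then a partial
 * trailing byte if npages is not a multiple of 8. */
static void fill_all_dirty(uint8_t *user_buf, uint64_t npages)
{
	uint64_t full_bytes = npages / 8;
	uint64_t i;

	for (i = 0; i + sizeof(uint64_t) <= full_bytes; i += sizeof(uint64_t)) {
		uint64_t ones = ~0ULL;

		memcpy(user_buf + i, &ones, sizeof(ones));  /* ~ __put_user() */
	}
	for (; i < full_bytes; i++)
		user_buf[i] = 0xff;
	if (npages % 8)
		user_buf[full_bytes] = (1 << (npages % 8)) - 1;
}
```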
> If we go with this approach, then I think there should be a
> restriction that the bitmap is retrieved according to the way the
> mappings were created; multiple mappings can be clubbed together, but a
> request shouldn't bisect a mapping - similar to the unmap restriction.
Why? It seems like it's just some pointer arithmetic to copy out the
section of the bitmap that the user requests. Thanks,
Alex
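For the byte-aligned case, the pointer arithmetic Alex refers to could be modeled as below; `copy_window` is an illustrative name, memcpy stands in for copy_to_user(), and the request is assumed to start and end on 8-page boundaries (unaligned requests would additionally need bit shifting).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy out only the requested window of a per-dma bitmap: compute the
 * byte offset of the requested start pfn within the dma's bitmap and
 * copy the covered bytes, with no reallocation. */
static void copy_window(const uint8_t *dma_bitmap, uint64_t dma_start_pfn,
			uint64_t req_start_pfn, uint64_t req_npages,
			uint8_t *out)
{
	memcpy(out, dma_bitmap + (req_start_pfn - dma_start_pfn) / 8,
	       req_npages / 8);	/* models copy_to_user() */
}
```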
Thread overview: 20+ messages
2019-12-17 17:10 [PATCH v11 Kernel 0/6] KABIs to support migration for VFIO devices Kirti Wankhede
2019-12-17 17:10 ` [PATCH v11 Kernel 1/6] vfio: KABI for migration interface for device state Kirti Wankhede
2019-12-17 17:10 ` [PATCH v11 Kernel 2/6] vfio iommu: Add ioctl definition for dirty pages tracking Kirti Wankhede
2019-12-17 17:10 ` [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to " Kirti Wankhede
2019-12-17 22:12 ` Alex Williamson
2020-01-07 20:07 ` Kirti Wankhede
2020-01-07 22:02 ` Alex Williamson
2020-01-08 20:01 ` Kirti Wankhede
2020-01-08 22:29 ` Alex Williamson
2020-01-09 13:29 ` Kirti Wankhede
2020-01-09 14:53 ` Alex Williamson [this message]
2019-12-17 17:10 ` [PATCH v11 Kernel 4/6] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
2019-12-17 22:55 ` Alex Williamson
2019-12-17 17:10 ` [PATCH v11 Kernel 5/6] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
2019-12-17 17:10 ` [PATCH v11 Kernel 6/6] vfio: Selective dirty page tracking if IOMMU backed device pins pages Kirti Wankhede
2019-12-18 0:12 ` Alex Williamson
2020-01-07 20:45 ` Kirti Wankhede
2020-01-08 0:09 ` Alex Williamson
2020-01-08 20:52 ` Kirti Wankhede
2020-01-08 22:59 ` Alex Williamson