Re: [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to for dirty pages tracking.

From: Alex Williamson <alex.williamson@redhat.com>
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Zhengxiao.zx@Alibaba-inc.com, kevin.tian@intel.com,
	yi.l.liu@intel.com, cjia@nvidia.com, kvm@vger.kernel.org,
	eskultet@redhat.com, ziye.yang@intel.com, qemu-devel@nongnu.org,
	cohuck@redhat.com, shuangtai.tst@alibaba-inc.com,
	dgilbert@redhat.com, zhi.a.wang@intel.com, mlevitsk@redhat.com,
	pasic@linux.ibm.com, aik@ozlabs.ru, eauger@redhat.com,
	felipe@nutanix.com, jonathan.davies@nutanix.com,
	yan.y.zhao@intel.com, changpeng.liu@intel.com, Ken.Xue@amd.com
Subject: Re: [PATCH v11 Kernel 3/6] vfio iommu: Implementation of ioctl to for dirty pages tracking.
Date: Wed, 8 Jan 2020 15:29:34 -0700	[thread overview]
Message-ID: <20200108152934.68cd0e85@w520.home> (raw)
In-Reply-To: <d2faa3fe-d656-5ba7-475a-9646298e3d50@nvidia.com>

On Thu, 9 Jan 2020 01:31:16 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 1/8/2020 3:32 AM, Alex Williamson wrote:
> > On Wed, 8 Jan 2020 01:37:03 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> 
> <snip>
> 
> >>>> +
> >>>> +	unlocked = vfio_iova_put_vfio_pfn(dma, vpfn, dirty_tracking);
> >>>>    
> >>>>    	if (do_accounting)
> >>>>    		vfio_lock_acct(dma, -unlocked, true);
> >>>> @@ -571,8 +606,12 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>>>    
> >>>>    		vpfn = vfio_iova_get_vfio_pfn(dma, iova);
> >>>>    		if (vpfn) {
> >>>> -			phys_pfn[i] = vpfn->pfn;
> >>>> -			continue;
> >>>> +			if (vpfn->unpinned)
> >>>> +				vfio_remove_from_pfn_list(dma, vpfn);  
> >>>
> >>> This seems inefficient, we have an allocated vpfn at the right places
> >>> in the list, wouldn't it be better to repin the page?
> >>>      
> >>
> >> vfio_pin_page_external() takes care of pinning and accounting as well.  
> > 
> > Yes, but could we call vfio_pin_page_external() without {unlinking,
> > freeing} and {re-allocating, linking} on either side of it since it's
> > already in the list?  That's the inefficient part.  Maybe at least a
> > TODO comment?
> >   
> 
> Changing it as below:
> 
>                  vpfn = vfio_iova_get_vfio_pfn(dma, iova);
>                  if (vpfn) {
> -                       phys_pfn[i] = vpfn->pfn;
> -                       continue;
> +                       if (vpfn->ref_count > 1) {
> +                               phys_pfn[i] = vpfn->pfn;
> +                               continue;
> +                       }
>                  }
> 
>                  remote_vaddr = dma->vaddr + iova - dma->iova;
>                  ret = vfio_pin_page_external(dma, remote_vaddr, 
> &phys_pfn[i],
>                                               do_accounting);
>                  if (ret)
>                          goto pin_unwind;
> -
> -               ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> -               if (ret) {
> -                       vfio_unpin_page_external(dma, iova, do_accounting);
> -                       goto pin_unwind;
> -               }
> +               if (!vpfn) {
> +                       ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> +                       if (ret) {
> +                               vfio_unpin_page_external(dma, iova,
> +                                                        do_accounting, 
> false);
> +                               goto pin_unwind;
> +                       }
> +               } else
> +                       vpfn->pfn = phys_pfn[i];
>          }
> 
> 
> 
> 
> >>>> +			else {
> >>>> +				phys_pfn[i] = vpfn->pfn;
> >>>> +				continue;
> >>>> +			}
> >>>>    		}
> >>>>    
> >>>>    		remote_vaddr = dma->vaddr + iova - dma->iova;
> >>>> @@ -583,7 +622,8 @@ static int vfio_iommu_type1_pin_pages(void *iommu_data,
> >>>>    
> >>>>    		ret = vfio_add_to_pfn_list(dma, iova, phys_pfn[i]);
> >>>>    		if (ret) {
> >>>> -			vfio_unpin_page_external(dma, iova, do_accounting);
> >>>> +			vfio_unpin_page_external(dma, iova, do_accounting,
> >>>> +						 false);
> >>>>    			goto pin_unwind;
> >>>>    		}
> >>>>    	}  
> 
> <snip>
> 
> >>  
> >>>> +		if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_START) {
> >>>> +			iommu->dirty_page_tracking = true;
> >>>> +			return 0;
> >>>> +		} else if (range.flags & VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP) {
> >>>> +			iommu->dirty_page_tracking = false;
> >>>> +
> >>>> +			mutex_lock(&iommu->lock);
> >>>> +			vfio_remove_unpinned_from_dma_list(iommu);
> >>>> +			mutex_unlock(&iommu->lock);
> >>>> +			return 0;
> >>>> +
> >>>> +		} else if (range.flags &
> >>>> +				 VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP) {
> >>>> +			uint64_t iommu_pgmask;
> >>>> +			unsigned long pgshift = __ffs(range.pgsize);
> >>>> +			unsigned long *bitmap;
> >>>> +			long bsize;
> >>>> +
> >>>> +			iommu_pgmask =
> >>>> +			 ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> >>>> +
> >>>> +			if (((range.pgsize - 1) & iommu_pgmask) !=
> >>>> +			    (range.pgsize - 1))
> >>>> +				return -EINVAL;
> >>>> +
> >>>> +			if (range.iova & iommu_pgmask)
> >>>> +				return -EINVAL;
> >>>> +			if (!range.size || range.size > SIZE_MAX)
> >>>> +				return -EINVAL;
> >>>> +			if (range.iova + range.size < range.iova)
> >>>> +				return -EINVAL;
> >>>> +
> >>>> +			bsize = verify_bitmap_size(range.size >> pgshift,
> >>>> +						   range.bitmap_size);
> >>>> +			if (bsize < 0)
> >>>> +				return ret;
> >>>> +
> >>>> +			bitmap = kmalloc(bsize, GFP_KERNEL);  
> >>>
> >>> I think I remember mentioning previously that we cannot allocate an
> >>> arbitrary buffer on behalf of the user, it's far too easy for them to
> >>> kill the kernel that way.  kmalloc is also limited in what it can
> >>> alloc.  
> >>
> >> That's the reason added verify_bitmap_size(), so that size is verified  
> > 
> > That's only a consistency test, it only verifies that the user claims
> > to provide a bitmap sized sufficiently for the range they're trying to
> > request.  range.size is limited to SIZE_MAX, so 2^64, divided by page
> > size for 2^52 bits, 8bits per byte for 2^49 bytes of bitmap that we'd
> > try to kmalloc (512TB).  kmalloc is good for a couple MB AIUI.
> > Meanwhile the user doesn't actually need to allocate that bitmap in
> > order to crash the kernel.
> >   
> >>> Can't we use the user buffer directly or only work on a part of
> >>> it at a time?
> >>>      
> >>
> >> without copy_from_user(), how?  
> > 
> > For starters, what's the benefit of copying the bitmap from the user
> > in the first place?  We presume the data is zero'd and if it's not,
> > that's the user's bug to sort out (we just need to define the API to
> > specify that).  Therefore the copy_from_user() is unnecessary anyway and
> > we really just need to copy_to_user() for any places we're setting
> > bits.  We could just walk through the range with an unsigned long
> > bitmap buffer, writing it out to userspace any time we reach the end
> > with bits set, zeroing it and shifting it as a window to the user
> > buffer.  We could improve batching by allocating a larger buffer in the
> > kernel, with a kernel defined maximum size and performing the same
> > windowing scheme.
> >   
> 
> Ok removing copy_from_user().
> But AFAIK, calling copy_to_user() multiple times is not efficient in 
> terms of performance.

Right, but even with a modestly sized internal buffer for batching we
can cover quite a large address space.  128MB for a 4KB buffer, 32GB
with 1MB buffer.  __put_user() is more lightweight than copy_to_user(),
I wonder where the inflection point is in batching the latter versus
more iterations of the former.

> Checked code in virt/kvm/kvm_main.c: __kvm_set_memory_region() where 
> dirty_bitmap is allocated, that has generic checks, user space address 
> check, memory overflow check and KVM_MEM_MAX_NR_PAGES as below. I'll add 
> access_ok check. I already added overflow check.
> 
>          /* General sanity checks */
>          if (mem->memory_size & (PAGE_SIZE - 1))
>                  goto out;
> 
>         !access_ok((void __user *)(unsigned long)mem->userspace_addr,
>                          mem->memory_size)))
> 
>          if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr)
>                  goto out;
> 
>          if (npages > KVM_MEM_MAX_NR_PAGES)
>                  goto out;
> 
> 
> Where KVM_MEM_MAX_NR_PAGES is:
> 
> /*
>   * Some of the bitops functions do not support too long bitmaps.
>   * This number must be determined not to exceed such limits.
>   */
> #define KVM_MEM_MAX_NR_PAGES ((1UL << 31) - 1)
> 
> But we can't use KVM specific KVM_MEM_MAX_NR_PAGES check in vfio module.
> Should we define similar limit in vfio module instead of SIZE_MAX?

If we have ranges that are up to 2^31 pages, that's still 2^28 bytes.
Does it seem reasonable to have a kernel interface that potentially
allocates 256MB of kernel space with kmalloc accessible to users?  That
still seems like a DoS attack vector to me, especially since the user
doesn't need to be able to map that much memory (8TB) to access it.

I notice that KVM allocate the bitmap (kvzalloc) relative to the actual
size of the memory slot when dirty logging is enabled, maybe that's the
right approach rather than walking vpfn lists and maintaining unpinned
vpfns for the purposes of tracking.  For example, when dirty logging is
enabled, walk all vfio_dmas and allocate a dirty bitmap anywhere the
vpfn list is not empty and walk the vpfn list to set dirty bits in the
bitmap.  When new pages are pinned, allocate a bitmap if not already
present and set the dirty bit.  When unpinned, update the vpfn list but
leave the dirty bit set.  When the dirty bitmap is read, copy out the
current bitmap to the user, memset it to zero, then re-walk the vpfn
list to set currently dirty pages.  A vfio_dma without a dirty bitmap
would consider the entire range dirty.  At least that way the overhead
of the bitmap is just that, overhead rather than a future exploit.
Does this seem like a better approach?  Thanks,

Alex