Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.

From: Alex Williamson <alex.williamson@redhat.com>
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: <cjia@nvidia.com>, <kevin.tian@intel.com>, <ziye.yang@intel.com>,
	<changpeng.liu@intel.com>, <yi.l.liu@intel.com>,
	<mlevitsk@redhat.com>, <eskultet@redhat.com>, <cohuck@redhat.com>,
	<dgilbert@redhat.com>, <jonathan.davies@nutanix.com>,
	<eauger@redhat.com>, <aik@ozlabs.ru>, <pasic@linux.ibm.com>,
	<felipe@nutanix.com>, <Zhengxiao.zx@Alibaba-inc.com>,
	<shuangtai.tst@alibaba-inc.com>, <Ken.Xue@amd.com>,
	<zhi.a.wang@intel.com>, <yan.y.zhao@intel.com>,
	<qemu-devel@nongnu.org>, <kvm@vger.kernel.org>
Subject: Re: [PATCH v12 Kernel 4/7] vfio iommu: Implementation of ioctl to for dirty pages tracking.
Date: Tue, 18 Feb 2020 21:53:30 -0700	[thread overview]
Message-ID: <20200218215330.5bc8fc6a@w520.home> (raw)
In-Reply-To: <96d9990f-b27f-759e-1395-9b2eff218bfa@nvidia.com>

On Wed, 19 Feb 2020 09:51:32 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 2/19/2020 3:11 AM, Alex Williamson wrote:
> > On Tue, 18 Feb 2020 11:28:53 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> <snip>
> >>  
> >>>>>>>     As I understand the above algorithm, we find a vfio_dma
> >>>>>>> overlapping the request and populate the bitmap for that range.  Then
> >>>>>>> we go back and put_user() for each byte that we touched.  We could
> >>>>>>> instead simply work on a one byte buffer as we enumerate the requested
> >>>>>>> range and do a put_user() ever time we reach the end of it and have bits
> >>>>>>> set. That would greatly simplify the above example.  But I would expect
> >>>>>>> that we're a) more likely to get asked for ranges covering a single
> >>>>>>> vfio_dma  
> >>>>>>
> >>>>>> QEMU ask for single vfio_dma during each iteration.
> >>>>>>
> >>>>>> If we restrict this ABI to cover single vfio_dma only, then it
> >>>>>> simplifies the logic here. That was my original suggestion. Should we
> >>>>>> think about that again?  
> >>>>>
> >>>>> But we currently allow unmaps that overlap multiple vfio_dmas as long
> >>>>> as no vfio_dma is bisected, so I think that implies that an unmap while
> >>>>> asking for the dirty bitmap has even further restricted semantics.  I'm
> >>>>> also reluctant to design an ABI around what happens to be the current
> >>>>> QEMU implementation.
> >>>>>
> >>>>> If we take your example above, ranges {0x0000,0xa000} and
> >>>>> {0xa000,0x10000} ({start,end}), I think you're working with the
> >>>>> following two bitmaps in this implementation:
> >>>>>
> >>>>> 00000011 11111111b
> >>>>> 00111111b
> >>>>>
> >>>>> And we need to combine those into:
> >>>>>
> >>>>> 11111111 11111111b
> >>>>>
> >>>>> Right?
> >>>>>
> >>>>> But it seems like that would be easier if the second bitmap was instead:
> >>>>>
> >>>>> 11111100b
> >>>>>
> >>>>> Then we wouldn't need to worry about the entire bitmap being shifted by
> >>>>> the bit offset within the byte, which limits our fixes to the boundary
> >>>>> byte and allows us to use copy_to_user() directly for the bulk of the
> >>>>> copy.  So how do we get there?
> >>>>>
> >>>>> I think we start with allocating the vfio_dma bitmap to account for
> >>>>> this initial offset, so we calculate bitmap_base_iova as:
> >>>>>      (iova & ~((PAGE_SIZE << 3) - 1))
> >>>>> We then use bitmap_base_iova in calculating which bits to set.
> >>>>>
> >>>>> The user needs to follow the same rules, and maybe this adds some value
> >>>>> to the user providing the bitmap size rather than the kernel
> >>>>> calculating it.  For example, if the user wanted the dirty bitmap for
> >>>>> the range {0xa000,0x10000} above, they'd provide at least a 1 byte
> >>>>> bitmap, but we'd return bit #2 set to indicate 0xa000 is dirty.
> >>>>>
> >>>>> Effectively the user can ask for any iova range, but the buffer will be
> >>>>> filled relative to the zeroth bit of the bitmap following the above
> >>>>> bitmap_base_iova formula (and replacing PAGE_SIZE with the user
> >>>>> requested pgsize).  I'm tempted to make this explicit in the user
> >>>>> interface (ie. only allow bitmaps starting on aligned pages), but a
> >>>>> user is able to map and unmap single pages and we need to support
> >>>>> returning a dirty bitmap with an unmap, so I don't think we can do that.
> >>>>>         
> >>>>
> >>>> Sigh, finding adjacent vfio_dmas within the same byte seems simpler than
> >>>> this.  
> >>>
> >>> How does KVM do this?  My intent was that if all of our bitmaps share
> >>> the same alignment then we can merge the intersection and continue to
> >>> use copy_to_user() on either side.  However, if QEMU doesn't do the
> >>> same, it doesn't really help us.  Is QEMU stuck with an implementation
> >>> of only retrieving dirty bits per MemoryRegionSection exactly because
> >>> of this issue and therefore we can rely on it in our implementation as
> >>> well?  Thanks,
> >>>      
> >>
> >> QEMU sync dirty_bitmap per MemoryRegionSection. Within
> >> MemoryRegionSection there could be multiple KVMSlots. QEMU queries
> >> dirty_bitmap per KVMSlot and mark dirty for each KVMSlot.
> >> On kernel side, KVM_GET_DIRTY_LOG ioctl calls
> >> kvm_get_dirty_log_protect(), where it uses copy_to_user() to copy bitmap
> >> of that memSlot.
> >> vfio_dma is per MemoryRegionSection. We can reply on MemoryRegionSection
> >> in our implementation. But to get bitmap during unmap, we have to take
> >> care of concatenating bitmaps.  
> > 
> > So KVM does not worry about bitmap alignment because the interface is
> > based on slots, a dirty bitmap can only be retrieved for a single,
> > entire slot.  We need VFIO_IOMMU_UNMAP_DMA to maintain its support for
> > spanning multiple vfio_dmas, but maybe we have some leeway that we
> > don't need to support both multiple vfio_dmas and dirty bitmap at the
> > same time.  It seems like it would be a massive simplification if we
> > required an unmap with dirty bitmap to span exactly one vfio_dma,
> > right?   
> 
> Yes.
> 
> > I don't see that we'd break any existing users with that, it's
> > unfortunate that we can't have the flexibility of the existing calling
> > convention, but I think there's good reason for it here.  Our separate
> > dirty bitmap log reporting would follow the same semantics.  I think
> > this all aligns with how the MemoryListener works in QEMU right now,
> > correct?  For example we wouldn't need any extra per MAP_DMA tracking
> > in QEMU like KVM has for its slots.
> >   
> 
> That right.
> Should we go ahead with the implementation to get dirty bitmap for one 
> vfio_dma for GET_DIRTY ioctl and unmap with dirty ioctl? Accordingly we 
> can have sanity checks in these ioctls.

Yes, I'm convinced that bitmap alignment is sufficiently too difficult
and unnecessary to restrict the calling convention of UNMAP_DMA, when
using the dirty bitmap extension, to exactly unmap a single vfio_dma.
Thanks,

Alex