QEMU-Devel Archive on lore.kernel.org
 help / color / Atom feed
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Zhengxiao.zx@alibaba-inc.com, kevin.tian@intel.com,
	yi.l.liu@intel.com, cjia@nvidia.com, kvm@vger.kernel.org,
	eskultet@redhat.com, ziye.yang@intel.com, cohuck@redhat.com,
	shuangtai.tst@alibaba-inc.com, qemu-devel@nongnu.org,
	zhi.a.wang@intel.com, mlevitsk@redhat.com, pasic@linux.ibm.com,
	aik@ozlabs.ru, Kirti Wankhede <kwankhede@nvidia.com>,
	eauger@redhat.com, felipe@nutanix.com,
	jonathan.davies@nutanix.com, yan.y.zhao@intel.com,
	changpeng.liu@intel.com, Ken.Xue@amd.com
Subject: Re: [PATCH v15 Kernel 4/7] vfio iommu: Implementation of ioctl for dirty pages tracking.
Date: Mon, 23 Mar 2020 18:51:14 +0000
Message-ID: <20200323185114.GF3017@work-vm> (raw)
In-Reply-To: <20200323124448.2d3bc315@w520.home>

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Mon, 23 Mar 2020 23:24:37 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > On 3/21/2020 12:29 AM, Alex Williamson wrote:
> > > On Sat, 21 Mar 2020 00:12:04 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >   
> > >> On 3/20/2020 11:31 PM, Alex Williamson wrote:  
> > >>> On Fri, 20 Mar 2020 23:19:14 +0530
> > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >>>      
> > >>>> On 3/20/2020 4:27 AM, Alex Williamson wrote:  
> > >>>>> On Fri, 20 Mar 2020 01:46:41 +0530
> > >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >>>>>         
> > >>
> > >> <snip>
> > >>  
> > >>>>>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> > >>>>>> +				  size_t size, uint64_t pgsize,
> > >>>>>> +				  u64 __user *bitmap)
> > >>>>>> +{
> > >>>>>> +	struct vfio_dma *dma;
> > >>>>>> +	unsigned long pgshift = __ffs(pgsize);
> > >>>>>> +	unsigned int npages, bitmap_size;
> > >>>>>> +
> > >>>>>> +	dma = vfio_find_dma(iommu, iova, 1);
> > >>>>>> +
> > >>>>>> +	if (!dma)
> > >>>>>> +		return -EINVAL;
> > >>>>>> +
> > >>>>>> +	if (dma->iova != iova || dma->size != size)
> > >>>>>> +		return -EINVAL;
> > >>>>>> +
> > >>>>>> +	npages = dma->size >> pgshift;
> > >>>>>> +	bitmap_size = DIRTY_BITMAP_BYTES(npages);
> > >>>>>> +
> > >>>>>> +	/* mark all pages dirty if all pages are pinned and mapped. */
> > >>>>>> +	if (dma->iommu_mapped)
> > >>>>>> +		bitmap_set(dma->bitmap, 0, npages);
> > >>>>>> +
> > >>>>>> +	if (copy_to_user((void __user *)bitmap, dma->bitmap, bitmap_size))
> > >>>>>> +		return -EFAULT;  
> > >>>>>
> > >>>>> We still need to reset the bitmap here, clearing and re-adding the
> > >>>>> pages that are still pinned.
> > >>>>>
> > >>>>> https://lore.kernel.org/kvm/20200319070635.2ff5db56@x1.home/
> > >>>>>         
> > >>>>
> > >>>> I thought you agreed on my reply to it
> > >>>> https://lore.kernel.org/kvm/31621b70-02a9-2ea5-045f-f72b671fe703@nvidia.com/
> > >>>>     
> > >>>>    > Why re-populate when there will be no change since
> > >>>>    > vfio_iova_dirty_bitmap() is called holding iommu->lock? If there is any
> > >>>>    > pin request while vfio_iova_dirty_bitmap() is still working, it will
> > >>>>    > wait till iommu->lock is released. Bitmap will be populated when page is
> > >>>>    > pinned.  
> > >>>
> > >>> As coded, dirty bits are only ever set in the bitmap, never cleared.
> > >>> If a page is unpinned between iterations of the user recording the
> > >>> dirty bitmap, it should be marked dirty in the iteration immediately
> > >>> after the unpinning and not marked dirty in the following iteration.
> > >>> That doesn't happen here.  We're reporting cumulative dirty pages since
> > >>> logging was enabled, we need to be reporting dirty pages since the user
> > >>> last retrieved the dirty bitmap.  The bitmap should be cleared and
> > >>> currently pinned pages re-added after copying to the user.  Thanks,
> > >>>      
> > >>
> > >> Does that mean, we have to track every iteration? do we really need that
> > >> tracking?
> > >>
> > >> Generally the flow is:
> > >> - vendor driver pin x pages
> > >> - Enter pre-copy-phase where vCPUs are running - user starts dirty pages
> > >> tracking, then user asks dirty bitmap, x pages reported dirty by
> > >> VFIO_IOMMU_DIRTY_PAGES ioctl with _GET flag
> > >> - In pre-copy phase, vendor driver pins y more pages, now bitmap
> > >> consists of x+y bits set
> > >> - In pre-copy phase, vendor driver unpins z pages, but bitmap is not
> > >> updated, so again bitmap consists of x+y bits set.
> > >> - Enter in stop-and-copy phase, vCPUs are stopped, mdev devices are stopped
> > >> - user asks dirty bitmap - Since here vCPU and mdev devices are stopped,
> > >> pages should not get dirty by guest driver or the physical device.
> > >> Hence, x+y dirty pages would be reported.
> > >>
> > >> I don't think we need to track every iteration of bitmap reporting.  
> > > 
> > > Yes, once a bitmap is read, it's reset.  In your example, after
> > > unpinning z pages the user should still see a bitmap with x+y pages,
> > > but once they've read that bitmap, the next bitmap should be x+y-z.
> > > Userspace can make decisions about when to switch from pre-copy to
> > > stop-and-copy based on convergence, ie. the slope of the line recording
> > > dirty pages per iteration.  The implementation here never allows an
> > > inflection point, dirty pages reported through vfio would always either
> > > be flat or climbing.  There might also be a case that an iommu backed
> > > device could start pinning pages during the course of a migration, how
> > > would the bitmap ever revert from fully populated to only tracking the
> > > pinned pages?  Thanks,
> > >   
> > 
> > At KVM forum we discussed this - if guest driver pins say 1024 pages 
> > before migration starts, during pre-copy phase device can dirty 0 pages 
> > in best case and 1024 pages in worst case. In that case, user will 
> > transfer content of 1024 pages during pre-copy phase and in 
> > stop-and-copy phase also, that will be pages will be copied twice. So we 
> > decided to only get dirty pages bitmap at stop-and-copy phase. If user 
> > is going to get dirty pages in stop-and-copy phase only, then that will 
> > be single iteration.
> > There aren't any devices yet that can track sys memory dirty pages. So 
> > we can go ahead with this patch and support for dirty pages tracking 
> > during pre-copy phase can be added later when there will be consumers of 
> > that functionality.
> 
> So if I understand this right, you're expecting the dirty bitmap to
> accumulate dirty bits, in perpetuity, so that the user can only
> retrieve them once at the end of migration?  But if that's the case,
> the user could simply choose to not retrieve the bitmap until the end
> of migration, the result would be the same.  What we have here is that
> dirty bits are never cleared, regardless of whether the user has seen
> them, which is wrong.  Sorry, we had a lot of discussions at KVM forum,
> I don't recall this specific one 5 months later and maybe we weren't
> considering all aspects.  I see the behavior we have here as incorrect,
> but it also seems relatively trivial to make correct.  I hope the QEMU
> code isn't making us go through all this trouble to report a dirty
> bitmap that gets thrown away because it expects the final one to be
> cumulative since the beginning of dirty logging.  Thanks,

I remember the discussion that we couldn't track the system memory
dirtying with current hardware; so the question then is just to track
what has been pinned and then ideally put that memory off until the end.
(Which is interesting because I don't think we currently have  a way
to delay RAM pages till the end in qemu).

[I still worry whether migration will be usable with any
significant amount of system ram that's pinned in this way; the
downside will very easily get above the threshold that people like]

Dave

> Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



  reply index

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-19 20:16 [PATCH v15 Kernel 0/7] KABIs to support migration for VFIO devices Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 1/7] vfio: KABI for migration interface for device state Kirti Wankhede
2020-03-23 20:30   ` Auger Eric
2020-03-24 19:31     ` Kirti Wankhede
2020-03-26  9:33   ` Christoph Hellwig
2020-03-26 21:39     ` Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 2/7] vfio iommu: Remove atomicity of ref_count of pinned pages Kirti Wankhede
2020-03-23 20:30   ` Auger Eric
2020-03-24 19:34     ` Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 3/7] vfio iommu: Add ioctl definition for dirty pages tracking Kirti Wankhede
2020-03-23 21:11   ` Auger Eric
2020-03-24 19:49     ` Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 4/7] vfio iommu: Implementation of ioctl " Kirti Wankhede
2020-03-19 22:57   ` Alex Williamson
2020-03-20 17:49     ` Kirti Wankhede
2020-03-20 18:01       ` Alex Williamson
2020-03-20 18:42         ` Kirti Wankhede
2020-03-20 18:59           ` Alex Williamson
2020-03-23 17:54             ` Kirti Wankhede
2020-03-23 18:44               ` Alex Williamson
2020-03-23 18:51                 ` Dr. David Alan Gilbert [this message]
2020-03-24  3:01                   ` Yan Zhao
2020-03-24  9:45                     ` Kirti Wankhede
2020-03-24 14:36                     ` Alex Williamson
2020-03-24 20:23                       ` Dr. David Alan Gilbert
2020-03-25  5:31                         ` Tian, Kevin
2020-03-19 20:16 ` [PATCH v15 Kernel 5/7] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 6/7] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 7/7] vfio: Selective dirty page tracking if IOMMU backed device pins pages Kirti Wankhede

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200323185114.GF3017@work-vm \
    --to=dgilbert@redhat.com \
    --cc=Ken.Xue@amd.com \
    --cc=Zhengxiao.zx@alibaba-inc.com \
    --cc=aik@ozlabs.ru \
    --cc=alex.williamson@redhat.com \
    --cc=changpeng.liu@intel.com \
    --cc=cjia@nvidia.com \
    --cc=cohuck@redhat.com \
    --cc=eauger@redhat.com \
    --cc=eskultet@redhat.com \
    --cc=felipe@nutanix.com \
    --cc=jonathan.davies@nutanix.com \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=kwankhede@nvidia.com \
    --cc=mlevitsk@redhat.com \
    --cc=pasic@linux.ibm.com \
    --cc=qemu-devel@nongnu.org \
    --cc=shuangtai.tst@alibaba-inc.com \
    --cc=yan.y.zhao@intel.com \
    --cc=yi.l.liu@intel.com \
    --cc=zhi.a.wang@intel.com \
    --cc=ziye.yang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

QEMU-Devel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/qemu-devel/0 qemu-devel/git/0.git
	git clone --mirror https://lore.kernel.org/qemu-devel/1 qemu-devel/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 qemu-devel qemu-devel/ https://lore.kernel.org/qemu-devel \
		qemu-devel@nongnu.org
	public-inbox-index qemu-devel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.nongnu.qemu-devel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git