QEMU-Devel Archive on lore.kernel.org
 help / color / Atom feed
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "Zhengxiao.zx@alibaba-inc.com" <Zhengxiao.zx@alibaba-inc.com>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	"Liu, Yi L" <yi.l.liu@intel.com>,
	"cjia@nvidia.com" <cjia@nvidia.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"eskultet@redhat.com" <eskultet@redhat.com>,
	"Yang, Ziye" <ziye.yang@intel.com>,
	"cohuck@redhat.com" <cohuck@redhat.com>,
	"shuangtai.tst@alibaba-inc.com" <shuangtai.tst@alibaba-inc.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"Wang, Zhi A" <zhi.a.wang@intel.com>,
	"mlevitsk@redhat.com" <mlevitsk@redhat.com>,
	"pasic@linux.ibm.com" <pasic@linux.ibm.com>,
	"aik@ozlabs.ru" <aik@ozlabs.ru>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	"eauger@redhat.com" <eauger@redhat.com>,
	"felipe@nutanix.com" <felipe@nutanix.com>,
	"jonathan.davies@nutanix.com" <jonathan.davies@nutanix.com>,
	Yan Zhao <yan.y.zhao@intel.com>,
	"Liu, Changpeng" <changpeng.liu@intel.com>,
	"Ken.Xue@amd.com" <Ken.Xue@amd.com>
Subject: Re: [PATCH v15 Kernel 4/7] vfio iommu: Implementation of ioctl for dirty pages tracking.
Date: Tue, 24 Mar 2020 20:23:04 +0000
Message-ID: <20200324202304.GJ2645@work-vm> (raw)
In-Reply-To: <20200324083644.36494641@w520.home>

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Mon, 23 Mar 2020 23:01:18 -0400
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Tue, Mar 24, 2020 at 02:51:14AM +0800, Dr. David Alan Gilbert wrote:
> > > * Alex Williamson (alex.williamson@redhat.com) wrote:  
> > > > On Mon, 23 Mar 2020 23:24:37 +0530
> > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >   
> > > > > On 3/21/2020 12:29 AM, Alex Williamson wrote:  
> > > > > > On Sat, 21 Mar 2020 00:12:04 +0530
> > > > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > > >     
> > > > > >> On 3/20/2020 11:31 PM, Alex Williamson wrote:    
> > > > > >>> On Fri, 20 Mar 2020 23:19:14 +0530
> > > > > >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > > >>>        
> > > > > >>>> On 3/20/2020 4:27 AM, Alex Williamson wrote:    
> > > > > >>>>> On Fri, 20 Mar 2020 01:46:41 +0530
> > > > > >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > > > >>>>>           
> > > > > >>
> > > > > >> <snip>
> > > > > >>    
> > > > > >>>>>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
> > > > > >>>>>> +				  size_t size, uint64_t pgsize,
> > > > > >>>>>> +				  u64 __user *bitmap)
> > > > > >>>>>> +{
> > > > > >>>>>> +	struct vfio_dma *dma;
> > > > > >>>>>> +	unsigned long pgshift = __ffs(pgsize);
> > > > > >>>>>> +	unsigned int npages, bitmap_size;
> > > > > >>>>>> +
> > > > > >>>>>> +	dma = vfio_find_dma(iommu, iova, 1);
> > > > > >>>>>> +
> > > > > >>>>>> +	if (!dma)
> > > > > >>>>>> +		return -EINVAL;
> > > > > >>>>>> +
> > > > > >>>>>> +	if (dma->iova != iova || dma->size != size)
> > > > > >>>>>> +		return -EINVAL;
> > > > > >>>>>> +
> > > > > >>>>>> +	npages = dma->size >> pgshift;
> > > > > >>>>>> +	bitmap_size = DIRTY_BITMAP_BYTES(npages);
> > > > > >>>>>> +
> > > > > >>>>>> +	/* mark all pages dirty if all pages are pinned and mapped. */
> > > > > >>>>>> +	if (dma->iommu_mapped)
> > > > > >>>>>> +		bitmap_set(dma->bitmap, 0, npages);
> > > > > >>>>>> +
> > > > > >>>>>> +	if (copy_to_user((void __user *)bitmap, dma->bitmap, bitmap_size))
> > > > > >>>>>> +		return -EFAULT;    
> > > > > >>>>>
> > > > > >>>>> We still need to reset the bitmap here, clearing and re-adding the
> > > > > >>>>> pages that are still pinned.
> > > > > >>>>>
> > > > > >>>>> https://lore.kernel.org/kvm/20200319070635.2ff5db56@x1.home/
> > > > > >>>>>           
> > > > > >>>>
> > > > > >>>> I thought you agreed on my reply to it
> > > > > >>>> https://lore.kernel.org/kvm/31621b70-02a9-2ea5-045f-f72b671fe703@nvidia.com/
> > > > > >>>>       
> > > > > >>>>    > Why re-populate when there will be no change since
> > > > > >>>>    > vfio_iova_dirty_bitmap() is called holding iommu->lock? If there is any
> > > > > >>>>    > pin request while vfio_iova_dirty_bitmap() is still working, it will
> > > > > >>>>    > wait till iommu->lock is released. Bitmap will be populated when page is
> > > > > >>>>    > pinned.    
> > > > > >>>
> > > > > >>> As coded, dirty bits are only ever set in the bitmap, never cleared.
> > > > > >>> If a page is unpinned between iterations of the user recording the
> > > > > >>> dirty bitmap, it should be marked dirty in the iteration immediately
> > > > > >>> after the unpinning and not marked dirty in the following iteration.
> > > > > >>> That doesn't happen here.  We're reporting cumulative dirty pages since
> > > > > >>> logging was enabled, we need to be reporting dirty pages since the user
> > > > > >>> last retrieved the dirty bitmap.  The bitmap should be cleared and
> > > > > >>> currently pinned pages re-added after copying to the user.  Thanks,
> > > > > >>>        
> > > > > >>
> > > > > >> Does that mean, we have to track every iteration? do we really need that
> > > > > >> tracking?
> > > > > >>
> > > > > >> Generally the flow is:
> > > > > >> - vendor driver pin x pages
> > > > > >> - Enter pre-copy-phase where vCPUs are running - user starts dirty pages
> > > > > >> tracking, then user asks dirty bitmap, x pages reported dirty by
> > > > > >> VFIO_IOMMU_DIRTY_PAGES ioctl with _GET flag
> > > > > >> - In pre-copy phase, vendor driver pins y more pages, now bitmap
> > > > > >> consists of x+y bits set
> > > > > >> - In pre-copy phase, vendor driver unpins z pages, but bitmap is not
> > > > > >> updated, so again bitmap consists of x+y bits set.
> > > > > >> - Enter in stop-and-copy phase, vCPUs are stopped, mdev devices are stopped
> > > > > >> - user asks dirty bitmap - Since here vCPU and mdev devices are stopped,
> > > > > >> pages should not get dirty by guest driver or the physical device.
> > > > > >> Hence, x+y dirty pages would be reported.
> > > > > >>
> > > > > >> I don't think we need to track every iteration of bitmap reporting.    
> > > > > > 
> > > > > > Yes, once a bitmap is read, it's reset.  In your example, after
> > > > > > unpinning z pages the user should still see a bitmap with x+y pages,
> > > > > > but once they've read that bitmap, the next bitmap should be x+y-z.
> > > > > > Userspace can make decisions about when to switch from pre-copy to
> > > > > > stop-and-copy based on convergence, ie. the slope of the line recording
> > > > > > dirty pages per iteration.  The implementation here never allows an
> > > > > > inflection point, dirty pages reported through vfio would always either
> > > > > > be flat or climbing.  There might also be a case that an iommu backed
> > > > > > device could start pinning pages during the course of a migration, how
> > > > > > would the bitmap ever revert from fully populated to only tracking the
> > > > > > pinned pages?  Thanks,
> > > > > >     
> > > > > 
> > > > > At KVM forum we discussed this - if guest driver pins say 1024 pages 
> > > > > before migration starts, during pre-copy phase device can dirty 0 pages 
> > > > > in best case and 1024 pages in worst case. In that case, user will 
> > > > > transfer content of 1024 pages during pre-copy phase and in 
> > > > > stop-and-copy phase also, that will be pages will be copied twice. So we 
> > > > > decided to only get dirty pages bitmap at stop-and-copy phase. If user 
> > > > > is going to get dirty pages in stop-and-copy phase only, then that will 
> > > > > be single iteration.
> > > > > There aren't any devices yet that can track sys memory dirty pages. So 
> > > > > we can go ahead with this patch and support for dirty pages tracking 
> > > > > during pre-copy phase can be added later when there will be consumers of 
> > > > > that functionality.  
> > > > 
> > > > So if I understand this right, you're expecting the dirty bitmap to
> > > > accumulate dirty bits, in perpetuity, so that the user can only
> > > > retrieve them once at the end of migration?  But if that's the case,
> > > > the user could simply choose to not retrieve the bitmap until the end
> > > > of migration, the result would be the same.  What we have here is that
> > > > dirty bits are never cleared, regardless of whether the user has seen
> > > > them, which is wrong.  Sorry, we had a lot of discussions at KVM forum,
> > > > I don't recall this specific one 5 months later and maybe we weren't
> > > > considering all aspects.  I see the behavior we have here as incorrect,
> > > > but it also seems relatively trivial to make correct.  I hope the QEMU
> > > > code isn't making us go through all this trouble to report a dirty
> > > > bitmap that gets thrown away because it expects the final one to be
> > > > cumulative since the beginning of dirty logging.  Thanks,  
> > > 
> > > I remember the discussion that we couldn't track the system memory
> > > dirtying with current hardware; so the question then is just to track  
> > hi Dave
> > there are already devices that are able to track the system memory,
> > through two ways:
> > (1) software method. like VFs for "Intel(R) Ethernet Controller XL710 Family
> > support".
> > (2) hardware method. through hardware internal buffer (as one Intel
> > internal hardware not yet to public, but very soon) or through VTD-3.0
> > IOMMU.
> > 
> > we have already had code verified using the two ways to track system memory
> > in fine-grained level.
> > 
> > 
> > > what has been pinned and then ideally put that memory off until the end.
> > > (Which is interesting because I don't think we currently have  a way
> > > to delay RAM pages till the end in qemu).  
> > 
> > I think the problem here is that we mixed pinned pages with dirty pages.
> 
> We are reporting dirty pages, pinned pages are just assumed to be dirty.
> 
> > yes, pinned pages for mdev devices are continuously likely to be dirty
> > until device stopped.
> > But for devices that are able to report dirty pages, dirtied pages
> > will be marked again if hardware writes them later.
> > 
> > So, is it good to introduce a capability to let vfio/qemu know how to
> > treat the dirty pages?
> 
> Dirty pages are dirty, QEMU doesn't need any special flag, instead we
> need to evolve different mechanisms for the vendor driver so that we
> can differentiate pages pinned for read vs pages pinned for write.
> Perhaps interfaces to pin pages without dirtying them, and a separate
> mechanism to dirty a previously pinned-page, ie. promote it permanently
> or transiently to a writable page.
> 
> > (1) for devices have no fine-grained dirty page tracking capability
> >   a. pinned pages are regarded as dirty pages. they are not cleared by
> >   dirty page query
> >   b. unpinned pages are regarded as dirty pages. they are cleared by
> >   dirty page query or UNMAP ioctl.
> > (2) for devices that have fine-grained dirty page tracking capability
> >    a. pinned/unpinned pages are not regarded as dirty pages
> 
> We need a pin-read-only interface for this.
> 
> >    b. only pages they reported are regarded as dirty pages and are to be
> >    cleared by dirty page query and UNMAP ioctl.
> 
> We need a set-dirty or promote-writable interface for this.
> 
> > (3) for dirty pages marking APIs, like vfio_dma_rw()...
> >    pages marked by them are regared as dirty and are to be cleared by
> >    dirty page query and UNMAP ioctl
> > 
> > For (1), qemu VFIO only reports dirty page amount and would not transfer
> > those pages until last round.
> > for (2) and (3), qemu VFIO should report and transfer them in each
> > round.
> 
> IMO, QEMU should not be aware of any of this.  Userspace has an
> interface to retrieve dirtied pages (period).  We should adjust the
> pages that we report as dirtied to be accurate based on the
> capabilities of the vendor driver.  We can evolve those internal APIs
> between the vendor driver and vfio iommu over time without modifying
> this user interface.

I'm not sure;  if you have a block of memory that's constantly marked
dirty in (1) - we need to avoid constantly retransmitting that memory to
the destination; there's no point in sending it until the end of the
iterations - so it shouldn't even get sent once in the iteration.
But at the same time, we can't ignore the fact that those pages are
going to be dirty - because that influences the downtime; so we need
to know we're going to be getting them later, even if we don't
initially mark them as dirty.

> > > [I still worry whether migration will be usable with any
> > > significant amount of system ram that's pinned in this way; the
> > > downside will very easily get above the threshold that people like]
> > >   
> > yes. that's why we have to do multi-round dirty page query and
> > transfer and clear the dirty bitmaps in each round for devices that are
> > able to track in fine grain.
> > and that's why we have to report the amount of dirty pages before
> > stop-and-copy phase for mdev devices, so that people are able to know
> > the real downtime as much as possible.
> 
> Yes, the dirty bitmap should be accurate to report the pages dirtied
> since it was last retrieved and over time we can add internal
> interfaces to give vendor drivers more granularity in marking pinned
> pages dirty and perhaps even exposing the bitmap to the vendor drivers
> to set pages themselves.  I don't necessarily think it's worthwhile to
> create a new class of dirtied pages to transfer at the end, we're
> fighting a losing battle at that point.  We should be focusing on
> improving the granularity of page dirtying in order to reduce the pages
> transferred at the end of migration.  Thanks,

Dave

> Alex
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



  reply index

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-19 20:16 [PATCH v15 Kernel 0/7] KABIs to support migration for VFIO devices Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 1/7] vfio: KABI for migration interface for device state Kirti Wankhede
2020-03-23 20:30   ` Auger Eric
2020-03-24 19:31     ` Kirti Wankhede
2020-03-26  9:33   ` Christoph Hellwig
2020-03-26 21:39     ` Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 2/7] vfio iommu: Remove atomicity of ref_count of pinned pages Kirti Wankhede
2020-03-23 20:30   ` Auger Eric
2020-03-24 19:34     ` Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 3/7] vfio iommu: Add ioctl definition for dirty pages tracking Kirti Wankhede
2020-03-23 21:11   ` Auger Eric
2020-03-24 19:49     ` Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 4/7] vfio iommu: Implementation of ioctl " Kirti Wankhede
2020-03-19 22:57   ` Alex Williamson
2020-03-20 17:49     ` Kirti Wankhede
2020-03-20 18:01       ` Alex Williamson
2020-03-20 18:42         ` Kirti Wankhede
2020-03-20 18:59           ` Alex Williamson
2020-03-23 17:54             ` Kirti Wankhede
2020-03-23 18:44               ` Alex Williamson
2020-03-23 18:51                 ` Dr. David Alan Gilbert
2020-03-24  3:01                   ` Yan Zhao
2020-03-24  9:45                     ` Kirti Wankhede
2020-03-24 14:36                     ` Alex Williamson
2020-03-24 20:23                       ` Dr. David Alan Gilbert [this message]
2020-03-25  5:31                         ` Tian, Kevin
2020-03-19 20:16 ` [PATCH v15 Kernel 5/7] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 6/7] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 7/7] vfio: Selective dirty page tracking if IOMMU backed device pins pages Kirti Wankhede

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200324202304.GJ2645@work-vm \
    --to=dgilbert@redhat.com \
    --cc=Ken.Xue@amd.com \
    --cc=Zhengxiao.zx@alibaba-inc.com \
    --cc=aik@ozlabs.ru \
    --cc=alex.williamson@redhat.com \
    --cc=changpeng.liu@intel.com \
    --cc=cjia@nvidia.com \
    --cc=cohuck@redhat.com \
    --cc=eauger@redhat.com \
    --cc=eskultet@redhat.com \
    --cc=felipe@nutanix.com \
    --cc=jonathan.davies@nutanix.com \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=kwankhede@nvidia.com \
    --cc=mlevitsk@redhat.com \
    --cc=pasic@linux.ibm.com \
    --cc=qemu-devel@nongnu.org \
    --cc=shuangtai.tst@alibaba-inc.com \
    --cc=yan.y.zhao@intel.com \
    --cc=yi.l.liu@intel.com \
    --cc=zhi.a.wang@intel.com \
    --cc=ziye.yang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

QEMU-Devel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/qemu-devel/0 qemu-devel/git/0.git
	git clone --mirror https://lore.kernel.org/qemu-devel/1 qemu-devel/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 qemu-devel qemu-devel/ https://lore.kernel.org/qemu-devel \
		qemu-devel@nongnu.org
	public-inbox-index qemu-devel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.nongnu.qemu-devel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git