QEMU-Devel Archive on lore.kernel.org
 help / color / Atom feed
From: Kirti Wankhede <kwankhede@nvidia.com>
To: Yan Zhao <yan.y.zhao@intel.com>,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: "Zhengxiao.zx@alibaba-inc.com" <Zhengxiao.zx@alibaba-inc.com>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	"Liu, Yi L" <yi.l.liu@intel.com>,
	"cjia@nvidia.com" <cjia@nvidia.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"eskultet@redhat.com" <eskultet@redhat.com>,
	"Yang, Ziye" <ziye.yang@intel.com>,
	"cohuck@redhat.com" <cohuck@redhat.com>,
	"shuangtai.tst@alibaba-inc.com" <shuangtai.tst@alibaba-inc.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"Wang,  Zhi A" <zhi.a.wang@intel.com>,
	"mlevitsk@redhat.com" <mlevitsk@redhat.com>,
	"pasic@linux.ibm.com" <pasic@linux.ibm.com>,
	"aik@ozlabs.ru" <aik@ozlabs.ru>,
	Alex Williamson <alex.williamson@redhat.com>,
	"eauger@redhat.com" <eauger@redhat.com>,
	"felipe@nutanix.com" <felipe@nutanix.com>,
	"jonathan.davies@nutanix.com" <jonathan.davies@nutanix.com>,
	"Liu, Changpeng" <changpeng.liu@intel.com>,
	"Ken.Xue@amd.com" <Ken.Xue@amd.com>
Subject: Re: [PATCH v15 Kernel 4/7] vfio iommu: Implementation of ioctl for dirty pages tracking.
Date: Tue, 24 Mar 2020 15:15:07 +0530
Message-ID: <d271acf4-7fd6-c93d-aa3d-399a7ace20cf@nvidia.com> (raw)
In-Reply-To: <20200324030118.GD5456@joy-OptiPlex-7040>



On 3/24/2020 8:31 AM, Yan Zhao wrote:
> On Tue, Mar 24, 2020 at 02:51:14AM +0800, Dr. David Alan Gilbert wrote:
>> * Alex Williamson (alex.williamson@redhat.com) wrote:
>>> On Mon, 23 Mar 2020 23:24:37 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>
>>>> On 3/21/2020 12:29 AM, Alex Williamson wrote:
>>>>> On Sat, 21 Mar 2020 00:12:04 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>    
>>>>>> On 3/20/2020 11:31 PM, Alex Williamson wrote:
>>>>>>> On Fri, 20 Mar 2020 23:19:14 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>       
>>>>>>>> On 3/20/2020 4:27 AM, Alex Williamson wrote:
>>>>>>>>> On Fri, 20 Mar 2020 01:46:41 +0530
>>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>>>          
>>>>>>
>>>>>> <snip>
>>>>>>   
>>>>>>>>>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
>>>>>>>>>> +				  size_t size, uint64_t pgsize,
>>>>>>>>>> +				  u64 __user *bitmap)
>>>>>>>>>> +{
>>>>>>>>>> +	struct vfio_dma *dma;
>>>>>>>>>> +	unsigned long pgshift = __ffs(pgsize);
>>>>>>>>>> +	unsigned int npages, bitmap_size;
>>>>>>>>>> +
>>>>>>>>>> +	dma = vfio_find_dma(iommu, iova, 1);
>>>>>>>>>> +
>>>>>>>>>> +	if (!dma)
>>>>>>>>>> +		return -EINVAL;
>>>>>>>>>> +
>>>>>>>>>> +	if (dma->iova != iova || dma->size != size)
>>>>>>>>>> +		return -EINVAL;
>>>>>>>>>> +
>>>>>>>>>> +	npages = dma->size >> pgshift;
>>>>>>>>>> +	bitmap_size = DIRTY_BITMAP_BYTES(npages);
>>>>>>>>>> +
>>>>>>>>>> +	/* mark all pages dirty if all pages are pinned and mapped. */
>>>>>>>>>> +	if (dma->iommu_mapped)
>>>>>>>>>> +		bitmap_set(dma->bitmap, 0, npages);
>>>>>>>>>> +
>>>>>>>>>> +	if (copy_to_user((void __user *)bitmap, dma->bitmap, bitmap_size))
>>>>>>>>>> +		return -EFAULT;
>>>>>>>>>
>>>>>>>>> We still need to reset the bitmap here, clearing and re-adding the
>>>>>>>>> pages that are still pinned.
>>>>>>>>>
>>>>>>>>> https://lore.kernel.org/kvm/20200319070635.2ff5db56@x1.home/
>>>>>>>>>          
>>>>>>>>
>>>>>>>> I thought you agreed on my reply to it
>>>>>>>> https://lore.kernel.org/kvm/31621b70-02a9-2ea5-045f-f72b671fe703@nvidia.com/
>>>>>>>>      
>>>>>>>>     > Why re-populate when there will be no change since
>>>>>>>>     > vfio_iova_dirty_bitmap() is called holding iommu->lock? If there is any
>>>>>>>>     > pin request while vfio_iova_dirty_bitmap() is still working, it will
>>>>>>>>     > wait till iommu->lock is released. Bitmap will be populated when page is
>>>>>>>>     > pinned.
>>>>>>>
>>>>>>> As coded, dirty bits are only ever set in the bitmap, never cleared.
>>>>>>> If a page is unpinned between iterations of the user recording the
>>>>>>> dirty bitmap, it should be marked dirty in the iteration immediately
>>>>>>> after the unpinning and not marked dirty in the following iteration.
>>>>>>> That doesn't happen here.  We're reporting cumulative dirty pages since
>>>>>>> logging was enabled, we need to be reporting dirty pages since the user
>>>>>>> last retrieved the dirty bitmap.  The bitmap should be cleared and
>>>>>>> currently pinned pages re-added after copying to the user.  Thanks,
>>>>>>>       
>>>>>>
>>>>>> Does that mean, we have to track every iteration? do we really need that
>>>>>> tracking?
>>>>>>
>>>>>> Generally the flow is:
>>>>>> - vendor driver pin x pages
>>>>>> - Enter pre-copy-phase where vCPUs are running - user starts dirty pages
>>>>>> tracking, then user asks dirty bitmap, x pages reported dirty by
>>>>>> VFIO_IOMMU_DIRTY_PAGES ioctl with _GET flag
>>>>>> - In pre-copy phase, vendor driver pins y more pages, now bitmap
>>>>>> consists of x+y bits set
>>>>>> - In pre-copy phase, vendor driver unpins z pages, but bitmap is not
>>>>>> updated, so again bitmap consists of x+y bits set.
>>>>>> - Enter in stop-and-copy phase, vCPUs are stopped, mdev devices are stopped
>>>>>> - user asks dirty bitmap - Since here vCPU and mdev devices are stopped,
>>>>>> pages should not get dirty by guest driver or the physical device.
>>>>>> Hence, x+y dirty pages would be reported.
>>>>>>
>>>>>> I don't think we need to track every iteration of bitmap reporting.
>>>>>
>>>>> Yes, once a bitmap is read, it's reset.  In your example, after
>>>>> unpinning z pages the user should still see a bitmap with x+y pages,
>>>>> but once they've read that bitmap, the next bitmap should be x+y-z.
>>>>> Userspace can make decisions about when to switch from pre-copy to
>>>>> stop-and-copy based on convergence, ie. the slope of the line recording
>>>>> dirty pages per iteration.  The implementation here never allows an
>>>>> inflection point, dirty pages reported through vfio would always either
>>>>> be flat or climbing.  There might also be a case that an iommu backed
>>>>> device could start pinning pages during the course of a migration, how
>>>>> would the bitmap ever revert from fully populated to only tracking the
>>>>> pinned pages?  Thanks,
>>>>>    
>>>>
>>>> At KVM forum we discussed this - if guest driver pins say 1024 pages
>>>> before migration starts, during pre-copy phase device can dirty 0 pages
>>>> in best case and 1024 pages in worst case. In that case, user will
>>>> transfer content of 1024 pages during pre-copy phase and in
>>>> stop-and-copy phase also, that will be pages will be copied twice. So we
>>>> decided to only get dirty pages bitmap at stop-and-copy phase. If user
>>>> is going to get dirty pages in stop-and-copy phase only, then that will
>>>> be single iteration.
>>>> There aren't any devices yet that can track sys memory dirty pages. So
>>>> we can go ahead with this patch and support for dirty pages tracking
>>>> during pre-copy phase can be added later when there will be consumers of
>>>> that functionality.
>>>
>>> So if I understand this right, you're expecting the dirty bitmap to
>>> accumulate dirty bits, in perpetuity, so that the user can only
>>> retrieve them once at the end of migration?  But if that's the case,
>>> the user could simply choose to not retrieve the bitmap until the end
>>> of migration, the result would be the same.  What we have here is that
>>> dirty bits are never cleared, regardless of whether the user has seen
>>> them, which is wrong.  Sorry, we had a lot of discussions at KVM forum,
>>> I don't recall this specific one 5 months later and maybe we weren't
>>> considering all aspects.  I see the behavior we have here as incorrect,
>>> but it also seems relatively trivial to make correct.  I hope the QEMU
>>> code isn't making us go through all this trouble to report a dirty
>>> bitmap that gets thrown away because it expects the final one to be
>>> cumulative since the beginning of dirty logging.  Thanks,
>>
>> I remember the discussion that we couldn't track the system memory
>> dirtying with current hardware; so the question then is just to track

> hi Dave
> there are already devices that are able to track the system memory,
> through two ways:
> (1) software method. like VFs for "Intel(R) Ethernet Controller XL710 Family
> support".
> (2) hardware method. through hardware internal buffer (as one Intel
> internal hardware not yet to public, but very soon) or through VTD-3.0
> IOMMU.
> 
> we have already had code verified using the two ways to track system memory
> in fine-grained level.
> 
> 
>> what has been pinned and then ideally put that memory off until the end.
>> (Which is interesting because I don't think we currently have  a way
>> to delay RAM pages till the end in qemu).
> 
> I think the problem here is that we mixed pinned pages with dirty pages.
> yes, pinned pages for mdev devices are continuously likely to be dirty
> until device stopped.
> But for devices that are able to report dirty pages, dirtied pages
> will be marked again if hardware writes them later.
> 
> So, is it good to introduce a capability to let vfio/qemu know how to
> treat the dirty pages?
> (1) for devices have no fine-grained dirty page tracking capability
>    a. pinned pages are regarded as dirty pages. they are not cleared by
>    dirty page query
>    b. unpinned pages are regarded as dirty pages. they are cleared by
>    dirty page query or UNMAP ioctl.
> (2) for devices that have fine-grained dirty page tracking capability
>     a. pinned/unpinned pages are not regarded as dirty pages
>     b. only pages they reported are regarded as dirty pages and are to be
>     cleared by dirty page query and UNMAP ioctl.
> (3) for dirty pages marking APIs, like vfio_dma_rw()...
>     pages marked by them are regared as dirty and are to be cleared by
>     dirty page query and UNMAP ioctl
> 
> For (1), qemu VFIO only reports dirty page amount and would not transfer
> those pages until last round.

(1) and (3) can be implemented now. I don't think QEMU have support to 
delay certain marked pages dirty as of now. If QEMU queries dirty pages 
bitmap in pre-copy phase, all pinned pages reported as dirty will be 
transferred in pre-copy and again in stop-and-copy phase.

> for (2) and (3), qemu VFIO should report and transfer them in each
> round.
>

(2) can be added later

Thanks,
Kirti

> 
>> [I still worry whether migration will be usable with any
>> significant amount of system ram that's pinned in this way; the
>> downside will very easily get above the threshold that people like]
>>
> yes. that's why we have to do multi-round dirty page query and
> transfer and clear the dirty bitmaps in each round for devices that are
> able to track in fine grain.
> and that's why we have to report the amount of dirty pages before
> stop-and-copy phase for mdev devices, so that people are able to know
> the real downtime as much as possible.
> 
> Thanks
> Yan
> 


  reply index

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-19 20:16 [PATCH v15 Kernel 0/7] KABIs to support migration for VFIO devices Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 1/7] vfio: KABI for migration interface for device state Kirti Wankhede
2020-03-23 20:30   ` Auger Eric
2020-03-24 19:31     ` Kirti Wankhede
2020-03-26  9:33   ` Christoph Hellwig
2020-03-26 21:39     ` Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 2/7] vfio iommu: Remove atomicity of ref_count of pinned pages Kirti Wankhede
2020-03-23 20:30   ` Auger Eric
2020-03-24 19:34     ` Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 3/7] vfio iommu: Add ioctl definition for dirty pages tracking Kirti Wankhede
2020-03-23 21:11   ` Auger Eric
2020-03-24 19:49     ` Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 4/7] vfio iommu: Implementation of ioctl " Kirti Wankhede
2020-03-19 22:57   ` Alex Williamson
2020-03-20 17:49     ` Kirti Wankhede
2020-03-20 18:01       ` Alex Williamson
2020-03-20 18:42         ` Kirti Wankhede
2020-03-20 18:59           ` Alex Williamson
2020-03-23 17:54             ` Kirti Wankhede
2020-03-23 18:44               ` Alex Williamson
2020-03-23 18:51                 ` Dr. David Alan Gilbert
2020-03-24  3:01                   ` Yan Zhao
2020-03-24  9:45                     ` Kirti Wankhede [this message]
2020-03-24 14:36                     ` Alex Williamson
2020-03-24 20:23                       ` Dr. David Alan Gilbert
2020-03-25  5:31                         ` Tian, Kevin
2020-03-19 20:16 ` [PATCH v15 Kernel 5/7] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 6/7] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 7/7] vfio: Selective dirty page tracking if IOMMU backed device pins pages Kirti Wankhede

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d271acf4-7fd6-c93d-aa3d-399a7ace20cf@nvidia.com \
    --to=kwankhede@nvidia.com \
    --cc=Ken.Xue@amd.com \
    --cc=Zhengxiao.zx@alibaba-inc.com \
    --cc=aik@ozlabs.ru \
    --cc=alex.williamson@redhat.com \
    --cc=changpeng.liu@intel.com \
    --cc=cjia@nvidia.com \
    --cc=cohuck@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=eauger@redhat.com \
    --cc=eskultet@redhat.com \
    --cc=felipe@nutanix.com \
    --cc=jonathan.davies@nutanix.com \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=mlevitsk@redhat.com \
    --cc=pasic@linux.ibm.com \
    --cc=qemu-devel@nongnu.org \
    --cc=shuangtai.tst@alibaba-inc.com \
    --cc=yan.y.zhao@intel.com \
    --cc=yi.l.liu@intel.com \
    --cc=zhi.a.wang@intel.com \
    --cc=ziye.yang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

QEMU-Devel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/qemu-devel/0 qemu-devel/git/0.git
	git clone --mirror https://lore.kernel.org/qemu-devel/1 qemu-devel/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 qemu-devel qemu-devel/ https://lore.kernel.org/qemu-devel \
		qemu-devel@nongnu.org
	public-inbox-index qemu-devel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.nongnu.qemu-devel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git