From: Kirti Wankhede <kwankhede@nvidia.com>
To: Yan Zhao <yan.y.zhao@intel.com>,
"Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: "Zhengxiao.zx@alibaba-inc.com" <Zhengxiao.zx@alibaba-inc.com>,
"Tian, Kevin" <kevin.tian@intel.com>,
"Liu, Yi L" <yi.l.liu@intel.com>,
"cjia@nvidia.com" <cjia@nvidia.com>,
"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
"eskultet@redhat.com" <eskultet@redhat.com>,
"Yang, Ziye" <ziye.yang@intel.com>,
"cohuck@redhat.com" <cohuck@redhat.com>,
"shuangtai.tst@alibaba-inc.com" <shuangtai.tst@alibaba-inc.com>,
"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
"Wang, Zhi A" <zhi.a.wang@intel.com>,
"mlevitsk@redhat.com" <mlevitsk@redhat.com>,
"pasic@linux.ibm.com" <pasic@linux.ibm.com>,
"aik@ozlabs.ru" <aik@ozlabs.ru>,
Alex Williamson <alex.williamson@redhat.com>,
"eauger@redhat.com" <eauger@redhat.com>,
"felipe@nutanix.com" <felipe@nutanix.com>,
"jonathan.davies@nutanix.com" <jonathan.davies@nutanix.com>,
"Liu, Changpeng" <changpeng.liu@intel.com>,
"Ken.Xue@amd.com" <Ken.Xue@amd.com>
Subject: Re: [PATCH v15 Kernel 4/7] vfio iommu: Implementation of ioctl for dirty pages tracking.
Date: Tue, 24 Mar 2020 15:15:07 +0530 [thread overview]
Message-ID: <d271acf4-7fd6-c93d-aa3d-399a7ace20cf@nvidia.com> (raw)
In-Reply-To: <20200324030118.GD5456@joy-OptiPlex-7040>
On 3/24/2020 8:31 AM, Yan Zhao wrote:
> On Tue, Mar 24, 2020 at 02:51:14AM +0800, Dr. David Alan Gilbert wrote:
>> * Alex Williamson (alex.williamson@redhat.com) wrote:
>>> On Mon, 23 Mar 2020 23:24:37 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>
>>>> On 3/21/2020 12:29 AM, Alex Williamson wrote:
>>>>> On Sat, 21 Mar 2020 00:12:04 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>
>>>>>> On 3/20/2020 11:31 PM, Alex Williamson wrote:
>>>>>>> On Fri, 20 Mar 2020 23:19:14 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>
>>>>>>>> On 3/20/2020 4:27 AM, Alex Williamson wrote:
>>>>>>>>> On Fri, 20 Mar 2020 01:46:41 +0530
>>>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>>>
>>>>>>
>>>>>> <snip>
>>>>>>
>>>>>>>>>> +static int vfio_iova_dirty_bitmap(struct vfio_iommu *iommu, dma_addr_t iova,
>>>>>>>>>> + size_t size, uint64_t pgsize,
>>>>>>>>>> + u64 __user *bitmap)
>>>>>>>>>> +{
>>>>>>>>>> + struct vfio_dma *dma;
>>>>>>>>>> + unsigned long pgshift = __ffs(pgsize);
>>>>>>>>>> + unsigned int npages, bitmap_size;
>>>>>>>>>> +
>>>>>>>>>> + dma = vfio_find_dma(iommu, iova, 1);
>>>>>>>>>> +
>>>>>>>>>> + if (!dma)
>>>>>>>>>> + return -EINVAL;
>>>>>>>>>> +
>>>>>>>>>> + if (dma->iova != iova || dma->size != size)
>>>>>>>>>> + return -EINVAL;
>>>>>>>>>> +
>>>>>>>>>> + npages = dma->size >> pgshift;
>>>>>>>>>> + bitmap_size = DIRTY_BITMAP_BYTES(npages);
>>>>>>>>>> +
>>>>>>>>>> + /* mark all pages dirty if all pages are pinned and mapped. */
>>>>>>>>>> + if (dma->iommu_mapped)
>>>>>>>>>> + bitmap_set(dma->bitmap, 0, npages);
>>>>>>>>>> +
>>>>>>>>>> + if (copy_to_user((void __user *)bitmap, dma->bitmap, bitmap_size))
>>>>>>>>>> + return -EFAULT;
>>>>>>>>>
>>>>>>>>> We still need to reset the bitmap here, clearing and re-adding the
>>>>>>>>> pages that are still pinned.
>>>>>>>>>
>>>>>>>>> https://lore.kernel.org/kvm/20200319070635.2ff5db56@x1.home/
>>>>>>>>>
>>>>>>>>
>>>>>>>> I thought you agreed on my reply to it
>>>>>>>> https://lore.kernel.org/kvm/31621b70-02a9-2ea5-045f-f72b671fe703@nvidia.com/
>>>>>>>>
>>>>>>>> > Why re-populate when there will be no change since
>>>>>>>> > vfio_iova_dirty_bitmap() is called holding iommu->lock? If there is any
>>>>>>>> > pin request while vfio_iova_dirty_bitmap() is still working, it will
>>>>>>>> > wait till iommu->lock is released. Bitmap will be populated when page is
>>>>>>>> > pinned.
>>>>>>>
>>>>>>> As coded, dirty bits are only ever set in the bitmap, never cleared.
>>>>>>> If a page is unpinned between iterations of the user recording the
>>>>>>> dirty bitmap, it should be marked dirty in the iteration immediately
>>>>>>> after the unpinning and not marked dirty in the following iteration.
>>>>>>> That doesn't happen here. We're reporting cumulative dirty pages since
>>>>>>> logging was enabled, we need to be reporting dirty pages since the user
>>>>>>> last retrieved the dirty bitmap. The bitmap should be cleared and
>>>>>>> currently pinned pages re-added after copying to the user. Thanks,
>>>>>>>
>>>>>>
>>>>>> Does that mean, we have to track every iteration? do we really need that
>>>>>> tracking?
>>>>>>
>>>>>> Generally the flow is:
>>>>>> - vendor driver pin x pages
>>>>>> - Enter pre-copy-phase where vCPUs are running - user starts dirty pages
>>>>>> tracking, then user asks dirty bitmap, x pages reported dirty by
>>>>>> VFIO_IOMMU_DIRTY_PAGES ioctl with _GET flag
>>>>>> - In pre-copy phase, vendor driver pins y more pages, now bitmap
>>>>>> consists of x+y bits set
>>>>>> - In pre-copy phase, vendor driver unpins z pages, but bitmap is not
>>>>>> updated, so again bitmap consists of x+y bits set.
>>>>>> - Enter in stop-and-copy phase, vCPUs are stopped, mdev devices are stopped
>>>>>> - user asks dirty bitmap - Since here vCPU and mdev devices are stopped,
>>>>>> pages should not get dirty by guest driver or the physical device.
>>>>>> Hence, x+y dirty pages would be reported.
>>>>>>
>>>>>> I don't think we need to track every iteration of bitmap reporting.
>>>>>
>>>>> Yes, once a bitmap is read, it's reset. In your example, after
>>>>> unpinning z pages the user should still see a bitmap with x+y pages,
>>>>> but once they've read that bitmap, the next bitmap should be x+y-z.
>>>>> Userspace can make decisions about when to switch from pre-copy to
>>>>> stop-and-copy based on convergence, ie. the slope of the line recording
>>>>> dirty pages per iteration. The implementation here never allows an
>>>>> inflection point, dirty pages reported through vfio would always either
>>>>> be flat or climbing. There might also be a case that an iommu backed
>>>>> device could start pinning pages during the course of a migration, how
>>>>> would the bitmap ever revert from fully populated to only tracking the
>>>>> pinned pages? Thanks,
>>>>>
>>>>
>>>> At KVM forum we discussed this - if guest driver pins say 1024 pages
>>>> before migration starts, during pre-copy phase device can dirty 0 pages
>>>> in best case and 1024 pages in worst case. In that case, user will
>>>> transfer content of 1024 pages during pre-copy phase and in
>>>> stop-and-copy phase also, that will be pages will be copied twice. So we
>>>> decided to only get dirty pages bitmap at stop-and-copy phase. If user
>>>> is going to get dirty pages in stop-and-copy phase only, then that will
>>>> be single iteration.
>>>> There aren't any devices yet that can track sys memory dirty pages. So
>>>> we can go ahead with this patch and support for dirty pages tracking
>>>> during pre-copy phase can be added later when there will be consumers of
>>>> that functionality.
>>>
>>> So if I understand this right, you're expecting the dirty bitmap to
>>> accumulate dirty bits, in perpetuity, so that the user can only
>>> retrieve them once at the end of migration? But if that's the case,
>>> the user could simply choose to not retrieve the bitmap until the end
>>> of migration, the result would be the same. What we have here is that
>>> dirty bits are never cleared, regardless of whether the user has seen
>>> them, which is wrong. Sorry, we had a lot of discussions at KVM forum,
>>> I don't recall this specific one 5 months later and maybe we weren't
>>> considering all aspects. I see the behavior we have here as incorrect,
>>> but it also seems relatively trivial to make correct. I hope the QEMU
>>> code isn't making us go through all this trouble to report a dirty
>>> bitmap that gets thrown away because it expects the final one to be
>>> cumulative since the beginning of dirty logging. Thanks,
>>
>> I remember the discussion that we couldn't track the system memory
>> dirtying with current hardware; so the question then is just to track
> hi Dave
> there are already devices that are able to track the system memory,
> through two ways:
> (1) software method. like VFs for "Intel(R) Ethernet Controller XL710 Family
> support".
> (2) hardware method. through hardware internal buffer (as one Intel
> internal hardware not yet to public, but very soon) or through VTD-3.0
> IOMMU.
>
> we have already had code verified using the two ways to track system memory
> in fine-grained level.
>
>
>> what has been pinned and then ideally put that memory off until the end.
>> (Which is interesting because I don't think we currently have a way
>> to delay RAM pages till the end in qemu).
>
> I think the problem here is that we mixed pinned pages with dirty pages.
> yes, pinned pages for mdev devices are continuously likely to be dirty
> until device stopped.
> But for devices that are able to report dirty pages, dirtied pages
> will be marked again if hardware writes them later.
>
> So, is it good to introduce a capability to let vfio/qemu know how to
> treat the dirty pages?
> (1) for devices have no fine-grained dirty page tracking capability
> a. pinned pages are regarded as dirty pages. they are not cleared by
> dirty page query
> b. unpinned pages are regarded as dirty pages. they are cleared by
> dirty page query or UNMAP ioctl.
> (2) for devices that have fine-grained dirty page tracking capability
> a. pinned/unpinned pages are not regarded as dirty pages
> b. only pages they reported are regarded as dirty pages and are to be
> cleared by dirty page query and UNMAP ioctl.
> (3) for dirty pages marking APIs, like vfio_dma_rw()...
> pages marked by them are regared as dirty and are to be cleared by
> dirty page query and UNMAP ioctl
>
> For (1), qemu VFIO only reports dirty page amount and would not transfer
> those pages until last round.
(1) and (3) can be implemented now. I don't think QEMU have support to
delay certain marked pages dirty as of now. If QEMU queries dirty pages
bitmap in pre-copy phase, all pinned pages reported as dirty will be
transferred in pre-copy and again in stop-and-copy phase.
> for (2) and (3), qemu VFIO should report and transfer them in each
> round.
>
(2) can be added later
Thanks,
Kirti
>
>> [I still worry whether migration will be usable with any
>> significant amount of system ram that's pinned in this way; the
>> downside will very easily get above the threshold that people like]
>>
> yes. that's why we have to do multi-round dirty page query and
> transfer and clear the dirty bitmaps in each round for devices that are
> able to track in fine grain.
> and that's why we have to report the amount of dirty pages before
> stop-and-copy phase for mdev devices, so that people are able to know
> the real downtime as much as possible.
>
> Thanks
> Yan
>
next prev parent reply other threads:[~2020-03-24 9:46 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-03-19 20:16 [PATCH v15 Kernel 0/7] KABIs to support migration for VFIO devices Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 1/7] vfio: KABI for migration interface for device state Kirti Wankhede
2020-03-23 20:30 ` Auger Eric
2020-03-24 19:31 ` Kirti Wankhede
2020-03-26 9:33 ` Christoph Hellwig
2020-03-26 21:39 ` Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 2/7] vfio iommu: Remove atomicity of ref_count of pinned pages Kirti Wankhede
2020-03-23 20:30 ` Auger Eric
2020-03-24 19:34 ` Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 3/7] vfio iommu: Add ioctl definition for dirty pages tracking Kirti Wankhede
2020-03-23 21:11 ` Auger Eric
2020-03-24 19:49 ` Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 4/7] vfio iommu: Implementation of ioctl " Kirti Wankhede
2020-03-19 22:57 ` Alex Williamson
2020-03-20 17:49 ` Kirti Wankhede
2020-03-20 18:01 ` Alex Williamson
2020-03-20 18:42 ` Kirti Wankhede
2020-03-20 18:59 ` Alex Williamson
2020-03-23 17:54 ` Kirti Wankhede
2020-03-23 18:44 ` Alex Williamson
2020-03-23 18:51 ` Dr. David Alan Gilbert
2020-03-24 3:01 ` Yan Zhao
2020-03-24 9:45 ` Kirti Wankhede [this message]
2020-03-24 14:36 ` Alex Williamson
2020-03-24 20:23 ` Dr. David Alan Gilbert
2020-03-25 5:31 ` Tian, Kevin
2020-03-19 20:16 ` [PATCH v15 Kernel 5/7] vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 6/7] vfio iommu: Adds flag to indicate dirty pages tracking capability support Kirti Wankhede
2020-03-19 20:16 ` [PATCH v15 Kernel 7/7] vfio: Selective dirty page tracking if IOMMU backed device pins pages Kirti Wankhede
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d271acf4-7fd6-c93d-aa3d-399a7ace20cf@nvidia.com \
--to=kwankhede@nvidia.com \
--cc=Ken.Xue@amd.com \
--cc=Zhengxiao.zx@alibaba-inc.com \
--cc=aik@ozlabs.ru \
--cc=alex.williamson@redhat.com \
--cc=changpeng.liu@intel.com \
--cc=cjia@nvidia.com \
--cc=cohuck@redhat.com \
--cc=dgilbert@redhat.com \
--cc=eauger@redhat.com \
--cc=eskultet@redhat.com \
--cc=felipe@nutanix.com \
--cc=jonathan.davies@nutanix.com \
--cc=kevin.tian@intel.com \
--cc=kvm@vger.kernel.org \
--cc=mlevitsk@redhat.com \
--cc=pasic@linux.ibm.com \
--cc=qemu-devel@nongnu.org \
--cc=shuangtai.tst@alibaba-inc.com \
--cc=yan.y.zhao@intel.com \
--cc=yi.l.liu@intel.com \
--cc=zhi.a.wang@intel.com \
--cc=ziye.yang@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).