KVM Archive on lore.kernel.org
 help / color / Atom feed
From: Kirti Wankhede <kwankhede@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>,
	Yan Zhao <yan.y.zhao@intel.com>
Cc: "cjia@nvidia.com" <cjia@nvidia.com>,
	"Tian, Kevin" <kevin.tian@intel.com>,
	"Yang, Ziye" <ziye.yang@intel.com>,
	"Liu, Changpeng" <changpeng.liu@intel.com>,
	"Liu, Yi L" <yi.l.liu@intel.com>,
	"mlevitsk@redhat.com" <mlevitsk@redhat.com>,
	"eskultet@redhat.com" <eskultet@redhat.com>,
	"cohuck@redhat.com" <cohuck@redhat.com>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>,
	"jonathan.davies@nutanix.com" <jonathan.davies@nutanix.com>,
	"eauger@redhat.com" <eauger@redhat.com>,
	"aik@ozlabs.ru" <aik@ozlabs.ru>,
	"pasic@linux.ibm.com" <pasic@linux.ibm.com>,
	"felipe@nutanix.com" <felipe@nutanix.com>,
	"Zhengxiao.zx@Alibaba-inc.com" <Zhengxiao.zx@Alibaba-inc.com>,
	"shuangtai.tst@alibaba-inc.com" <shuangtai.tst@alibaba-inc.com>,
	"Ken.Xue@amd.com" <Ken.Xue@amd.com>,
	"Wang, Zhi A" <zhi.a.wang@intel.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>
Subject: Re: [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap.
Date: Wed, 4 Dec 2019 23:40:25 +0530
Message-ID: <cce08ca5-79df-2839-16cd-15723b995c07@nvidia.com> (raw)
In-Reply-To: <20191203110412.055c38df@x1.home>



On 12/3/2019 11:34 PM, Alex Williamson wrote:
> On Mon, 25 Nov 2019 19:57:39 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
>> On Fri, Nov 15, 2019 at 05:06:25AM +0800, Alex Williamson wrote:
>>> On Fri, 15 Nov 2019 00:26:07 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> On 11/14/2019 1:37 AM, Alex Williamson wrote:
>>>>> On Thu, 14 Nov 2019 01:07:21 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>      
>>>>>> On 11/13/2019 4:00 AM, Alex Williamson wrote:
>>>>>>> On Tue, 12 Nov 2019 22:33:37 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>         
>>>>>>>> All pages pinned by vendor driver through vfio_pin_pages API should be
>>>>>>>> considered as dirty during migration. IOMMU container maintains a list of
>>>>>>>> all such pinned pages. Added an ioctl defination to get bitmap of such
>>>>>>>
>>>>>>> definition
>>>>>>>         
>>>>>>>> pinned pages for requested IO virtual address range.
>>>>>>>
>>>>>>> Additionally, all mapped pages are considered dirty when physically
>>>>>>> mapped through to an IOMMU, modulo we discussed devices opting in to
>>>>>>> per page pinning to indicate finer granularity with a TBD mechanism to
>>>>>>> figure out if any non-opt-in devices remain.
>>>>>>>         
>>>>>>
>>>>>> You mean, in case of device direct assignment (device pass through)?
>>>>>
>>>>> Yes, or IOMMU backed mdevs.  If vfio_dmas in the container are fully
>>>>> pinned and mapped, then the correct dirty page set is all mapped pages.
>>>>> We discussed using the vpfn list as a mechanism for vendor drivers to
>>>>> reduce their migration footprint, but we also discussed that we would
>>>>> need a way to determine that all participants in the container have
>>>>> explicitly pinned their working pages or else we must consider the
>>>>> entire potential working set as dirty.
>>>>>      
>>>>
>>>> How can vendor driver tell this capability to iommu module? Any suggestions?
>>>
>>> I think it does so by pinning pages.  Is it acceptable that if the
>>> vendor driver pins any pages, then from that point forward we consider
>>> the IOMMU group dirty page scope to be limited to pinned pages?  There
>> we should also be aware of that dirty page scope is pinned pages + unpinned pages,
>> which means ever since a page is pinned, it should be regarded as dirty
>> no matter whether it's unpinned later. only after log_sync is called and
>> dirty info retrieved, its dirty state should be cleared.
> 
> Yes, good point.  We can't just remove a vpfn when a page is unpinned
> or else we'd lose information that the page potentially had been
> dirtied while it was pinned.  Maybe that vpfn needs to move to a dirty
> list and both the currently pinned vpfns and the dirty vpfns are walked
> on a log_sync.  The dirty vpfns list would be cleared after a log_sync.
> The container would need to know that dirty tracking is enabled and
> only manage the dirty vpfns list when necessary.  Thanks,
> 

If page is unpinned, then that page is available in free page pool for 
others to use, then how can we say that unpinned page has valid data?

If suppose, one driver A unpins a page and when driver B of some other 
device gets that page and he pins it, uses it, and then unpins it, then 
how can we say that page has valid data for driver A?

Can you give one example where unpinned page data is considered reliable 
and valid?

Thanks,
Kirti

> Alex
>   
>>> are complications around non-singleton IOMMU groups, but I think we're
>>> already leaning towards that being a non-worthwhile problem to solve.
>>> So if we require that only singleton IOMMU groups can pin pages and we
>>> pass the IOMMU group as a parameter to
>>> vfio_iommu_driver_ops.pin_pages(), then the type1 backend can set a
>>> flag on its local vfio_group struct to indicate dirty page scope is
>>> limited to pinned pages.  We might want to keep a flag on the
>>> vfio_iommu struct to indicate if all of the vfio_groups for each
>>> vfio_domain in the vfio_iommu.domain_list dirty page scope limited to
>>> pinned pages as an optimization to avoid walking lists too often.  Then
>>> we could test if vfio_iommu.domain_list is not empty and this new flag
>>> does not limit the dirty page scope, then everything within each
>>> vfio_dma is considered dirty.
>>>     
>>>>>>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>>>>>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>>>>>>> ---
>>>>>>>>     include/uapi/linux/vfio.h | 23 +++++++++++++++++++++++
>>>>>>>>     1 file changed, 23 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>>>>> index 35b09427ad9f..6fd3822aa610 100644
>>>>>>>> --- a/include/uapi/linux/vfio.h
>>>>>>>> +++ b/include/uapi/linux/vfio.h
>>>>>>>> @@ -902,6 +902,29 @@ struct vfio_iommu_type1_dma_unmap {
>>>>>>>>     #define VFIO_IOMMU_ENABLE	_IO(VFIO_TYPE, VFIO_BASE + 15)
>>>>>>>>     #define VFIO_IOMMU_DISABLE	_IO(VFIO_TYPE, VFIO_BASE + 16)
>>>>>>>>     
>>>>>>>> +/**
>>>>>>>> + * VFIO_IOMMU_GET_DIRTY_BITMAP - _IOWR(VFIO_TYPE, VFIO_BASE + 17,
>>>>>>>> + *                                     struct vfio_iommu_type1_dirty_bitmap)
>>>>>>>> + *
>>>>>>>> + * IOCTL to get dirty pages bitmap for IOMMU container during migration.
>>>>>>>> + * Get dirty pages bitmap of given IO virtual addresses range using
>>>>>>>> + * struct vfio_iommu_type1_dirty_bitmap. Caller sets argsz, which is size of
>>>>>>>> + * struct vfio_iommu_type1_dirty_bitmap. User should allocate memory to get
>>>>>>>> + * bitmap and should set size of allocated memory in bitmap_size field.
>>>>>>>> + * One bit is used to represent per page consecutively starting from iova
>>>>>>>> + * offset. Bit set indicates page at that offset from iova is dirty.
>>>>>>>> + */
>>>>>>>> +struct vfio_iommu_type1_dirty_bitmap {
>>>>>>>> +	__u32        argsz;
>>>>>>>> +	__u32        flags;
>>>>>>>> +	__u64        iova;                      /* IO virtual address */
>>>>>>>> +	__u64        size;                      /* Size of iova range */
>>>>>>>> +	__u64        bitmap_size;               /* in bytes */
>>>>>>>
>>>>>>> This seems redundant.  We can calculate the size of the bitmap based on
>>>>>>> the iova size.
>>>>>>>        
>>>>>>
>>>>>> But in kernel space, we need to validate the size of memory allocated by
>>>>>> user instead of assuming user is always correct, right?
>>>>>
>>>>> What does it buy us for the user to tell us the size?  They could be
>>>>> wrong, they could be malicious.  The argsz field on the ioctl is mostly
>>>>> for the handshake that the user is competent, we should get faults from
>>>>> the copy-user operation if it's incorrect.
>>>>>     
>>>>
>>>> It is to mainly fail safe.
>>>>    
>>>>>>>> +	void __user *bitmap;                    /* one bit per page */
>>>>>>>
>>>>>>> Should we define that as a __u64* to (a) help with the size
>>>>>>> calculation, and (b) assure that we can use 8-byte ops on it?
>>>>>>>
>>>>>>> However, who defines page size?  Is it necessarily the processor page
>>>>>>> size?  A physical IOMMU may support page sizes other than the CPU page
>>>>>>> size.  It might be more important to indicate the expected page size
>>>>>>> than the bitmap size.  Thanks,
>>>>>>>        
>>>>>>
>>>>>> I see in QEMU and in vfio_iommu_type1 module, page sizes considered for
>>>>>> mapping are CPU page size, 4K. Do we still need to have such argument?
>>>>>
>>>>> That assumption exists for backwards compatibility prior to supporting
>>>>> the iova_pgsizes field in vfio_iommu_type1_info.  AFAIK the current
>>>>> interface has no page size assumptions and we should not add any.
>>>>
>>>> So userspace has iova_pgsizes information, which can be input to this
>>>> ioctl. Bitmap should be considering smallest page size. Does that makes
>>>> sense?
>>>
>>> I'm not sure.  I thought I had an argument that the iova_pgsize could
>>> indicate support for sizes smaller than the processor page size, which
>>> would make the user responsible for using a different base for their
>>> page size, but vfio_pgsize_bitmap() already masks out sub-page sizes.
>>> Clearly the vendor driver is pinning based on processor sized pages,
>>> but that's independent of an IOMMU and not part of a user ABI.
>>>
>>> I'm tempted to say your bitmap_size field has a use here, but it seems
>>> to fail in validating the user page size at the low extremes.  For
>>> example if we have a single page mapping, the user can specify the iova
>>> size as 4K (for example), but the minimum bitmap_size they can indicate
>>> is 1 byte, would we therefore assume the user's bitmap page size is 512
>>> bytes (ie. they provided us with 8 bits to describe a 4K range)?  We'd
>>> need to be careful to specify that the minimum iova_pgsize indicated
>>> page size is our lower bound as well.  But then what do we do if the
>>> user provides us with a smaller buffer than we expect?  For example, a
>>> 128MB iova range and only an 8-byte buffer.  Do we go ahead and assume
>>> a 2MB page size and fill the bitmap accordingly or do we generate an
>>> error?  If the latter, might we support that at some point in time and
>>> is it sufficient to let the user perform trial and error to test if that
>>> exists?  Thanks,
>>>
>>> Alex
>>>    
>>
> 

  reply index

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-11-12 17:03 [PATCH v9 Kernel 0/5] Add KABIs to support migration for VFIO devices Kirti Wankhede
2019-11-12 17:03 ` [PATCH v9 Kernel 1/5] vfio: KABI for migration interface for device state Kirti Wankhede
2019-11-12 22:30   ` Alex Williamson
2019-11-13  3:23     ` Yan Zhao
2019-11-13 19:02       ` Kirti Wankhede
2019-11-14  0:36         ` Yan Zhao
2019-11-14 18:55           ` Kirti Wankhede
2019-11-13 10:24     ` Cornelia Huck
2019-11-13 18:27       ` Alex Williamson
2019-11-13 19:29         ` Kirti Wankhede
2019-11-13 19:48           ` Alex Williamson
2019-11-13 20:17             ` Kirti Wankhede
2019-11-13 20:40               ` Alex Williamson
2019-11-14 18:49                 ` Kirti Wankhede
2019-11-12 17:03 ` [PATCH v9 Kernel 2/5] vfio iommu: Add ioctl defination to get dirty pages bitmap Kirti Wankhede
2019-11-12 22:30   ` Alex Williamson
2019-11-13 19:37     ` Kirti Wankhede
2019-11-13 20:07       ` Alex Williamson
2019-11-14 18:56         ` Kirti Wankhede
2019-11-14 21:06           ` Alex Williamson
2019-11-15  2:40             ` Yan Zhao
2019-11-15  3:21               ` Alex Williamson
2019-11-15  5:10                 ` Tian, Kevin
2019-11-19 23:16                   ` Alex Williamson
2019-11-20  1:04                     ` Tian, Kevin
2019-11-20  1:51                 ` Yan Zhao
2019-11-26  0:57             ` Yan Zhao
2019-12-03 18:04               ` Alex Williamson
2019-12-04 18:10                 ` Kirti Wankhede [this message]
2019-12-04 18:34                   ` Alex Williamson
2019-12-05  1:28                     ` Yan Zhao
2019-12-05  5:42                       ` Kirti Wankhede
2019-12-05  5:47                         ` Yan Zhao
2019-12-05  5:56                         ` Alex Williamson
2019-12-05  6:19                           ` Kirti Wankhede
2019-12-05  6:40                             ` Alex Williamson
2019-11-12 17:03 ` [PATCH v9 Kernel 3/5] vfio iommu: Add ioctl defination to unmap IOVA and return dirty bitmap Kirti Wankhede
2019-11-12 22:30   ` Alex Williamson
2019-11-13 19:52     ` Kirti Wankhede
2019-11-13 20:22       ` Alex Williamson
2019-11-14 18:56         ` Kirti Wankhede
2019-11-14 21:08           ` Alex Williamson
2019-11-12 17:03 ` [PATCH v9 Kernel 4/5] vfio iommu: Implementation of ioctl to get dirty pages bitmap Kirti Wankhede
2019-11-12 22:30   ` Alex Williamson
2019-11-12 17:03 ` [PATCH v9 Kernel 5/5] vfio iommu: Implementation of ioctl to get dirty bitmap before unmap Kirti Wankhede
2019-11-12 22:30   ` Alex Williamson

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cce08ca5-79df-2839-16cd-15723b995c07@nvidia.com \
    --to=kwankhede@nvidia.com \
    --cc=Ken.Xue@amd.com \
    --cc=Zhengxiao.zx@Alibaba-inc.com \
    --cc=aik@ozlabs.ru \
    --cc=alex.williamson@redhat.com \
    --cc=changpeng.liu@intel.com \
    --cc=cjia@nvidia.com \
    --cc=cohuck@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=eauger@redhat.com \
    --cc=eskultet@redhat.com \
    --cc=felipe@nutanix.com \
    --cc=jonathan.davies@nutanix.com \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=mlevitsk@redhat.com \
    --cc=pasic@linux.ibm.com \
    --cc=qemu-devel@nongnu.org \
    --cc=shuangtai.tst@alibaba-inc.com \
    --cc=yan.y.zhao@intel.com \
    --cc=yi.l.liu@intel.com \
    --cc=zhi.a.wang@intel.com \
    --cc=ziye.yang@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

KVM Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/kvm/0 kvm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 kvm kvm/ https://lore.kernel.org/kvm \
		kvm@vger.kernel.org
	public-inbox-index kvm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.kvm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git