From: "Tian, Kevin" <kevin.tian@intel.com>
To: Jason Wang <jasowang@redhat.com>
Cc: 'Alex Williamson' <alex.williamson@redhat.com>,
	"Zhao, Yan Y" <yan.y.zhao@intel.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>
Subject: Re: [Qemu-devel] vhost, iova, and dirty page tracking
Date: Wed, 18 Sep 2019 01:44:28 +0000
Message-ID: <AADFC41AFE54684AB9EE6CBC0274A5D19D57B1D1@SHSMSX104.ccr.corp.intel.com>
In-Reply-To: <8302a4ae-1914-3046-b3b5-b3234d7dda02@redhat.com>

> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Tuesday, September 17, 2019 6:36 PM
> 
> On 2019/9/17 4:48 PM, Tian, Kevin wrote:
> >> From: Jason Wang [mailto:jasowang@redhat.com]
> >> Sent: Monday, September 16, 2019 4:33 PM
> >>
> >>
> >> On 2019/9/16 9:51 AM, Tian, Kevin wrote:
> >>> Hi, Jason
> >>>
> >>> We had a discussion about dirty page tracking in VFIO, when vIOMMU
> >>> is enabled:
> >>>
> >>> https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg02690.html
> >>>
> >>> It's actually a similar model as vhost - Qemu cannot interpose the
> >>> fast-path DMAs and thus relies on the kernel part to track and
> >>> report dirty page information. Currently Qemu tracks dirty pages at
> >>> GFN level, thus demanding a translation from IOVA to GPA. The open
> >>> question in our discussion is where this translation should happen.
> >>> Doing the translation in the kernel implies a device-iotlb flavor,
> >>> which is what vhost implements today. It requires potentially large
> >>> tracking structures in the host kernel, but leverages the existing
> >>> log_sync flow in Qemu. On the other hand, Qemu may perform log_sync
> >>> for every removal of an IOVA mapping and then do the translation
> >>> itself, thereby avoiding GPA awareness on the kernel side. It needs
> >>> some change to the current Qemu log_sync flow, and may bring more
> >>> overhead if IOVA is frequently unmapped.
> >>>
> >>> So we'd like to hear your opinions, especially about how you came
> >>> down to the current iotlb approach for vhost.
> >>
> >> We didn't consider this much when introducing vhost. And before
> >> IOTLB, vhost already knew GPA through its mem table (GPA->HVA). So
> >> it was natural and easier to track dirty pages at the GPA level,
> >> which required no changes to the existing ABI.
> > This is the same situation as VFIO.
> >
> >> For the VFIO case, the only advantage of using GPA is that the log
> >> can then be shared among all the devices that belong to the VM.
> >> Otherwise syncing through IOVA is cleaner.
> > I still worry about the potential performance impact with this approach.
> > In the current mdev live migration series, multiple system calls are
> > involved when retrieving the dirty bitmap information for a given
> > memory range.
> 
> 
> I haven't taken a deep look at that series. Technically the dirty
> bitmap could be shared between device and driver, and then there's no
> system call in the synchronization path.

That series requires Qemu to tell the kernel about the queried region
(start, number, and page_size), read the information about the dirty
bitmap (offset, size), and then read the dirty bitmap itself. Although
the bitmap can be mmap'ed and thus shared, the earlier reads/writes are
conducted through pread/pwrite system calls. This design is fine for
the current log_dirty implementation, where the dirty bitmap is synced
in every pre-copy round. But doing it for every IOVA unmap is
definitely overkill.

> 
> 
> > IOVA mappings might be changed frequently. Though one may argue that
> > frequent IOVA changes already imply bad performance, it's still not
> > good to introduce further non-negligible overhead in such a situation.
> 
> 
> Yes, it depends on the behavior of the vIOMMU driver, e.g. the
> frequency and granularity of the flushing.
> 
> 
> >
> > On the other hand, I realized that adding IOVA awareness in VFIO is
> > actually easy. Today VFIO already maintains a full list of IOVAs and
> > their associated HVAs in vfio_dma structures, according to VFIO_MAP
> > and VFIO_UNMAP. As long as we allow the latter two operations to
> > accept another parameter (GPA), the IOVA->GPA mapping can be
> > naturally cached in the existing vfio_dma objects.
> 
> 
> Note that the HVA to GPA mapping is not a 1:1 mapping. One HVA range
> could be mapped to several GPA ranges.

This is fine. Currently vfio_dma maintains the IOVA->HVA mapping.

btw, under what condition is HVA->GPA not a 1:1 mapping? I hadn't
realized that.

> 
> 
> >   Those objects are always kept up-to-date according to the MAP and
> > UNMAP ioctls. Qemu then uniformly retrieves the VFIO dirty bitmap
> > for the entire GPA range in every pre-copy round, regardless of
> > whether vIOMMU is enabled. There is no need for another IOTLB
> > implementation; the main ask is a v2 MAP/UNMAP interface.
> 
> 
> Or provide GPA to HVA mapping as vhost did. But one question: I
> believe the device can only do dirty page logging through IOVA. So how
> do you handle the case where an IOVA is removed?
> 

That's what Alex had in mind: a log_sync is required each time an IOVA
is unmapped.

Thanks
Kevin

