* [Qemu-devel] vhost, iova, and dirty page tracking
@ 2019-09-16  1:51 Tian, Kevin
  2019-09-16  8:33 ` Jason Wang
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2019-09-16  1:51 UTC (permalink / raw)
  To: Jason Wang
  Cc: Alex Williamson (alex.williamson@redhat.com), Zhao, Yan Y, qemu-devel

Hi, Jason

We had a discussion about dirty page tracking in VFIO, when vIOMMU
is enabled:

https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg02690.html

It's actually a similar model to vhost - Qemu cannot interpose the fast-path
DMAs and thus relies on the kernel side to track and report dirty page information.
Currently Qemu tracks dirty pages at GFN level, thus demanding a translation
from IOVA to GPA. The open question in our discussion is where this translation
should happen. Doing the translation in the kernel implies a device iotlb flavor,
which is what vhost implements today. It requires potentially large tracking
structures in the host kernel, but leverages the existing log_sync flow in Qemu.
On the other hand, Qemu may perform a log_sync for every removal of an IOVA
mapping and then do the translation itself, thus avoiding GPA awareness
on the kernel side. That needs some changes to the current Qemu log-sync flow, and
may bring more overhead if IOVA is frequently unmapped.
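
To make the second option concrete, here is a rough sketch (plain illustrative
C, not actual Qemu code; all names below are made up) of what "sync and
translate before unmap" means - Qemu already knows the IOVA->GPA relation it
programmed, so it only has to pull the dirty bits for an IOVA range right
before that range disappears:

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

struct iova_mapping {            /* what Qemu knows when it maps a range */
    uint64_t iova;
    uint64_t gpa;
    uint64_t size;
};

static bool test_bit64(const uint64_t *bm, uint64_t nr)
{
    return bm[nr / 64] & (1ULL << (nr % 64));
}

static void set_bit64(uint64_t *bm, uint64_t nr)
{
    bm[nr / 64] |= 1ULL << (nr % 64);
}

/*
 * iova_dirty: bitmap for m->iova..m->iova+m->size as reported by the kernel.
 * gpa_dirty:  Qemu's GPA-indexed migration dirty bitmap.
 */
static void sync_before_unmap(const struct iova_mapping *m,
                              const uint64_t *iova_dirty, uint64_t *gpa_dirty)
{
    for (uint64_t off = 0; off < m->size; off += PAGE_SIZE) {
        if (test_bit64(iova_dirty, off >> PAGE_SHIFT))
            set_bit64(gpa_dirty, (m->gpa + off) >> PAGE_SHIFT);
    }
    /* only after this translation is it safe to drop the IOVA mapping */
}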

So we'd like to hear your opinions, especially about how you settled
on the current iotlb approach for vhost.

p.s. Alex's comment is also copied here from the original thread.

> So vhost must then be configuring a listener across system memory
> rather than only against the device AddressSpace like we do in vfio,
> such that it gets log_sync() callbacks for the actual GPA space rather
> than only the IOVA space.  OTOH, QEMU could understand that the device
> AddressSpace has a translate function and apply the IOVA dirty bits to
> the system memory AddressSpace.  Wouldn't it make more sense for QEMU
> to perform a log_sync() prior to removing a MemoryRegionSection within
> an AddressSpace and update the GPA rather than pushing GPA awareness
> and potentially large tracking structures into the host kernel?

Thanks
Kevin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-16  1:51 [Qemu-devel] vhost, iova, and dirty page tracking Tian, Kevin
@ 2019-09-16  8:33 ` Jason Wang
  2019-09-17  8:48   ` Tian, Kevin
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2019-09-16  8:33 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Alex Williamson (alex.williamson@redhat.com), Zhao, Yan Y, qemu-devel


On 2019/9/16 9:51 AM, Tian, Kevin wrote:
> Hi, Jason
>
> We had a discussion about dirty page tracking in VFIO, when vIOMMU
> is enabled:
>
> https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg02690.html
>
> It's actually a similar model as vhost - Qemu cannot interpose the fast-path
> DMAs thus relies on the kernel part to track and report dirty page information.
> Currently Qemu tracks dirty pages in GFN level, thus demanding a translation
> from IOVA to GPA. Then the open in our discussion is where this translation
> should happen. Doing the translation in kernel implies a device iotlb flavor,
> which is what vhost implements today. It requires potentially large tracking
> structures in the host kernel, but leveraging the existing log_sync flow in Qemu.
> On the other hand, Qemu may perform log_sync for every removal of IOVA
> mapping and then do the translation itself, then avoiding the GPA awareness
> in the kernel side. It needs some change to current Qemu log-sync flow, and
> may bring more overhead if IOVA is frequently unmapped.
>
> So we'd like to hear about your opinions, especially about how you came
> down to the current iotlb approach for vhost.


We didn't consider this point much when introducing vhost. Before the
IOTLB support, vhost already knew GPAs through its mem table
(GPA->HVA), so it was natural and easier to track dirty pages at the GPA
level; that way it required no changes to the existing ABI.
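
Conceptually (a simplified sketch, not the actual vhost code), GPA-level
logging just means the dirty log is a bitmap indexed by GPA and shared with
userspace; since the addresses vhost saw in the ring were GPAs, marking a
page dirty was a direct bit set:

/* Illustration only, not real vhost code. Assumes 4K log granularity. */
#include <stdint.h>

#define LOG_PAGE_SHIFT 12

static void log_dirty_gpa(uint64_t *log, uint64_t gpa, uint64_t len)
{
    uint64_t first = gpa >> LOG_PAGE_SHIFT;
    uint64_t last = (gpa + len - 1) >> LOG_PAGE_SHIFT;

    for (uint64_t page = first; page <= last; page++)
        log[page / 64] |= 1ULL << (page % 64);
}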

For the VFIO case, the only advantage of using GPA is that the log can then
be shared among all the devices that belong to the VM. Otherwise
syncing through IOVA is cleaner.

Thanks

>
> p.s. Alex's comment is also copied here from original thread.
>
>> So vhost must then be configuring a listener across system memory
>> rather than only against the device AddressSpace like we do in vfio,
>> such that it gets log_sync() callbacks for the actual GPA space rather
>> than only the IOVA space.  OTOH, QEMU could understand that the device
>> AddressSpace has a translate function and apply the IOVA dirty bits to
>> the system memory AddressSpace.  Wouldn't it make more sense for QEMU
>> to perform a log_sync() prior to removing a MemoryRegionSection within
>> an AddressSpace and update the GPA rather than pushing GPA awareness
>> and potentially large tracking structures into the host kernel?
> Thanks
> Kevin
>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-16  8:33 ` Jason Wang
@ 2019-09-17  8:48   ` Tian, Kevin
  2019-09-17 10:36     ` Jason Wang
  2019-09-17 14:54     ` Alex Williamson
  0 siblings, 2 replies; 40+ messages in thread
From: Tian, Kevin @ 2019-09-17  8:48 UTC (permalink / raw)
  To: Jason Wang; +Cc: 'Alex Williamson', Zhao, Yan Y, qemu-devel

> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Monday, September 16, 2019 4:33 PM
> 
> 
> On 2019/9/16 上午9:51, Tian, Kevin wrote:
> > Hi, Jason
> >
> > We had a discussion about dirty page tracking in VFIO, when vIOMMU
> > is enabled:
> >
> > https://lists.nongnu.org/archive/html/qemu-devel/2019-
> 09/msg02690.html
> >
> > It's actually a similar model as vhost - Qemu cannot interpose the fast-
> path
> > DMAs thus relies on the kernel part to track and report dirty page
> information.
> > Currently Qemu tracks dirty pages in GFN level, thus demanding a
> translation
> > from IOVA to GPA. Then the open in our discussion is where this
> translation
> > should happen. Doing the translation in kernel implies a device iotlb
> flavor,
> > which is what vhost implements today. It requires potentially large
> tracking
> > structures in the host kernel, but leveraging the existing log_sync flow in
> Qemu.
> > On the other hand, Qemu may perform log_sync for every removal of
> IOVA
> > mapping and then do the translation itself, then avoiding the GPA
> awareness
> > in the kernel side. It needs some change to current Qemu log-sync flow,
> and
> > may bring more overhead if IOVA is frequently unmapped.
> >
> > So we'd like to hear about your opinions, especially about how you came
> > down to the current iotlb approach for vhost.
> 
> 
> We don't consider too much in the point when introducing vhost. And
> before IOTLB, vhost has already know GPA through its mem table
> (GPA->HVA). So it's nature and easier to track dirty pages at GPA level
> then it won't any changes in the existing ABI.

This is the same situation as VFIO.

> 
> For VFIO case, the only advantages of using GPA is that the log can then
> be shared among all the devices that belongs to the VM. Otherwise
> syncing through IOVA is cleaner.

I still worry about the potential performance impact of this approach.
In the current mdev live migration series, there are multiple system calls
involved when retrieving the dirty bitmap information for a given memory
range. IOVA mappings might be changed frequently. Though one may
argue that frequent IOVA changes already have bad performance, it's still
not good to introduce further non-negligible overhead in such a situation.

On the other hand, I realized that adding GPA awareness in VFIO is
actually easy. Today VFIO already maintains the full list of IOVAs and their
associated HVAs in vfio_dma structures, according to VFIO_MAP and
VFIO_UNMAP. As long as we allow the latter two operations to accept
another parameter (GPA), the IOVA->GPA mapping can be naturally cached
in the existing vfio_dma objects. Those objects are always updated by the
MAP and UNMAP ioctls, so they stay up-to-date. Qemu then uniformly
retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy
round, regardless of whether vIOMMU is enabled. There is no need for
another IOTLB implementation; the main ask is a v2 MAP/UNMAP
interface.
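
Something along these lines would be enough (a hypothetical sketch only;
the v2 structure, the GPA flag and the gpa field below do not exist in the
current UAPI):

/*
 * Hypothetical v2 of the DMA map argument. Today's
 * struct vfio_iommu_type1_dma_map carries only (vaddr, iova, size).
 */
#include <linux/types.h>

struct vfio_iommu_type1_dma_map_v2 {
	__u32 argsz;
	__u32 flags;
#define VFIO_DMA_MAP_FLAG_READ	(1 << 0)	/* existing */
#define VFIO_DMA_MAP_FLAG_WRITE	(1 << 1)	/* existing */
#define VFIO_DMA_MAP_FLAG_GPA	(1 << 2)	/* hypothetical: gpa is valid */
	__u64 vaddr;	/* HVA */
	__u64 iova;
	__u64 size;
	__u64 gpa;	/* hypothetical new field */
};

On the kernel side, vfio_dma (which already records iova, vaddr and size)
would simply grow a matching gpa field, filled in at MAP time and dropped
at UNMAP time.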

Alex, your thoughts?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-17  8:48   ` Tian, Kevin
@ 2019-09-17 10:36     ` Jason Wang
  2019-09-18  1:44       ` Tian, Kevin
  2019-09-17 14:54     ` Alex Williamson
  1 sibling, 1 reply; 40+ messages in thread
From: Jason Wang @ 2019-09-17 10:36 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: 'Alex Williamson', Zhao, Yan Y, qemu-devel


On 2019/9/17 4:48 PM, Tian, Kevin wrote:
>> From: Jason Wang [mailto:jasowang@redhat.com]
>> Sent: Monday, September 16, 2019 4:33 PM
>>
>>
>> On 2019/9/16 上午9:51, Tian, Kevin wrote:
>>> Hi, Jason
>>>
>>> We had a discussion about dirty page tracking in VFIO, when vIOMMU
>>> is enabled:
>>>
>>> https://lists.nongnu.org/archive/html/qemu-devel/2019-
>> 09/msg02690.html
>>> It's actually a similar model as vhost - Qemu cannot interpose the fast-
>> path
>>> DMAs thus relies on the kernel part to track and report dirty page
>> information.
>>> Currently Qemu tracks dirty pages in GFN level, thus demanding a
>> translation
>>> from IOVA to GPA. Then the open in our discussion is where this
>> translation
>>> should happen. Doing the translation in kernel implies a device iotlb
>> flavor,
>>> which is what vhost implements today. It requires potentially large
>> tracking
>>> structures in the host kernel, but leveraging the existing log_sync flow in
>> Qemu.
>>> On the other hand, Qemu may perform log_sync for every removal of
>> IOVA
>>> mapping and then do the translation itself, then avoiding the GPA
>> awareness
>>> in the kernel side. It needs some change to current Qemu log-sync flow,
>> and
>>> may bring more overhead if IOVA is frequently unmapped.
>>>
>>> So we'd like to hear about your opinions, especially about how you came
>>> down to the current iotlb approach for vhost.
>>
>> We don't consider too much in the point when introducing vhost. And
>> before IOTLB, vhost has already know GPA through its mem table
>> (GPA->HVA). So it's nature and easier to track dirty pages at GPA level
>> then it won't any changes in the existing ABI.
> This is the same situation as VFIO.
>
>> For VFIO case, the only advantages of using GPA is that the log can then
>> be shared among all the devices that belongs to the VM. Otherwise
>> syncing through IOVA is cleaner.
> I still worry about the potential performance impact with this approach.
> In current mdev live migration series, there are multiple system calls
> involved when retrieving the dirty bitmap information for a given memory
> range.


I haven't taken a deep look at that series. Technically, the dirty bitmap
could be shared between device and driver, and then there would be no system
call in the synchronization path.
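
One way to picture that (a purely hypothetical interface; the mmap offset
and per-bit semantics below are made up): if the bitmap lives in a region
that userspace can map, a sync is just reading memory, with no pread/pwrite
per call.

#include <stdint.h>
#include <sys/mman.h>

#define DIRTY_BITMAP_OFFSET 0x10000UL   /* assumption, not a real offset */

static uint64_t *map_dirty_bitmap(int device_fd, size_t bitmap_bytes)
{
    void *p = mmap(NULL, bitmap_bytes, PROT_READ | PROT_WRITE,
                   MAP_SHARED, device_fd, DIRTY_BITMAP_OFFSET);
    return p == MAP_FAILED ? NULL : p;
}

static int test_and_clear_dirty(uint64_t *bitmap, uint64_t pfn)
{
    uint64_t mask = 1ULL << (pfn % 64);
    int dirty = !!(bitmap[pfn / 64] & mask);

    /* a real implementation would need an atomic test-and-clear here */
    bitmap[pfn / 64] &= ~mask;
    return dirty;
}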


> IOVA mappings might be changed frequently. Though one may
> argue that frequent IOVA change already has bad performance, it's still
> not good to introduce further non-negligible overhead in such situation.


Yes, it depends on the behavior of the vIOMMU driver, e.g. the frequency and
granularity of the flushing.


>
> On the other hand, I realized that adding IOVA awareness in VFIO is
> actually easy. Today VFIO already maintains a full list of IOVA and its
> associated HVA in vfio_dma structure, according to VFIO_MAP and
> VFIO_UNMAP. As long as we allow the latter two operations to accept
> another parameter (GPA), IOVA->GPA mapping can be naturally cached
> in existing vfio_dma objects.


Note that the HVA to GPA mapping is not a 1:1 mapping. One HVA range
could be mapped to several GPA ranges.


>   Those objects are always updated according
> to MAP and UNMAP ioctls to be up-to-date. Qemu then uniformly
> retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy
> round, regardless of whether vIOMMU is enabled. There is no need of
> another IOTLB implementation, with the main ask on a v2 MAP/UNMAP
> interface.


Or provide a GPA to HVA mapping as vhost did. But one question: I believe
the device can only do dirty page logging through IOVA, so how do you handle
the case when that IOVA mapping is removed?

Thanks


>
> Alex, your thoughts?
>
> Thanks
> Kevin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-17  8:48   ` Tian, Kevin
  2019-09-17 10:36     ` Jason Wang
@ 2019-09-17 14:54     ` Alex Williamson
  2019-09-18  1:31       ` Tian, Kevin
       [not found]       ` <AADFC41AFE54684AB9EE6CBC0274A5D19D57AFB7@SHSMSX104.ccr.corp.intel.com>
  1 sibling, 2 replies; 40+ messages in thread
From: Alex Williamson @ 2019-09-17 14:54 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Jason Wang, Zhao, Yan Y, qemu-devel

On Tue, 17 Sep 2019 08:48:36 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jason Wang [mailto:jasowang@redhat.com]
> > Sent: Monday, September 16, 2019 4:33 PM
> > 
> > 
> > On 2019/9/16 上午9:51, Tian, Kevin wrote:  
> > > Hi, Jason
> > >
> > > We had a discussion about dirty page tracking in VFIO, when vIOMMU
> > > is enabled:
> > >
> > > https://lists.nongnu.org/archive/html/qemu-devel/2019-  
> > 09/msg02690.html  
> > >
> > > It's actually a similar model as vhost - Qemu cannot interpose the fast-  
> > path  
> > > DMAs thus relies on the kernel part to track and report dirty page  
> > information.  
> > > Currently Qemu tracks dirty pages in GFN level, thus demanding a  
> > translation  
> > > from IOVA to GPA. Then the open in our discussion is where this  
> > translation  
> > > should happen. Doing the translation in kernel implies a device iotlb  
> > flavor,  
> > > which is what vhost implements today. It requires potentially large  
> > tracking  
> > > structures in the host kernel, but leveraging the existing log_sync flow in  
> > Qemu.  
> > > On the other hand, Qemu may perform log_sync for every removal of  
> > IOVA  
> > > mapping and then do the translation itself, then avoiding the GPA  
> > awareness  
> > > in the kernel side. It needs some change to current Qemu log-sync flow,  
> > and  
> > > may bring more overhead if IOVA is frequently unmapped.
> > >
> > > So we'd like to hear about your opinions, especially about how you came
> > > down to the current iotlb approach for vhost.  
> > 
> > 
> > We don't consider too much in the point when introducing vhost. And
> > before IOTLB, vhost has already know GPA through its mem table
> > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level
> > then it won't any changes in the existing ABI.  
> 
> This is the same situation as VFIO.

It is?  VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA.  In
some cases IOVA is GPA, but not all.

> > For VFIO case, the only advantages of using GPA is that the log can then
> > be shared among all the devices that belongs to the VM. Otherwise
> > syncing through IOVA is cleaner.  
> 
> I still worry about the potential performance impact with this approach.
> In current mdev live migration series, there are multiple system calls 
> involved when retrieving the dirty bitmap information for a given memory
> range. IOVA mappings might be changed frequently. Though one may
> argue that frequent IOVA change already has bad performance, it's still
> not good to introduce further non-negligible overhead in such situation.
> 
> On the other hand, I realized that adding IOVA awareness in VFIO is
> actually easy. Today VFIO already maintains a full list of IOVA and its 
> associated HVA in vfio_dma structure, according to VFIO_MAP and 
> VFIO_UNMAP. As long as we allow the latter two operations to accept 
> another parameter (GPA), IOVA->GPA mapping can be naturally cached 
> in existing vfio_dma objects. Those objects are always updated according 
> to MAP and UNMAP ioctls to be up-to-date. Qemu then uniformly 
> retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy 
> round, regardless of whether vIOMMU is enabled. There is no need of 
> another IOTLB implementation, with the main ask on a v2 MAP/UNMAP 
> interface. 
> 
> Alex, your thoughts?

Same as last time, you're asking VFIO to be aware of an entirely new
address space and implement tracking structures of that address space
to make life easier for QEMU.  Don't we typically push such complexity
to userspace rather than into the kernel?  I'm not convinced.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-17 14:54     ` Alex Williamson
@ 2019-09-18  1:31       ` Tian, Kevin
  2019-09-18  6:03         ` Jason Wang
       [not found]       ` <AADFC41AFE54684AB9EE6CBC0274A5D19D57AFB7@SHSMSX104.ccr.corp.intel.com>
  1 sibling, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2019-09-18  1:31 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jason Wang, Zhao, Yan Y, qemu-devel

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Tuesday, September 17, 2019 10:54 PM
> 
> On Tue, 17 Sep 2019 08:48:36 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Jason Wang [mailto:jasowang@redhat.com]
> > > Sent: Monday, September 16, 2019 4:33 PM
> > >
> > >
> > > On 2019/9/16 上午9:51, Tian, Kevin wrote:
> > > > Hi, Jason
> > > >
> > > > We had a discussion about dirty page tracking in VFIO, when vIOMMU
> > > > is enabled:
> > > >
> > > > https://lists.nongnu.org/archive/html/qemu-devel/2019-
> > > 09/msg02690.html
> > > >
> > > > It's actually a similar model as vhost - Qemu cannot interpose the fast-
> > > path
> > > > DMAs thus relies on the kernel part to track and report dirty page
> > > information.
> > > > Currently Qemu tracks dirty pages in GFN level, thus demanding a
> > > translation
> > > > from IOVA to GPA. Then the open in our discussion is where this
> > > translation
> > > > should happen. Doing the translation in kernel implies a device iotlb
> > > flavor,
> > > > which is what vhost implements today. It requires potentially large
> > > tracking
> > > > structures in the host kernel, but leveraging the existing log_sync flow
> in
> > > Qemu.
> > > > On the other hand, Qemu may perform log_sync for every removal of
> > > IOVA
> > > > mapping and then do the translation itself, then avoiding the GPA
> > > awareness
> > > > in the kernel side. It needs some change to current Qemu log-sync
> flow,
> > > and
> > > > may bring more overhead if IOVA is frequently unmapped.
> > > >
> > > > So we'd like to hear about your opinions, especially about how you
> came
> > > > down to the current iotlb approach for vhost.
> > >
> > >
> > > We don't consider too much in the point when introducing vhost. And
> > > before IOTLB, vhost has already know GPA through its mem table
> > > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level
> > > then it won't any changes in the existing ABI.
> >
> > This is the same situation as VFIO.
> 
> It is?  VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA.  In
> some cases IOVA is GPA, but not all.

Well, I thought vhost had a similar design, where the index of its mem table
is a GPA when vIOMMU is off and becomes an IOVA when vIOMMU is on.
But I may be wrong here. Jason, can you help clarify? I saw two
interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA)
and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or together?

> 
> > > For VFIO case, the only advantages of using GPA is that the log can then
> > > be shared among all the devices that belongs to the VM. Otherwise
> > > syncing through IOVA is cleaner.
> >
> > I still worry about the potential performance impact with this approach.
> > In current mdev live migration series, there are multiple system calls
> > involved when retrieving the dirty bitmap information for a given memory
> > range. IOVA mappings might be changed frequently. Though one may
> > argue that frequent IOVA change already has bad performance, it's still
> > not good to introduce further non-negligible overhead in such situation.
> >
> > On the other hand, I realized that adding IOVA awareness in VFIO is
> > actually easy. Today VFIO already maintains a full list of IOVA and its
> > associated HVA in vfio_dma structure, according to VFIO_MAP and
> > VFIO_UNMAP. As long as we allow the latter two operations to accept
> > another parameter (GPA), IOVA->GPA mapping can be naturally cached
> > in existing vfio_dma objects. Those objects are always updated according
> > to MAP and UNMAP ioctls to be up-to-date. Qemu then uniformly
> > retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy
> > round, regardless of whether vIOMMU is enabled. There is no need of
> > another IOTLB implementation, with the main ask on a v2 MAP/UNMAP
> > interface.
> >
> > Alex, your thoughts?
> 
> Same as last time, you're asking VFIO to be aware of an entirely new
> address space and implement tracking structures of that address space
> to make life easier for QEMU.  Don't we typically push such complexity
> to userspace rather than into the kernel?  I'm not convinced.  Thanks,
> 

Is it really complex? There is no need for a new tracking structure; just allow
the MAP interface to carry a new parameter and record it in the
existing vfio_dma objects.

Note the frequency of guest DMA map/unmap can be very high. We
saw >100K invocations per second with a 40G NIC. To do the right
translation Qemu requires a log_sync for every unmap, before the
mapping for the logged dirty IOVA becomes stale. In Kirti's current patch,
each log_sync requires several system calls through the migration
info region, e.g. setting start_pfn/page_size/total_pfns and then reading
data_offset/data_size. That design is fine for doing log_sync in every
pre-copy round, but too costly if done for every IOVA unmap. If a
small extension in the kernel can lead to a great overhead reduction,
why not?
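
For reference, that per-sync sequence looks roughly like the following
(a sketch based only on the fields named above; the offsets into the
migration region are placeholders, not Kirti's actual layout). At >100K
unmaps per second, roughly five extra system calls per sync adds up to
more than 500K syscalls per second:

#include <stdint.h>
#include <unistd.h>

/* placeholder offsets into the device's migration region */
#define OFF_START_PFN   0x00
#define OFF_PAGE_SIZE   0x08
#define OFF_TOTAL_PFNS  0x10
#define OFF_DATA_OFFSET 0x18
#define OFF_DATA_SIZE   0x20

static ssize_t log_sync_range(int vfio_dev_fd, uint64_t region_off,
                              uint64_t start_pfn, uint64_t page_size,
                              uint64_t total_pfns, void *bitmap)
{
    uint64_t data_offset = 0, data_size = 0;

    /* tell the kernel which range we are asking about */
    pwrite(vfio_dev_fd, &start_pfn, 8, region_off + OFF_START_PFN);
    pwrite(vfio_dev_fd, &page_size, 8, region_off + OFF_PAGE_SIZE);
    pwrite(vfio_dev_fd, &total_pfns, 8, region_off + OFF_TOTAL_PFNS);

    /* learn where the bitmap for that range lives and how big it is */
    pread(vfio_dev_fd, &data_offset, 8, region_off + OFF_DATA_OFFSET);
    pread(vfio_dev_fd, &data_size, 8, region_off + OFF_DATA_SIZE);

    /* finally fetch the bitmap itself */
    return pread(vfio_dev_fd, bitmap, data_size, region_off + data_offset);
}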

Thanks
Kevin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-17 10:36     ` Jason Wang
@ 2019-09-18  1:44       ` Tian, Kevin
  2019-09-18  6:10         ` Jason Wang
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2019-09-18  1:44 UTC (permalink / raw)
  To: Jason Wang; +Cc: 'Alex Williamson', Zhao, Yan Y, qemu-devel

> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Tuesday, September 17, 2019 6:36 PM
> 
> On 2019/9/17 下午4:48, Tian, Kevin wrote:
> >> From: Jason Wang [mailto:jasowang@redhat.com]
> >> Sent: Monday, September 16, 2019 4:33 PM
> >>
> >>
> >> On 2019/9/16 上午9:51, Tian, Kevin wrote:
> >>> Hi, Jason
> >>>
> >>> We had a discussion about dirty page tracking in VFIO, when vIOMMU
> >>> is enabled:
> >>>
> >>> https://lists.nongnu.org/archive/html/qemu-devel/2019-
> >> 09/msg02690.html
> >>> It's actually a similar model as vhost - Qemu cannot interpose the fast-
> >> path
> >>> DMAs thus relies on the kernel part to track and report dirty page
> >> information.
> >>> Currently Qemu tracks dirty pages in GFN level, thus demanding a
> >> translation
> >>> from IOVA to GPA. Then the open in our discussion is where this
> >> translation
> >>> should happen. Doing the translation in kernel implies a device iotlb
> >> flavor,
> >>> which is what vhost implements today. It requires potentially large
> >> tracking
> >>> structures in the host kernel, but leveraging the existing log_sync flow
> in
> >> Qemu.
> >>> On the other hand, Qemu may perform log_sync for every removal of
> >> IOVA
> >>> mapping and then do the translation itself, then avoiding the GPA
> >> awareness
> >>> in the kernel side. It needs some change to current Qemu log-sync flow,
> >> and
> >>> may bring more overhead if IOVA is frequently unmapped.
> >>>
> >>> So we'd like to hear about your opinions, especially about how you
> came
> >>> down to the current iotlb approach for vhost.
> >>
> >> We don't consider too much in the point when introducing vhost. And
> >> before IOTLB, vhost has already know GPA through its mem table
> >> (GPA->HVA). So it's nature and easier to track dirty pages at GPA level
> >> then it won't any changes in the existing ABI.
> > This is the same situation as VFIO.
> >
> >> For VFIO case, the only advantages of using GPA is that the log can then
> >> be shared among all the devices that belongs to the VM. Otherwise
> >> syncing through IOVA is cleaner.
> > I still worry about the potential performance impact with this approach.
> > In current mdev live migration series, there are multiple system calls
> > involved when retrieving the dirty bitmap information for a given memory
> > range.
> 
> 
> I haven't took a deep look at that series. Technically dirty bitmap
> could be shared between device and driver, then there's no system call
> in synchronization.

That series requires Qemu to tell the kernel about the queried
region (start, number of pages, and page_size), read
the information about the dirty bitmap (offset, size) and then read
the dirty bitmap itself. Although the bitmap can be mmaped and thus shared,
the earlier reads/writes are conducted through pread/pwrite system calls.
This design is fine for the current log_dirty implementation, where the
dirty bitmap is synced in every pre-copy round. But doing it for
every IOVA unmap is definitely overkill.

> 
> 
> > IOVA mappings might be changed frequently. Though one may
> > argue that frequent IOVA change already has bad performance, it's still
> > not good to introduce further non-negligible overhead in such situation.
> 
> 
> Yes, it depends on the behavior of vIOMMU driver, e.g the frequency and
> granularity of the flushing.
> 
> 
> >
> > On the other hand, I realized that adding IOVA awareness in VFIO is
> > actually easy. Today VFIO already maintains a full list of IOVA and its
> > associated HVA in vfio_dma structure, according to VFIO_MAP and
> > VFIO_UNMAP. As long as we allow the latter two operations to accept
> > another parameter (GPA), IOVA->GPA mapping can be naturally cached
> > in existing vfio_dma objects.
> 
> 
> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA range
> could be mapped to several GPA ranges.

This is fine. Currently vfio_dma maintains the IOVA->HVA mapping.

btw, under what condition is HVA->GPA not a 1:1 mapping? I didn't realize
that could happen.

> 
> 
> >   Those objects are always updated according
> > to MAP and UNMAP ioctls to be up-to-date. Qemu then uniformly
> > retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy
> > round, regardless of whether vIOMMU is enabled. There is no need of
> > another IOTLB implementation, with the main ask on a v2 MAP/UNMAP
> > interface.
> 
> 
> Or provide GPA to HVA mapping as vhost did. But a question is, I believe
> device can only do dirty page logging through IOVA. So how do you handle
> the case when IOVA is removed in this case?
> 

That's why, in Alex's thinking, a log_sync is required each time an IOVA
mapping is removed.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
       [not found]       ` <AADFC41AFE54684AB9EE6CBC0274A5D19D57AFB7@SHSMSX104.ccr.corp.intel.com>
@ 2019-09-18  2:15         ` Tian, Kevin
  0 siblings, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2019-09-18  2:15 UTC (permalink / raw)
  To: 'Alex Williamson'; +Cc: Jason Wang, Zhao, Yan Y, qemu-devel

> From: Tian, Kevin
> Sent: Wednesday, September 18, 2019 9:32 AM
> 
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Tuesday, September 17, 2019 10:54 PM
> >
> > On Tue, 17 Sep 2019 08:48:36 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >
> > > > From: Jason Wang [mailto:jasowang@redhat.com]
> > > > Sent: Monday, September 16, 2019 4:33 PM
> > > >
> > > >
> > > > On 2019/9/16 上午9:51, Tian, Kevin wrote:
> > > > > Hi, Jason
> > > > >
> > > > > We had a discussion about dirty page tracking in VFIO, when
> vIOMMU
> > > > > is enabled:
> > > > >
> > > > > https://lists.nongnu.org/archive/html/qemu-devel/2019-
> > > > 09/msg02690.html
> > > > >
> > > > > It's actually a similar model as vhost - Qemu cannot interpose the
> fast-
> > > > path
> > > > > DMAs thus relies on the kernel part to track and report dirty page
> > > > information.
> > > > > Currently Qemu tracks dirty pages in GFN level, thus demanding a
> > > > translation
> > > > > from IOVA to GPA. Then the open in our discussion is where this
> > > > translation
> > > > > should happen. Doing the translation in kernel implies a device iotlb
> > > > flavor,
> > > > > which is what vhost implements today. It requires potentially large
> > > > tracking
> > > > > structures in the host kernel, but leveraging the existing log_sync
> flow
> > in
> > > > Qemu.
> > > > > On the other hand, Qemu may perform log_sync for every removal
> of
> > > > IOVA
> > > > > mapping and then do the translation itself, then avoiding the GPA
> > > > awareness
> > > > > in the kernel side. It needs some change to current Qemu log-sync
> > flow,
> > > > and
> > > > > may bring more overhead if IOVA is frequently unmapped.
> > > > >
> > > > > So we'd like to hear about your opinions, especially about how you
> > came
> > > > > down to the current iotlb approach for vhost.
> > > >
> > > >
> > > > We don't consider too much in the point when introducing vhost. And
> > > > before IOTLB, vhost has already know GPA through its mem table
> > > > (GPA->HVA). So it's nature and easier to track dirty pages at GPA level
> > > > then it won't any changes in the existing ABI.
> > >
> > > This is the same situation as VFIO.
> >
> > It is?  VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA.  In
> > some cases IOVA is GPA, but not all.
> 
> Well, I thought vhost has a similar design, that the index of its mem table
> is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is on.
> But I may be wrong here. Jason, can you help clarify? I saw two
> interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA)
> and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or
> together?
> 
> >
> > > > For VFIO case, the only advantages of using GPA is that the log can
> then
> > > > be shared among all the devices that belongs to the VM. Otherwise
> > > > syncing through IOVA is cleaner.
> > >
> > > I still worry about the potential performance impact with this approach.
> > > In current mdev live migration series, there are multiple system calls
> > > involved when retrieving the dirty bitmap information for a given
> memory
> > > range. IOVA mappings might be changed frequently. Though one may
> > > argue that frequent IOVA change already has bad performance, it's still
> > > not good to introduce further non-negligible overhead in such situation.
> > >
> > > On the other hand, I realized that adding IOVA awareness in VFIO is
> > > actually easy. Today VFIO already maintains a full list of IOVA and its
> > > associated HVA in vfio_dma structure, according to VFIO_MAP and
> > > VFIO_UNMAP. As long as we allow the latter two operations to accept
> > > another parameter (GPA), IOVA->GPA mapping can be naturally cached
> > > in existing vfio_dma objects. Those objects are always updated
> according
> > > to MAP and UNMAP ioctls to be up-to-date. Qemu then uniformly
> > > retrieves the VFIO dirty bitmap for the entire GPA range in every pre-
> copy
> > > round, regardless of whether vIOMMU is enabled. There is no need of
> > > another IOTLB implementation, with the main ask on a v2 MAP/UNMAP
> > > interface.
> > >
> > > Alex, your thoughts?
> >
> > Same as last time, you're asking VFIO to be aware of an entirely new
> > address space and implement tracking structures of that address space
> > to make life easier for QEMU.  Don't we typically push such complexity
> > to userspace rather than into the kernel?  I'm not convinced.  Thanks,
> >
> 
> Is it really complex? No need of a new tracking structure. Just allowing
> the MAP interface to carry a new parameter and then record it in the
> existing vfio_dma objects.
> 
> Note the frequency of guest DMA map/unmap could be very high. We
> saw >100K invocations per second with a 40G NIC. To do the right
> translation Qemu requires log_sync for every unmap, before the
> mapping for logged dirty IOVA becomes stale. In current Kirti's patch,
> each log_sync requires several system_calls through the migration
> info, e.g. setting start_pfn/page_size/total_pfns and then reading
> data_offset/data_size. That design is fine for doing log_sync in every
> pre-copy round, but too costly if doing so for every IOVA unmap. If
> small extension in kernel can lead to great overhead reduction,
> why not?
> 

There is another benefit of recording the GPA in VFIO. Vendor drivers (e.g.
GVT-g) may need to selectively write-protect guest memory pages
when interpreting certain workload descriptors. Those pages are
referenced by IOVA when vIOMMU is enabled, however the KVM
write-protection API only knows GPAs. So currently vIOMMU must
be disabled on Intel vGPUs when GVT-g is enabled. To make it work
we need a way to translate IOVA into GPA in the vendor drivers. There
are two options. One is having KVM export a new API for such
translations, but as you explained earlier it's not good to
have vendor drivers depend on KVM. The other is having VFIO
maintain such knowledge through the extended MAP interface,
then provide a uniform API for all vendor drivers to use.
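
Such a uniform API could be as simple as the following (hypothetical
sketch; neither the helper nor the cached gpa field exists in VFIO today),
assuming the MAP ioctl has recorded the GPA in each vfio_dma:

#include <stddef.h>
#include <stdint.h>

struct vfio_dma_entry {      /* stand-in for the kernel's vfio_dma */
    uint64_t iova;
    uint64_t size;
    uint64_t vaddr;          /* HVA, already recorded today */
    uint64_t gpa;            /* hypothetical, recorded by an extended MAP */
};

/* hypothetical helper for vendor drivers: translate one IOVA to a GPA */
static int vfio_iova_to_gpa(const struct vfio_dma_entry *dmas, size_t n,
                            uint64_t iova, uint64_t *gpa)
{
    for (size_t i = 0; i < n; i++) {
        if (iova >= dmas[i].iova && iova < dmas[i].iova + dmas[i].size) {
            *gpa = dmas[i].gpa + (iova - dmas[i].iova);
            return 0;
        }
    }
    return -1;   /* not mapped */
}

The vendor driver would then feed the returned GPA to the existing KVM
write-protection path, without needing a new KVM API for the translation
itself.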

Thanks
Kevin
 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-18  1:31       ` Tian, Kevin
@ 2019-09-18  6:03         ` Jason Wang
  2019-09-18  7:21           ` Tian, Kevin
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2019-09-18  6:03 UTC (permalink / raw)
  To: Tian, Kevin, Alex Williamson; +Cc: Zhao, Yan Y, qemu-devel


On 2019/9/18 9:31 AM, Tian, Kevin wrote:
>> From: Alex Williamson [mailto:alex.williamson@redhat.com]
>> Sent: Tuesday, September 17, 2019 10:54 PM
>>
>> On Tue, 17 Sep 2019 08:48:36 +0000
>> "Tian, Kevin"<kevin.tian@intel.com>  wrote:
>>
>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>> Sent: Monday, September 16, 2019 4:33 PM
>>>>
>>>>
>>>> On 2019/9/16 上午9:51, Tian, Kevin wrote:
>>>>> Hi, Jason
>>>>>
>>>>> We had a discussion about dirty page tracking in VFIO, when vIOMMU
>>>>> is enabled:
>>>>>
>>>>> https://lists.nongnu.org/archive/html/qemu-devel/2019-
>>>> 09/msg02690.html
>>>>> It's actually a similar model as vhost - Qemu cannot interpose the fast-
>>>> path
>>>>> DMAs thus relies on the kernel part to track and report dirty page
>>>> information.
>>>>> Currently Qemu tracks dirty pages in GFN level, thus demanding a
>>>> translation
>>>>> from IOVA to GPA. Then the open in our discussion is where this
>>>> translation
>>>>> should happen. Doing the translation in kernel implies a device iotlb
>>>> flavor,
>>>>> which is what vhost implements today. It requires potentially large
>>>> tracking
>>>>> structures in the host kernel, but leveraging the existing log_sync flow
>> in
>>>> Qemu.
>>>>> On the other hand, Qemu may perform log_sync for every removal of
>>>> IOVA
>>>>> mapping and then do the translation itself, then avoiding the GPA
>>>> awareness
>>>>> in the kernel side. It needs some change to current Qemu log-sync
>> flow,
>>>> and
>>>>> may bring more overhead if IOVA is frequently unmapped.
>>>>>
>>>>> So we'd like to hear about your opinions, especially about how you
>> came
>>>>> down to the current iotlb approach for vhost.
>>>> We don't consider too much in the point when introducing vhost. And
>>>> before IOTLB, vhost has already know GPA through its mem table
>>>> (GPA->HVA). So it's nature and easier to track dirty pages at GPA level
>>>> then it won't any changes in the existing ABI.
>>> This is the same situation as VFIO.
>> It is?  VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA.  In
>> some cases IOVA is GPA, but not all.
> Well, I thought vhost has a similar design, that the index of its mem table
> is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is on.
> But I may be wrong here. Jason, can you help clarify? I saw two
> interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA)
> and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or together?
>

Actually, vhost maintains two interval trees: the mem table (GPA->HVA) and
the device IOTLB (IOVA->HVA). The device IOTLB is only used when vIOMMU is
enabled, and in that case the mem table is used only when vhost needs to
track dirty pages (doing a reverse lookup of the mem table to get HVA->GPA).
So in conclusion, for the datapath they are used exclusively, but they need
to work together for logging dirty pages when the device IOTLB is enabled.
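
In other words, something like this (a simplified illustration, not the
actual vhost code): the datapath resolves IOVA->HVA through the device
IOTLB, and logging then walks the mem table in reverse. Since nothing
guarantees an HVA backs only one GPA range, every matching region is
logged:

#include <stdint.h>

struct gpa_region {            /* one mem table entry: GPA -> HVA */
    uint64_t gpa;
    uint64_t hva;
    uint64_t size;
};

static void set_log_bit(uint64_t *log, uint64_t gpa)
{
    uint64_t page = gpa >> 12;
    log[page / 64] |= 1ULL << (page % 64);
}

/* log one written HVA page against every GPA alias of it */
static void log_write_by_hva(uint64_t *log, const struct gpa_region *mem,
                             int nregions, uint64_t hva)
{
    for (int i = 0; i < nregions; i++) {
        if (hva >= mem[i].hva && hva < mem[i].hva + mem[i].size)
            set_log_bit(log, mem[i].gpa + (hva - mem[i].hva));
        /* no break: more than one GPA range may map this HVA */
    }
}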

Thanks



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-18  1:44       ` Tian, Kevin
@ 2019-09-18  6:10         ` Jason Wang
  2019-09-18  7:41           ` Tian, Kevin
  2019-09-18  8:37           ` Tian, Kevin
  0 siblings, 2 replies; 40+ messages in thread
From: Jason Wang @ 2019-09-18  6:10 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: 'Alex Williamson', Zhao, Yan Y, qemu-devel


On 2019/9/18 9:44 AM, Tian, Kevin wrote:
>> From: Jason Wang [mailto:jasowang@redhat.com]
>> Sent: Tuesday, September 17, 2019 6:36 PM
>>
>> On 2019/9/17 下午4:48, Tian, Kevin wrote:
>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>> Sent: Monday, September 16, 2019 4:33 PM
>>>>
>>>>
>>>> On 2019/9/16 上午9:51, Tian, Kevin wrote:
>>>>> Hi, Jason
>>>>>
>>>>> We had a discussion about dirty page tracking in VFIO, when vIOMMU
>>>>> is enabled:
>>>>>
>>>>> https://lists.nongnu.org/archive/html/qemu-devel/2019-
>>>> 09/msg02690.html
>>>>> It's actually a similar model as vhost - Qemu cannot interpose the fast-
>>>> path
>>>>> DMAs thus relies on the kernel part to track and report dirty page
>>>> information.
>>>>> Currently Qemu tracks dirty pages in GFN level, thus demanding a
>>>> translation
>>>>> from IOVA to GPA. Then the open in our discussion is where this
>>>> translation
>>>>> should happen. Doing the translation in kernel implies a device iotlb
>>>> flavor,
>>>>> which is what vhost implements today. It requires potentially large
>>>> tracking
>>>>> structures in the host kernel, but leveraging the existing log_sync flow
>> in
>>>> Qemu.
>>>>> On the other hand, Qemu may perform log_sync for every removal of
>>>> IOVA
>>>>> mapping and then do the translation itself, then avoiding the GPA
>>>> awareness
>>>>> in the kernel side. It needs some change to current Qemu log-sync flow,
>>>> and
>>>>> may bring more overhead if IOVA is frequently unmapped.
>>>>>
>>>>> So we'd like to hear about your opinions, especially about how you
>> came
>>>>> down to the current iotlb approach for vhost.
>>>> We don't consider too much in the point when introducing vhost. And
>>>> before IOTLB, vhost has already know GPA through its mem table
>>>> (GPA->HVA). So it's nature and easier to track dirty pages at GPA level
>>>> then it won't any changes in the existing ABI.
>>> This is the same situation as VFIO.
>>>
>>>> For VFIO case, the only advantages of using GPA is that the log can then
>>>> be shared among all the devices that belongs to the VM. Otherwise
>>>> syncing through IOVA is cleaner.
>>> I still worry about the potential performance impact with this approach.
>>> In current mdev live migration series, there are multiple system calls
>>> involved when retrieving the dirty bitmap information for a given memory
>>> range.
>>
>> I haven't took a deep look at that series. Technically dirty bitmap
>> could be shared between device and driver, then there's no system call
>> in synchronization.
> That series require Qemu to tell the kernel about the information
> about queried region (start, number, and page_size), read
> the information about the dirty bitmap (offset, size) and then read
> the dirty bitmap.


Any pointer to that series? I can only find "mdev live migration
support with vfio-mdev-pci" from Liu Yi, without the actual code.


> Although the bitmap can be mmaped thus shared,
> earlier reads/writes are conducted by pread/pwrite system calls.
> This design is fine for current log_dirty implementation, where
> dirty bitmap is synced in every pre-copy round. But to do it for
> every IOVA unmap, it's definitely over-killed.
>
>>
>>> IOVA mappings might be changed frequently. Though one may
>>> argue that frequent IOVA change already has bad performance, it's still
>>> not good to introduce further non-negligible overhead in such situation.
>>
>> Yes, it depends on the behavior of vIOMMU driver, e.g the frequency and
>> granularity of the flushing.
>>
>>
>>> On the other hand, I realized that adding IOVA awareness in VFIO is
>>> actually easy. Today VFIO already maintains a full list of IOVA and its
>>> associated HVA in vfio_dma structure, according to VFIO_MAP and
>>> VFIO_UNMAP. As long as we allow the latter two operations to accept
>>> another parameter (GPA), IOVA->GPA mapping can be naturally cached
>>> in existing vfio_dma objects.
>>
>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA range
>> could be mapped to several GPA ranges.
> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
>
> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.


I don't remember the details, e.g. memory region aliases? And neither kvm
nor the kvm API forbids this, if my memory is correct.


>
>>
>>>    Those objects are always updated according
>>> to MAP and UNMAP ioctls to be up-to-date. Qemu then uniformly
>>> retrieves the VFIO dirty bitmap for the entire GPA range in every pre-copy
>>> round, regardless of whether vIOMMU is enabled. There is no need of
>>> another IOTLB implementation, with the main ask on a v2 MAP/UNMAP
>>> interface.
>>
>> Or provide GPA to HVA mapping as vhost did. But a question is, I believe
>> device can only do dirty page logging through IOVA. So how do you handle
>> the case when IOVA is removed in this case?
>>
> That's why a log_sync is required each time when IOVA is unmapped, in
> Alex's thought.
>
> Thanks
> Kevin


Ok.

Thanks



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-18  6:03         ` Jason Wang
@ 2019-09-18  7:21           ` Tian, Kevin
  2019-09-19 17:20             ` Alex Williamson
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2019-09-18  7:21 UTC (permalink / raw)
  To: Jason Wang, Alex Williamson; +Cc: Zhao, Yan Y, qemu-devel

> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Wednesday, September 18, 2019 2:04 PM
> 
> On 2019/9/18 上午9:31, Tian, Kevin wrote:
> >> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> >> Sent: Tuesday, September 17, 2019 10:54 PM
> >>
> >> On Tue, 17 Sep 2019 08:48:36 +0000
> >> "Tian, Kevin"<kevin.tian@intel.com>  wrote:
> >>
> >>>> From: Jason Wang [mailto:jasowang@redhat.com]
> >>>> Sent: Monday, September 16, 2019 4:33 PM
> >>>>
> >>>>
> >>>> On 2019/9/16 上午9:51, Tian, Kevin wrote:
> >>>>> Hi, Jason
> >>>>>
> >>>>> We had a discussion about dirty page tracking in VFIO, when
> vIOMMU
> >>>>> is enabled:
> >>>>>
> >>>>> https://lists.nongnu.org/archive/html/qemu-devel/2019-
> >>>> 09/msg02690.html
> >>>>> It's actually a similar model as vhost - Qemu cannot interpose the
> fast-
> >>>> path
> >>>>> DMAs thus relies on the kernel part to track and report dirty page
> >>>> information.
> >>>>> Currently Qemu tracks dirty pages in GFN level, thus demanding a
> >>>> translation
> >>>>> from IOVA to GPA. Then the open in our discussion is where this
> >>>> translation
> >>>>> should happen. Doing the translation in kernel implies a device iotlb
> >>>> flavor,
> >>>>> which is what vhost implements today. It requires potentially large
> >>>> tracking
> >>>>> structures in the host kernel, but leveraging the existing log_sync
> flow
> >> in
> >>>> Qemu.
> >>>>> On the other hand, Qemu may perform log_sync for every removal
> of
> >>>> IOVA
> >>>>> mapping and then do the translation itself, then avoiding the GPA
> >>>> awareness
> >>>>> in the kernel side. It needs some change to current Qemu log-sync
> >> flow,
> >>>> and
> >>>>> may bring more overhead if IOVA is frequently unmapped.
> >>>>>
> >>>>> So we'd like to hear about your opinions, especially about how you
> >> came
> >>>>> down to the current iotlb approach for vhost.
> >>>> We don't consider too much in the point when introducing vhost. And
> >>>> before IOTLB, vhost has already know GPA through its mem table
> >>>> (GPA->HVA). So it's nature and easier to track dirty pages at GPA level
> >>>> then it won't any changes in the existing ABI.
> >>> This is the same situation as VFIO.
> >> It is?  VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA.  In
> >> some cases IOVA is GPA, but not all.
> > Well, I thought vhost has a similar design, that the index of its mem table
> > is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is on.
> > But I may be wrong here. Jason, can you help clarify? I saw two
> > interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA)
> > and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or
> together?
> >
> 
> Actually, vhost maintains two interval trees, mem table GPA->HVA, and
> device IOTLB IOVA->HVA. Device IOTLB is only used when vIOMMU is
> enabled, and in that case mem table is used only when vhost need to
> track dirty pages (do reverse lookup of memtable to get HVA->GPA). So in
> conclusion, for datapath, they are used exclusively, but they need
> cowork for logging dirty pages when device IOTLB is enabled.
> 

OK. Then it's different from the current VFIO design, which maintains only
one tree, indexed exclusively by either GPA or IOVA depending on
whether vIOMMU is in use.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-18  6:10         ` Jason Wang
@ 2019-09-18  7:41           ` Tian, Kevin
  2019-09-18  8:37           ` Tian, Kevin
  1 sibling, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2019-09-18  7:41 UTC (permalink / raw)
  To: Jason Wang; +Cc: 'Alex Williamson', Zhao, Yan Y, qemu-devel

> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Wednesday, September 18, 2019 2:10 PM
> 
> On 2019/9/18 上午9:44, Tian, Kevin wrote:
> >> From: Jason Wang [mailto:jasowang@redhat.com]
> >> Sent: Tuesday, September 17, 2019 6:36 PM
> >>
> >> On 2019/9/17 下午4:48, Tian, Kevin wrote:
> >>>> From: Jason Wang [mailto:jasowang@redhat.com]
> >>>> Sent: Monday, September 16, 2019 4:33 PM
> >>>>
> >>>>
> >>>> On 2019/9/16 上午9:51, Tian, Kevin wrote:
> >>>>> Hi, Jason
> >>>>>
> >>>>> We had a discussion about dirty page tracking in VFIO, when
> vIOMMU
> >>>>> is enabled:
> >>>>>
> >>>>> https://lists.nongnu.org/archive/html/qemu-devel/2019-
> >>>> 09/msg02690.html
> >>>>> It's actually a similar model as vhost - Qemu cannot interpose the
> fast-
> >>>> path
> >>>>> DMAs thus relies on the kernel part to track and report dirty page
> >>>> information.
> >>>>> Currently Qemu tracks dirty pages in GFN level, thus demanding a
> >>>> translation
> >>>>> from IOVA to GPA. Then the open in our discussion is where this
> >>>> translation
> >>>>> should happen. Doing the translation in kernel implies a device iotlb
> >>>> flavor,
> >>>>> which is what vhost implements today. It requires potentially large
> >>>> tracking
> >>>>> structures in the host kernel, but leveraging the existing log_sync
> flow
> >> in
> >>>> Qemu.
> >>>>> On the other hand, Qemu may perform log_sync for every removal
> of
> >>>> IOVA
> >>>>> mapping and then do the translation itself, then avoiding the GPA
> >>>> awareness
> >>>>> in the kernel side. It needs some change to current Qemu log-sync
> flow,
> >>>> and
> >>>>> may bring more overhead if IOVA is frequently unmapped.
> >>>>>
> >>>>> So we'd like to hear about your opinions, especially about how you
> >> came
> >>>>> down to the current iotlb approach for vhost.
> >>>> We don't consider too much in the point when introducing vhost. And
> >>>> before IOTLB, vhost has already know GPA through its mem table
> >>>> (GPA->HVA). So it's nature and easier to track dirty pages at GPA level
> >>>> then it won't any changes in the existing ABI.
> >>> This is the same situation as VFIO.
> >>>
> >>>> For VFIO case, the only advantages of using GPA is that the log can
> then
> >>>> be shared among all the devices that belongs to the VM. Otherwise
> >>>> syncing through IOVA is cleaner.
> >>> I still worry about the potential performance impact with this approach.
> >>> In current mdev live migration series, there are multiple system calls
> >>> involved when retrieving the dirty bitmap information for a given
> memory
> >>> range.
> >>
> >> I haven't took a deep look at that series. Technically dirty bitmap
> >> could be shared between device and driver, then there's no system call
> >> in synchronization.
> > That series require Qemu to tell the kernel about the information
> > about queried region (start, number, and page_size), read
> > the information about the dirty bitmap (offset, size) and then read
> > the dirty bitmap.
> 
> 
> Any pointer to that series, I can only find a "mdev live migration
> support with vfio-mdev-pci" from Liu Yi without actual codes.

https://lists.nongnu.org/archive/html/qemu-devel/2019-08/msg05543.html
It's interesting that I cannot find it through Google; I had to locate it
manually in the Qemu archive.

> 
> 
> > Although the bitmap can be mmaped thus shared,
> > earlier reads/writes are conducted by pread/pwrite system calls.
> > This design is fine for current log_dirty implementation, where
> > dirty bitmap is synced in every pre-copy round. But to do it for
> > every IOVA unmap, it's definitely over-killed.
> >
> >>
> >>> IOVA mappings might be changed frequently. Though one may
> >>> argue that frequent IOVA change already has bad performance, it's still
> >>> not good to introduce further non-negligible overhead in such situation.
> >>
> >> Yes, it depends on the behavior of vIOMMU driver, e.g the frequency
> and
> >> granularity of the flushing.
> >>
> >>
> >>> On the other hand, I realized that adding IOVA awareness in VFIO is
> >>> actually easy. Today VFIO already maintains a full list of IOVA and its
> >>> associated HVA in vfio_dma structure, according to VFIO_MAP and
> >>> VFIO_UNMAP. As long as we allow the latter two operations to accept
> >>> another parameter (GPA), IOVA->GPA mapping can be naturally cached
> >>> in existing vfio_dma objects.
> >>
> >> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
> range
> >> could be mapped to several GPA ranges.
> > This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
> >
> > btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
> 
> 
> I don't remember the details e.g memory region alias? And neither kvm
> nor kvm API does forbid this if my memory is correct.
> 

I did see such a comment in the vhost code (log_write_hva):

	/* More than one GPAs can be mapped into a single HVA. So
	 * iterate all possible umems here to be safe.
	 */

and it looks like it tries to log all possible GPAs that are mapped to the same
HVA, even when only one of them is actually mapped by the requested
IOVA (from the guest's p.o.v.). But I just cannot come up with a scenario where
this situation would be triggered. I once thought about KSM, but in that
case it's two HVAs mapped to the same PFN, thus irrelevant...

Thanks
Kevin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-18  6:10         ` Jason Wang
  2019-09-18  7:41           ` Tian, Kevin
@ 2019-09-18  8:37           ` Tian, Kevin
  2019-09-19  1:05             ` Jason Wang
  1 sibling, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2019-09-18  8:37 UTC (permalink / raw)
  To: Jason Wang; +Cc: 'Alex Williamson', Zhao, Yan Y, qemu-devel

> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Wednesday, September 18, 2019 2:10 PM
> 
> >>
> >> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
> range
> >> could be mapped to several GPA ranges.
> > This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
> >
> > btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
> 
> 
> I don't remember the details e.g memory region alias? And neither kvm
> nor kvm API does forbid this if my memory is correct.
> 

I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
provides an example of an aliased layout. However, its aliasing is all
1:1, not N:1. From the guest's p.o.v., every writable GPA implies a
unique location. Why would we hit a situation where multiple
writable GPAs are mapped to the same HVA (i.e. the same physical
memory location)? Is Qemu doing its own same-content memory
merging at the GPA level, similar to KSM?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-18  8:37           ` Tian, Kevin
@ 2019-09-19  1:05             ` Jason Wang
  2019-09-19  5:28               ` Yan Zhao
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2019-09-19  1:05 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: 'Alex Williamson', Zhao, Yan Y, qemu-devel


On 2019/9/18 4:37 PM, Tian, Kevin wrote:
>> From: Jason Wang [mailto:jasowang@redhat.com]
>> Sent: Wednesday, September 18, 2019 2:10 PM
>>
>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
>> range
>>>> could be mapped to several GPA ranges.
>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
>>>
>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
>>
>> I don't remember the details e.g memory region alias? And neither kvm
>> nor kvm API does forbid this if my memory is correct.
>>
> I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
> provides an example of aliased layout. However, its aliasing is all
> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
> unique location. Why would we hit the situation where multiple
> write-able GPAs are mapped to the same HVA (i.e. same physical
> memory location)?


I don't know; I just want to say that the current API does not forbid this, so
we probably need to take care of it.


> Is Qemu doing its own same-content memory
> merging in GPA level, similar to KSM?


AFAIK, it doesn't.

Thanks


> Thanks
> Kevin





^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19  1:05             ` Jason Wang
@ 2019-09-19  5:28               ` Yan Zhao
  2019-09-19  6:09                 ` Jason Wang
  0 siblings, 1 reply; 40+ messages in thread
From: Yan Zhao @ 2019-09-19  5:28 UTC (permalink / raw)
  To: Jason Wang; +Cc: Tian, Kevin, 'Alex Williamson', qemu-devel

On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
> 
> On 2019/9/18 下午4:37, Tian, Kevin wrote:
> >> From: Jason Wang [mailto:jasowang@redhat.com]
> >> Sent: Wednesday, September 18, 2019 2:10 PM
> >>
> >>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
> >> range
> >>>> could be mapped to several GPA ranges.
> >>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
> >>>
> >>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
> >>
> >> I don't remember the details e.g memory region alias? And neither kvm
> >> nor kvm API does forbid this if my memory is correct.
> >>
> > I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
> > provides an example of aliased layout. However, its aliasing is all
> > 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
> > unique location. Why would we hit the situation where multiple
> > write-able GPAs are mapped to the same HVA (i.e. same physical
> > memory location)?
> 
> 
> I don't know, just want to say current API does not forbid this. So we 
> probably need to take care it.
>
Yes, at the KVM API level it does not forbid two slots from having the same
HVA (slot->userspace_addr).
But
(1) there's only one kvm instance per VM per qemu process.
(2) every ramblock->host (which corresponds to the HVA and slot->userspace_addr) in one qemu
process is non-overlapping, as it's obtained from mmap().
(3) qemu ensures two kvm slots will not point to the same section of one ramblock.

So, as long as the kvm instance is not shared between two processes, and
there's no bug in qemu, we can be sure that HVA to GPA is 1:1.
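To make (1)-(3) concrete, each kvm slot is registered with an explicit
GPA->HVA pair, roughly as in the minimal sketch below (the helper name and
vm_fd are just for illustration; error handling is omitted):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* Register one slot: the GPA range [gpa, gpa + size) is backed by the
     * HVA range starting at hva.  Nothing in the API rejects two slots
     * that carry the same userspace_addr. */
    static int set_slot(int vm_fd, __u32 slot, __u64 gpa, __u64 hva, __u64 size)
    {
            struct kvm_userspace_memory_region region = {
                    .slot            = slot,
                    .flags           = 0,
                    .guest_phys_addr = gpa,
                    .memory_size     = size,
                    .userspace_addr  = hva,
            };

            return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
    }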

But even if there were two processes operating on the same kvm instance
and manipulating its memory slots, adding an extra GPA alongside the current
IOVA & HVA in the VFIO_IOMMU_MAP_DMA ioctl would still let the driver know the
right IOVA->GPA mapping, right?
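A minimal sketch of what such an extension could look like (the gpa field
and the VFIO_DMA_MAP_FLAG_GPA flag below are hypothetical; only
argsz/flags/vaddr/iova/size are today's uAPI):

    #include <linux/types.h>

    /* Hypothetical variant of struct vfio_iommu_type1_dma_map that also
     * carries the GPA backing the IOVA range, so the kernel could log
     * dirty pages in GPA terms. */
    struct vfio_iommu_type1_dma_map_gpa {
            __u32 argsz;
            __u32 flags;
    #define VFIO_DMA_MAP_FLAG_READ  (1 << 0)        /* existing */
    #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1)        /* existing */
    #define VFIO_DMA_MAP_FLAG_GPA   (1 << 2)        /* hypothetical */
            __u64 vaddr;    /* HVA */
            __u64 iova;     /* IOVA programmed through the vIOMMU */
            __u64 size;
            __u64 gpa;      /* hypothetical: GPA the IOVA translates to */
    };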

Thanks
Yan

> 
> > Is Qemu doing its own same-content memory
> > merging in GPA level, similar to KSM?
> 
> 
> AFAIK, it doesn't.
> 
> Thanks
> 
> 
> > Thanks
> > Kevin
> 
> 
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19  5:28               ` Yan Zhao
@ 2019-09-19  6:09                 ` Jason Wang
  2019-09-19  6:17                   ` Yan Zhao
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2019-09-19  6:09 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Tian, Kevin, 'Alex Williamson', Peter Xu, qemu-devel


On 2019/9/19 下午1:28, Yan Zhao wrote:
> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
>> On 2019/9/18 下午4:37, Tian, Kevin wrote:
>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>> Sent: Wednesday, September 18, 2019 2:10 PM
>>>>
>>>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
>>>> range
>>>>>> could be mapped to several GPA ranges.
>>>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
>>>>>
>>>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
>>>> I don't remember the details e.g memory region alias? And neither kvm
>>>> nor kvm API does forbid this if my memory is correct.
>>>>
>>> I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
>>> provides an example of aliased layout. However, its aliasing is all
>>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
>>> unique location. Why would we hit the situation where multiple
>>> write-able GPAs are mapped to the same HVA (i.e. same physical
>>> memory location)?
>>
>> I don't know, just want to say current API does not forbid this. So we
>> probably need to take care it.
>>
> yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr).
> But
> (1) there's only one kvm instance for each vm for each qemu process.
> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu
> process is non-overlapping as it's obtained from mmmap().
> (3) qemu ensures two kvm slots will not point to the same section of one ramblock.
>
> So, as long as kvm instance is not shared in two processes, and
> there's no bug in qemu, we can assure that HVA to GPA is 1:1.


Well, you expose this API to userspace, so you can't assume qemu is the
only user or rely on its behavior. If you wanted that assumption, you should
have enforced it at the API level instead of leaving the window open.


>
> But even if there are two processes operating on the same kvm instance
> and manipulating on memory slots, adding an extra GPA along side current
> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the
> right IOVA->GPA mapping, right?


It looks fragile. Consider an HVA that is mapped to both GPA1 and GPA2. The
guest maps the IOVA to GPA2, so we have IOVA/GPA2/HVA in the new ioctl and
then log through GPA2. If userspace tries to sync through GPA1, it will
miss the dirty page. So for safety we need to log both GPA1 and GPA2 (see
what has been done in log_write_hva() in vhost.c). The only way to do
that is to maintain an independent HVA to GPA mapping like what KVM or
vhost does.
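In the spirit of log_write_hva(), a simplified sketch of what "log both"
means: walk every GPA range covering the written HVA and set the dirty bit
for each. The region list and the flat bitmap are illustrative, not vhost's
actual data structures; 4KiB pages are assumed.

    #include <stdint.h>

    struct gpa_region {
            uint64_t gpa;   /* start of the GPA range */
            uint64_t hva;   /* start of the HVA range */
            uint64_t len;
    };

    /* Mark a written HVA dirty under *every* GPA aliased to it. */
    static void log_dirty_hva(uint64_t hva, const struct gpa_region *regions,
                              int nregions, uint64_t *dirty_bitmap)
    {
            for (int i = 0; i < nregions; i++) {
                    const struct gpa_region *r = &regions[i];

                    if (hva < r->hva || hva >= r->hva + r->len)
                            continue;
                    /* GPA1 and GPA2 may both cover this HVA; log each. */
                    uint64_t pfn = (r->gpa + (hva - r->hva)) >> 12;

                    dirty_bitmap[pfn / 64] |= 1ULL << (pfn % 64);
            }
    }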

Thanks


>
> Thanks
> Yan
>
>>> Is Qemu doing its own same-content memory
>>> merging in GPA level, similar to KSM?
>>
>> AFAIK, it doesn't.
>>
>> Thanks
>>
>>
>>> Thanks
>>> Kevin
>>
>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19  6:09                 ` Jason Wang
@ 2019-09-19  6:17                   ` Yan Zhao
  2019-09-19  6:32                     ` Jason Wang
  0 siblings, 1 reply; 40+ messages in thread
From: Yan Zhao @ 2019-09-19  6:17 UTC (permalink / raw)
  To: Jason Wang; +Cc: Tian, Kevin, 'Alex Williamson', Peter Xu, qemu-devel

On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
> 
> On 2019/9/19 下午1:28, Yan Zhao wrote:
> > On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
> >> On 2019/9/18 下午4:37, Tian, Kevin wrote:
> >>>> From: Jason Wang [mailto:jasowang@redhat.com]
> >>>> Sent: Wednesday, September 18, 2019 2:10 PM
> >>>>
> >>>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
> >>>> range
> >>>>>> could be mapped to several GPA ranges.
> >>>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
> >>>>>
> >>>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
> >>>> I don't remember the details e.g memory region alias? And neither kvm
> >>>> nor kvm API does forbid this if my memory is correct.
> >>>>
> >>> I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
> >>> provides an example of aliased layout. However, its aliasing is all
> >>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
> >>> unique location. Why would we hit the situation where multiple
> >>> write-able GPAs are mapped to the same HVA (i.e. same physical
> >>> memory location)?
> >>
> >> I don't know, just want to say current API does not forbid this. So we
> >> probably need to take care it.
> >>
> > yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr).
> > But
> > (1) there's only one kvm instance for each vm for each qemu process.
> > (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu
> > process is non-overlapping as it's obtained from mmmap().
> > (3) qemu ensures two kvm slots will not point to the same section of one ramblock.
> >
> > So, as long as kvm instance is not shared in two processes, and
> > there's no bug in qemu, we can assure that HVA to GPA is 1:1.
> 
> 
> Well, you leave this API for userspace, so you can't assume qemu is the 
> only user or any its behavior. If you had you should limit it in the API 
> level instead of open window for them.
> 
> 
> >
> > But even if there are two processes operating on the same kvm instance
> > and manipulating on memory slots, adding an extra GPA along side current
> > IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the
> > right IOVA->GPA mapping, right?
> 
> 
> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest 
> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then 
> log through GPA2. If userspace is trying to sync through GPA1, it will 
> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See 
> what has been done in log_write_hva() in vhost.c). The only way to do 
> that is to maintain an independent HVA to GPA mapping like what KVM or 
> vhost did.
> 
Why should GPA1 and GPA2 both be dirty?
Even if they have the same HVA due to overlapping virtual address spaces in
two processes, they still correspond to two physical pages.
I don't get your meaning :)

Thanks
Yan


> Thanks
> 
> 
> >
> > Thanks
> > Yan
> >
> >>> Is Qemu doing its own same-content memory
> >>> merging in GPA level, similar to KSM?
> >>
> >> AFAIK, it doesn't.
> >>
> >> Thanks
> >>
> >>
> >>> Thanks
> >>> Kevin
> >>
> >>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19  6:32                     ` Jason Wang
@ 2019-09-19  6:29                       ` Yan Zhao
  2019-09-19  6:32                         ` Yan Zhao
  2019-09-19 10:06                         ` Jason Wang
  2019-09-19  7:16                       ` Tian, Kevin
  1 sibling, 2 replies; 40+ messages in thread
From: Yan Zhao @ 2019-09-19  6:29 UTC (permalink / raw)
  To: Jason Wang; +Cc: Tian, Kevin, 'Alex Williamson', Peter Xu, qemu-devel

On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote:
> 
> On 2019/9/19 下午2:17, Yan Zhao wrote:
> > On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
> >> On 2019/9/19 下午1:28, Yan Zhao wrote:
> >>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
> >>>> On 2019/9/18 下午4:37, Tian, Kevin wrote:
> >>>>>> From: Jason Wang [mailto:jasowang@redhat.com]
> >>>>>> Sent: Wednesday, September 18, 2019 2:10 PM
> >>>>>>
> >>>>>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
> >>>>>> range
> >>>>>>>> could be mapped to several GPA ranges.
> >>>>>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
> >>>>>>>
> >>>>>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
> >>>>>> I don't remember the details e.g memory region alias? And neither kvm
> >>>>>> nor kvm API does forbid this if my memory is correct.
> >>>>>>
> >>>>> I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
> >>>>> provides an example of aliased layout. However, its aliasing is all
> >>>>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
> >>>>> unique location. Why would we hit the situation where multiple
> >>>>> write-able GPAs are mapped to the same HVA (i.e. same physical
> >>>>> memory location)?
> >>>> I don't know, just want to say current API does not forbid this. So we
> >>>> probably need to take care it.
> >>>>
> >>> yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr).
> >>> But
> >>> (1) there's only one kvm instance for each vm for each qemu process.
> >>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu
> >>> process is non-overlapping as it's obtained from mmmap().
> >>> (3) qemu ensures two kvm slots will not point to the same section of one ramblock.
> >>>
> >>> So, as long as kvm instance is not shared in two processes, and
> >>> there's no bug in qemu, we can assure that HVA to GPA is 1:1.
> >>
> >> Well, you leave this API for userspace, so you can't assume qemu is the
> >> only user or any its behavior. If you had you should limit it in the API
> >> level instead of open window for them.
> >>
> >>
> >>> But even if there are two processes operating on the same kvm instance
> >>> and manipulating on memory slots, adding an extra GPA along side current
> >>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the
> >>> right IOVA->GPA mapping, right?
> >>
> >> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest
> >> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then
> >> log through GPA2. If userspace is trying to sync through GPA1, it will
> >> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See
> >> what has been done in log_write_hva() in vhost.c). The only way to do
> >> that is to maintain an independent HVA to GPA mapping like what KVM or
> >> vhost did.
> >>
> > why GPA1 and GPA2 should be both dirty?
> > even they have the same HVA due to overlaping virtual address space in
> > two processes, they still correspond to two physical pages.
> > don't get what's your meaning :)
> 
> 
> The point is not leave any corner case that is hard to debug or fix in 
> the future.
> 
> Let's just start by a single process, the API allows userspace to maps 
> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, 
> it's ok to sync just through GPA1. That means if you only log GPA2, it 
> won't work.
>
In that case, we cannot log dirty pages according to the HPA,
because kvm cannot tell whether it's a valid case (the two GPAs are equivalent)
or an invalid case (the two GPAs are not equivalent, but have the same
HVA value).

Right?

Thanks
Yan


> Thanks
> 
> 
> >
> > Thanks
> > Yan
> >
> >
> >> Thanks
> >>
> >>
> >>> Thanks
> >>> Yan
> >>>
> >>>>> Is Qemu doing its own same-content memory
> >>>>> merging in GPA level, similar to KSM?
> >>>> AFAIK, it doesn't.
> >>>>
> >>>> Thanks
> >>>>
> >>>>
> >>>>> Thanks
> >>>>> Kevin
> >>>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19  6:17                   ` Yan Zhao
@ 2019-09-19  6:32                     ` Jason Wang
  2019-09-19  6:29                       ` Yan Zhao
  2019-09-19  7:16                       ` Tian, Kevin
  0 siblings, 2 replies; 40+ messages in thread
From: Jason Wang @ 2019-09-19  6:32 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Tian, Kevin, 'Alex Williamson', Peter Xu, qemu-devel


On 2019/9/19 下午2:17, Yan Zhao wrote:
> On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
>> On 2019/9/19 下午1:28, Yan Zhao wrote:
>>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
>>>> On 2019/9/18 下午4:37, Tian, Kevin wrote:
>>>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>>>> Sent: Wednesday, September 18, 2019 2:10 PM
>>>>>>
>>>>>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
>>>>>> range
>>>>>>>> could be mapped to several GPA ranges.
>>>>>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
>>>>>>>
>>>>>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
>>>>>> I don't remember the details e.g memory region alias? And neither kvm
>>>>>> nor kvm API does forbid this if my memory is correct.
>>>>>>
>>>>> I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
>>>>> provides an example of aliased layout. However, its aliasing is all
>>>>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
>>>>> unique location. Why would we hit the situation where multiple
>>>>> write-able GPAs are mapped to the same HVA (i.e. same physical
>>>>> memory location)?
>>>> I don't know, just want to say current API does not forbid this. So we
>>>> probably need to take care it.
>>>>
>>> yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr).
>>> But
>>> (1) there's only one kvm instance for each vm for each qemu process.
>>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu
>>> process is non-overlapping as it's obtained from mmmap().
>>> (3) qemu ensures two kvm slots will not point to the same section of one ramblock.
>>>
>>> So, as long as kvm instance is not shared in two processes, and
>>> there's no bug in qemu, we can assure that HVA to GPA is 1:1.
>>
>> Well, you leave this API for userspace, so you can't assume qemu is the
>> only user or any its behavior. If you had you should limit it in the API
>> level instead of open window for them.
>>
>>
>>> But even if there are two processes operating on the same kvm instance
>>> and manipulating on memory slots, adding an extra GPA along side current
>>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the
>>> right IOVA->GPA mapping, right?
>>
>> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest
>> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then
>> log through GPA2. If userspace is trying to sync through GPA1, it will
>> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See
>> what has been done in log_write_hva() in vhost.c). The only way to do
>> that is to maintain an independent HVA to GPA mapping like what KVM or
>> vhost did.
>>
> why GPA1 and GPA2 should be both dirty?
> even they have the same HVA due to overlaping virtual address space in
> two processes, they still correspond to two physical pages.
> don't get what's your meaning :)


The point is not to leave any corner case that is hard to debug or fix in
the future.

Let's just start with a single process: the API allows userspace to map an
HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent,
it's ok to sync just through GPA1. That means if you only log GPA2, it
won't work.
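Concretely (addresses made up for illustration):

    /*
     *   GPA1 0x000e0000 ----\
     *                        +--> HVA 0x7f1234560000 (one host page)
     *   GPA2 0xfffe0000 ----/
     *
     * The guest programs IOVA -> GPA2, so the kernel logs the write under
     * GPA2 only.  Userspace, treating the aliases as equivalent, syncs only
     * GPA1's bitmap, so the dirty page is never noticed and never resent.
     */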

Thanks


>
> Thanks
> Yan
>
>
>> Thanks
>>
>>
>>> Thanks
>>> Yan
>>>
>>>>> Is Qemu doing its own same-content memory
>>>>> merging in GPA level, similar to KSM?
>>>> AFAIK, it doesn't.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>> Thanks
>>>>> Kevin
>>>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19  6:29                       ` Yan Zhao
@ 2019-09-19  6:32                         ` Yan Zhao
  2019-09-19  9:35                           ` Jason Wang
  2019-09-19 10:06                         ` Jason Wang
  1 sibling, 1 reply; 40+ messages in thread
From: Yan Zhao @ 2019-09-19  6:32 UTC (permalink / raw)
  To: Jason Wang; +Cc: Tian, Kevin, 'Alex Williamson', Peter Xu, qemu-devel

On Thu, Sep 19, 2019 at 02:29:54PM +0800, Yan Zhao wrote:
> On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote:
> > 
> > On 2019/9/19 下午2:17, Yan Zhao wrote:
> > > On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
> > >> On 2019/9/19 下午1:28, Yan Zhao wrote:
> > >>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
> > >>>> On 2019/9/18 下午4:37, Tian, Kevin wrote:
> > >>>>>> From: Jason Wang [mailto:jasowang@redhat.com]
> > >>>>>> Sent: Wednesday, September 18, 2019 2:10 PM
> > >>>>>>
> > >>>>>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
> > >>>>>> range
> > >>>>>>>> could be mapped to several GPA ranges.
> > >>>>>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
> > >>>>>>>
> > >>>>>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
> > >>>>>> I don't remember the details e.g memory region alias? And neither kvm
> > >>>>>> nor kvm API does forbid this if my memory is correct.
> > >>>>>>
> > >>>>> I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
> > >>>>> provides an example of aliased layout. However, its aliasing is all
> > >>>>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
> > >>>>> unique location. Why would we hit the situation where multiple
> > >>>>> write-able GPAs are mapped to the same HVA (i.e. same physical
> > >>>>> memory location)?
> > >>>> I don't know, just want to say current API does not forbid this. So we
> > >>>> probably need to take care it.
> > >>>>
> > >>> yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr).
> > >>> But
> > >>> (1) there's only one kvm instance for each vm for each qemu process.
> > >>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu
> > >>> process is non-overlapping as it's obtained from mmmap().
> > >>> (3) qemu ensures two kvm slots will not point to the same section of one ramblock.
> > >>>
> > >>> So, as long as kvm instance is not shared in two processes, and
> > >>> there's no bug in qemu, we can assure that HVA to GPA is 1:1.
> > >>
> > >> Well, you leave this API for userspace, so you can't assume qemu is the
> > >> only user or any its behavior. If you had you should limit it in the API
> > >> level instead of open window for them.
> > >>
> > >>
> > >>> But even if there are two processes operating on the same kvm instance
> > >>> and manipulating on memory slots, adding an extra GPA along side current
> > >>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the
> > >>> right IOVA->GPA mapping, right?
> > >>
> > >> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest
> > >> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then
> > >> log through GPA2. If userspace is trying to sync through GPA1, it will
> > >> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See
> > >> what has been done in log_write_hva() in vhost.c). The only way to do
> > >> that is to maintain an independent HVA to GPA mapping like what KVM or
> > >> vhost did.
> > >>
> > > why GPA1 and GPA2 should be both dirty?
> > > even they have the same HVA due to overlaping virtual address space in
> > > two processes, they still correspond to two physical pages.
> > > don't get what's your meaning :)
> > 
> > 
> > The point is not leave any corner case that is hard to debug or fix in 
> > the future.
> > 
> > Let's just start by a single process, the API allows userspace to maps 
> > HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent, 
> > it's ok to sync just through GPA1. That means if you only log GPA2, it 
> > won't work.
> >
> In that case, cannot log dirty according to HPA.
sorry, it should be "cannot log dirty according to HVA".

> because kvm cannot tell whether it's an valid case (the two GPAs are equivalent)
> or an invalid case (the two GPAs are not equivalent, but with the same
> HVA value).
> 
> Right?
> 
> Thanks
> Yan
> 
> 
> > Thanks
> > 
> > 
> > >
> > > Thanks
> > > Yan
> > >
> > >
> > >> Thanks
> > >>
> > >>
> > >>> Thanks
> > >>> Yan
> > >>>
> > >>>>> Is Qemu doing its own same-content memory
> > >>>>> merging in GPA level, similar to KSM?
> > >>>> AFAIK, it doesn't.
> > >>>>
> > >>>> Thanks
> > >>>>
> > >>>>
> > >>>>> Thanks
> > >>>>> Kevin
> > >>>>
> 


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19  6:32                     ` Jason Wang
  2019-09-19  6:29                       ` Yan Zhao
@ 2019-09-19  7:16                       ` Tian, Kevin
  2019-09-19  9:37                         ` Jason Wang
  2019-09-19 11:14                         ` Paolo Bonzini
  1 sibling, 2 replies; 40+ messages in thread
From: Tian, Kevin @ 2019-09-19  7:16 UTC (permalink / raw)
  To: Jason Wang, Zhao, Yan Y
  Cc: Paolo Bonzini, 'Alex Williamson', qemu-devel

+Paolo to help clarify here.

> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Thursday, September 19, 2019 2:32 PM
> 
> 
> On 2019/9/19 下午2:17, Yan Zhao wrote:
> > On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
> >> On 2019/9/19 下午1:28, Yan Zhao wrote:
> >>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
> >>>> On 2019/9/18 下午4:37, Tian, Kevin wrote:
> >>>>>> From: Jason Wang [mailto:jasowang@redhat.com]
> >>>>>> Sent: Wednesday, September 18, 2019 2:10 PM
> >>>>>>
> >>>>>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One
> HVA
> >>>>>> range
> >>>>>>>> could be mapped to several GPA ranges.
> >>>>>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
> >>>>>>>
> >>>>>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't
> realize it.
> >>>>>> I don't remember the details e.g memory region alias? And neither
> kvm
> >>>>>> nor kvm API does forbid this if my memory is correct.
> >>>>>>
> >>>>> I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
> >>>>> provides an example of aliased layout. However, its aliasing is all
> >>>>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
> >>>>> unique location. Why would we hit the situation where multiple
> >>>>> write-able GPAs are mapped to the same HVA (i.e. same physical
> >>>>> memory location)?
> >>>> I don't know, just want to say current API does not forbid this. So we
> >>>> probably need to take care it.
> >>>>
> >>> yes, in KVM API level, it does not forbid two slots to have the same
> HVA(slot->userspace_addr).
> >>> But
> >>> (1) there's only one kvm instance for each vm for each qemu process.
> >>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr)
> in one qemu
> >>> process is non-overlapping as it's obtained from mmmap().
> >>> (3) qemu ensures two kvm slots will not point to the same section of
> one ramblock.
> >>>
> >>> So, as long as kvm instance is not shared in two processes, and
> >>> there's no bug in qemu, we can assure that HVA to GPA is 1:1.
> >>
> >> Well, you leave this API for userspace, so you can't assume qemu is the
> >> only user or any its behavior. If you had you should limit it in the API
> >> level instead of open window for them.
> >>
> >>
> >>> But even if there are two processes operating on the same kvm
> instance
> >>> and manipulating on memory slots, adding an extra GPA along side
> current
> >>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows
> the
> >>> right IOVA->GPA mapping, right?
> >>
> >> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2.
> Guest
> >> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and
> then
> >> log through GPA2. If userspace is trying to sync through GPA1, it will
> >> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See
> >> what has been done in log_write_hva() in vhost.c). The only way to do
> >> that is to maintain an independent HVA to GPA mapping like what KVM
> or
> >> vhost did.
> >>
> > why GPA1 and GPA2 should be both dirty?
> > even they have the same HVA due to overlaping virtual address space in
> > two processes, they still correspond to two physical pages.
> > don't get what's your meaning :)
> 
> 
> The point is not leave any corner case that is hard to debug or fix in
> the future.
> 
> Let's just start by a single process, the API allows userspace to maps
> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent,
> it's ok to sync just through GPA1. That means if you only log GPA2, it
> won't work.
> 

I noted KVM itself doesn't consider such a situation (one HVA mapped
to multiple GPAs) when doing its dirty page tracking. If you look at
kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which
contains the dirty gfn and then sets the dirty bit within that slot. It
doesn't attempt to walk all memslots to find any other GPA which
may be mapped to the same HVA.
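A user-space model of what that path ends up doing (this is a sketch, not
the kernel code itself; see virt/kvm/kvm_main.c for the real thing):

    #include <stdint.h>

    struct memslot {
            uint64_t base_gfn;
            uint64_t npages;
            uint64_t userspace_addr;    /* HVA; never consulted below */
            uint64_t *dirty_bitmap;
    };

    /* Resolve the dirty gfn to the single slot containing it and set the
     * bit there.  Other slots aliasing the same userspace_addr are never
     * walked, which is exactly the behaviour described above. */
    static void mark_page_dirty(struct memslot *slots, int nslots, uint64_t gfn)
    {
            for (int i = 0; i < nslots; i++) {
                    struct memslot *s = &slots[i];

                    if (gfn >= s->base_gfn && gfn < s->base_gfn + s->npages) {
                            uint64_t rel = gfn - s->base_gfn;

                            s->dirty_bitmap[rel / 64] |= 1ULL << (rel % 64);
                            return;
                    }
            }
    }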

So there must be some disconnect here. Let's hear from Paolo first and
understand the rationale behind such a situation.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19  6:32                         ` Yan Zhao
@ 2019-09-19  9:35                           ` Jason Wang
  2019-09-19  9:36                             ` Yan Zhao
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2019-09-19  9:35 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Tian, Kevin, 'Alex Williamson', Peter Xu, qemu-devel


On 2019/9/19 下午2:32, Yan Zhao wrote:
> On Thu, Sep 19, 2019 at 02:29:54PM +0800, Yan Zhao wrote:
>> On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote:
>>> On 2019/9/19 下午2:17, Yan Zhao wrote:
>>>> On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
>>>>> On 2019/9/19 下午1:28, Yan Zhao wrote:
>>>>>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
>>>>>>> On 2019/9/18 下午4:37, Tian, Kevin wrote:
>>>>>>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>>>>>>> Sent: Wednesday, September 18, 2019 2:10 PM
>>>>>>>>>
>>>>>>>>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
>>>>>>>>> range
>>>>>>>>>>> could be mapped to several GPA ranges.
>>>>>>>>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
>>>>>>>>>>
>>>>>>>>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
>>>>>>>>> I don't remember the details e.g memory region alias? And neither kvm
>>>>>>>>> nor kvm API does forbid this if my memory is correct.
>>>>>>>>>
>>>>>>>> I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
>>>>>>>> provides an example of aliased layout. However, its aliasing is all
>>>>>>>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
>>>>>>>> unique location. Why would we hit the situation where multiple
>>>>>>>> write-able GPAs are mapped to the same HVA (i.e. same physical
>>>>>>>> memory location)?
>>>>>>> I don't know, just want to say current API does not forbid this. So we
>>>>>>> probably need to take care it.
>>>>>>>
>>>>>> yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr).
>>>>>> But
>>>>>> (1) there's only one kvm instance for each vm for each qemu process.
>>>>>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu
>>>>>> process is non-overlapping as it's obtained from mmmap().
>>>>>> (3) qemu ensures two kvm slots will not point to the same section of one ramblock.
>>>>>>
>>>>>> So, as long as kvm instance is not shared in two processes, and
>>>>>> there's no bug in qemu, we can assure that HVA to GPA is 1:1.
>>>>> Well, you leave this API for userspace, so you can't assume qemu is the
>>>>> only user or any its behavior. If you had you should limit it in the API
>>>>> level instead of open window for them.
>>>>>
>>>>>
>>>>>> But even if there are two processes operating on the same kvm instance
>>>>>> and manipulating on memory slots, adding an extra GPA along side current
>>>>>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the
>>>>>> right IOVA->GPA mapping, right?
>>>>> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest
>>>>> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then
>>>>> log through GPA2. If userspace is trying to sync through GPA1, it will
>>>>> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See
>>>>> what has been done in log_write_hva() in vhost.c). The only way to do
>>>>> that is to maintain an independent HVA to GPA mapping like what KVM or
>>>>> vhost did.
>>>>>
>>>> why GPA1 and GPA2 should be both dirty?
>>>> even they have the same HVA due to overlaping virtual address space in
>>>> two processes, they still correspond to two physical pages.
>>>> don't get what's your meaning :)
>>>
>>> The point is not leave any corner case that is hard to debug or fix in
>>> the future.
>>>
>>> Let's just start by a single process, the API allows userspace to maps
>>> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent,
>>> it's ok to sync just through GPA1. That means if you only log GPA2, it
>>> won't work.
>>>
>> In that case, cannot log dirty according to HPA.
> sorry, it should be "cannot log dirty according to HVA".


I think we are discussing the choice between GPA and IOVA, not HVA?

Thanks


>
>> because kvm cannot tell whether it's an valid case (the two GPAs are equivalent)
>> or an invalid case (the two GPAs are not equivalent, but with the same
>> HVA value).
>>
>> Right?
>>
>> Thanks
>> Yan
>>
>>
>>> Thanks
>>>
>>>
>>>> Thanks
>>>> Yan
>>>>
>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>> Thanks
>>>>>> Yan
>>>>>>
>>>>>>>> Is Qemu doing its own same-content memory
>>>>>>>> merging in GPA level, similar to KSM?
>>>>>>> AFAIK, it doesn't.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Kevin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19  9:35                           ` Jason Wang
@ 2019-09-19  9:36                             ` Yan Zhao
  2019-09-19 10:08                               ` Jason Wang
  0 siblings, 1 reply; 40+ messages in thread
From: Yan Zhao @ 2019-09-19  9:36 UTC (permalink / raw)
  To: Jason Wang; +Cc: Tian, Kevin, 'Alex Williamson', Peter Xu, qemu-devel

On Thu, Sep 19, 2019 at 05:35:05PM +0800, Jason Wang wrote:
> 
> On 2019/9/19 下午2:32, Yan Zhao wrote:
> > On Thu, Sep 19, 2019 at 02:29:54PM +0800, Yan Zhao wrote:
> >> On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote:
> >>> On 2019/9/19 下午2:17, Yan Zhao wrote:
> >>>> On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
> >>>>> On 2019/9/19 下午1:28, Yan Zhao wrote:
> >>>>>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
> >>>>>>> On 2019/9/18 下午4:37, Tian, Kevin wrote:
> >>>>>>>>> From: Jason Wang [mailto:jasowang@redhat.com]
> >>>>>>>>> Sent: Wednesday, September 18, 2019 2:10 PM
> >>>>>>>>>
> >>>>>>>>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
> >>>>>>>>> range
> >>>>>>>>>>> could be mapped to several GPA ranges.
> >>>>>>>>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
> >>>>>>>>>>
> >>>>>>>>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
> >>>>>>>>> I don't remember the details e.g memory region alias? And neither kvm
> >>>>>>>>> nor kvm API does forbid this if my memory is correct.
> >>>>>>>>>
> >>>>>>>> I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
> >>>>>>>> provides an example of aliased layout. However, its aliasing is all
> >>>>>>>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
> >>>>>>>> unique location. Why would we hit the situation where multiple
> >>>>>>>> write-able GPAs are mapped to the same HVA (i.e. same physical
> >>>>>>>> memory location)?
> >>>>>>> I don't know, just want to say current API does not forbid this. So we
> >>>>>>> probably need to take care it.
> >>>>>>>
> >>>>>> yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr).
> >>>>>> But
> >>>>>> (1) there's only one kvm instance for each vm for each qemu process.
> >>>>>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu
> >>>>>> process is non-overlapping as it's obtained from mmmap().
> >>>>>> (3) qemu ensures two kvm slots will not point to the same section of one ramblock.
> >>>>>>
> >>>>>> So, as long as kvm instance is not shared in two processes, and
> >>>>>> there's no bug in qemu, we can assure that HVA to GPA is 1:1.
> >>>>> Well, you leave this API for userspace, so you can't assume qemu is the
> >>>>> only user or any its behavior. If you had you should limit it in the API
> >>>>> level instead of open window for them.
> >>>>>
> >>>>>
> >>>>>> But even if there are two processes operating on the same kvm instance
> >>>>>> and manipulating on memory slots, adding an extra GPA along side current
> >>>>>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the
> >>>>>> right IOVA->GPA mapping, right?
> >>>>> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest
> >>>>> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then
> >>>>> log through GPA2. If userspace is trying to sync through GPA1, it will
> >>>>> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See
> >>>>> what has been done in log_write_hva() in vhost.c). The only way to do
> >>>>> that is to maintain an independent HVA to GPA mapping like what KVM or
> >>>>> vhost did.
> >>>>>
> >>>> why GPA1 and GPA2 should be both dirty?
> >>>> even they have the same HVA due to overlaping virtual address space in
> >>>> two processes, they still correspond to two physical pages.
> >>>> don't get what's your meaning :)
> >>>
> >>> The point is not leave any corner case that is hard to debug or fix in
> >>> the future.
> >>>
> >>> Let's just start by a single process, the API allows userspace to maps
> >>> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent,
> >>> it's ok to sync just through GPA1. That means if you only log GPA2, it
> >>> won't work.
> >>>
> >> In that case, cannot log dirty according to HPA.
> > sorry, it should be "cannot log dirty according to HVA".
> 
> 
> I think we are discussing the choice between GPA and IOVA, not HVA?
>
Right. So why do we need to care about the HVA to GPA mapping?
As long as IOVA to GPA is 1:1, it's fine.

Thanks
Yan

> Thanks
> 
> 
> >
> >> because kvm cannot tell whether it's an valid case (the two GPAs are equivalent)
> >> or an invalid case (the two GPAs are not equivalent, but with the same
> >> HVA value).
> >>
> >> Right?
> >>
> >> Thanks
> >> Yan
> >>
> >>
> >>> Thanks
> >>>
> >>>
> >>>> Thanks
> >>>> Yan
> >>>>
> >>>>
> >>>>> Thanks
> >>>>>
> >>>>>
> >>>>>> Thanks
> >>>>>> Yan
> >>>>>>
> >>>>>>>> Is Qemu doing its own same-content memory
> >>>>>>>> merging in GPA level, similar to KSM?
> >>>>>>> AFAIK, it doesn't.
> >>>>>>>
> >>>>>>> Thanks
> >>>>>>>
> >>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Kevin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19  7:16                       ` Tian, Kevin
@ 2019-09-19  9:37                         ` Jason Wang
  2019-09-19 14:06                           ` Michael S. Tsirkin
  2019-09-19 11:14                         ` Paolo Bonzini
  1 sibling, 1 reply; 40+ messages in thread
From: Jason Wang @ 2019-09-19  9:37 UTC (permalink / raw)
  To: Tian, Kevin, Zhao, Yan Y
  Cc: Paolo Bonzini, 'Alex Williamson', qemu-devel, Michael S. Tsirkin


On 2019/9/19 下午3:16, Tian, Kevin wrote:
> +Paolo to help clarify here.
>
>> From: Jason Wang [mailto:jasowang@redhat.com]
>> Sent: Thursday, September 19, 2019 2:32 PM
>>
>>
>> On 2019/9/19 下午2:17, Yan Zhao wrote:
>>> On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
>>>> On 2019/9/19 下午1:28, Yan Zhao wrote:
>>>>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
>>>>>> On 2019/9/18 下午4:37, Tian, Kevin wrote:
>>>>>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>>>>>> Sent: Wednesday, September 18, 2019 2:10 PM
>>>>>>>>
>>>>>>>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One
>> HVA
>>>>>>>> range
>>>>>>>>>> could be mapped to several GPA ranges.
>>>>>>>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
>>>>>>>>>
>>>>>>>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't
>> realize it.
>>>>>>>> I don't remember the details e.g memory region alias? And neither
>> kvm
>>>>>>>> nor kvm API does forbid this if my memory is correct.
>>>>>>>>
>>>>>>> I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
>>>>>>> provides an example of aliased layout. However, its aliasing is all
>>>>>>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
>>>>>>> unique location. Why would we hit the situation where multiple
>>>>>>> write-able GPAs are mapped to the same HVA (i.e. same physical
>>>>>>> memory location)?
>>>>>> I don't know, just want to say current API does not forbid this. So we
>>>>>> probably need to take care it.
>>>>>>
>>>>> yes, in KVM API level, it does not forbid two slots to have the same
>> HVA(slot->userspace_addr).
>>>>> But
>>>>> (1) there's only one kvm instance for each vm for each qemu process.
>>>>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr)
>> in one qemu
>>>>> process is non-overlapping as it's obtained from mmmap().
>>>>> (3) qemu ensures two kvm slots will not point to the same section of
>> one ramblock.
>>>>> So, as long as kvm instance is not shared in two processes, and
>>>>> there's no bug in qemu, we can assure that HVA to GPA is 1:1.
>>>> Well, you leave this API for userspace, so you can't assume qemu is the
>>>> only user or any its behavior. If you had you should limit it in the API
>>>> level instead of open window for them.
>>>>
>>>>
>>>>> But even if there are two processes operating on the same kvm
>> instance
>>>>> and manipulating on memory slots, adding an extra GPA along side
>> current
>>>>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows
>> the
>>>>> right IOVA->GPA mapping, right?
>>>> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2.
>> Guest
>>>> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and
>> then
>>>> log through GPA2. If userspace is trying to sync through GPA1, it will
>>>> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See
>>>> what has been done in log_write_hva() in vhost.c). The only way to do
>>>> that is to maintain an independent HVA to GPA mapping like what KVM
>> or
>>>> vhost did.
>>>>
>>> why GPA1 and GPA2 should be both dirty?
>>> even they have the same HVA due to overlaping virtual address space in
>>> two processes, they still correspond to two physical pages.
>>> don't get what's your meaning :)
>>
>> The point is not leave any corner case that is hard to debug or fix in
>> the future.
>>
>> Let's just start by a single process, the API allows userspace to maps
>> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent,
>> it's ok to sync just through GPA1. That means if you only log GPA2, it
>> won't work.
>>
> I noted KVM itself doesn't consider such situation (one HVA is mapped
> to multiple GPAs), when doing its dirty page tracking. If you look at
> kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which
> contains the dirty gfn and then set the dirty bit within that slot. It
> doesn't attempt to walk all memslots to find out any other GPA which
> may be mapped to the same HVA.
>
> So there must be some disconnect here. let's hear from Paolo first and
> understand the rationale behind such situation.


Neither did vhost when IOTLB is disabled. And cc Michael, who pointed out
this issue at the beginning.

Thanks


>
> Thanks
> Kevin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19  6:29                       ` Yan Zhao
  2019-09-19  6:32                         ` Yan Zhao
@ 2019-09-19 10:06                         ` Jason Wang
  2019-09-19 10:16                           ` Yan Zhao
  1 sibling, 1 reply; 40+ messages in thread
From: Jason Wang @ 2019-09-19 10:06 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Tian, Kevin, 'Alex Williamson', Peter Xu, qemu-devel


On 2019/9/19 下午2:29, Yan Zhao wrote:
> On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote:
>> On 2019/9/19 下午2:17, Yan Zhao wrote:
>>> On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
>>>> On 2019/9/19 下午1:28, Yan Zhao wrote:
>>>>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
>>>>>> On 2019/9/18 下午4:37, Tian, Kevin wrote:
>>>>>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>>>>>> Sent: Wednesday, September 18, 2019 2:10 PM
>>>>>>>>
>>>>>>>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
>>>>>>>> range
>>>>>>>>>> could be mapped to several GPA ranges.
>>>>>>>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
>>>>>>>>>
>>>>>>>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
>>>>>>>> I don't remember the details e.g memory region alias? And neither kvm
>>>>>>>> nor kvm API does forbid this if my memory is correct.
>>>>>>>>
>>>>>>> I checkedhttps://qemu.weilnetz.de/doc/devel/memory.html, which
>>>>>>> provides an example of aliased layout. However, its aliasing is all
>>>>>>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
>>>>>>> unique location. Why would we hit the situation where multiple
>>>>>>> write-able GPAs are mapped to the same HVA (i.e. same physical
>>>>>>> memory location)?
>>>>>> I don't know, just want to say current API does not forbid this. So we
>>>>>> probably need to take care it.
>>>>>>
>>>>> yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr).
>>>>> But
>>>>> (1) there's only one kvm instance for each vm for each qemu process.
>>>>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu
>>>>> process is non-overlapping as it's obtained from mmmap().
>>>>> (3) qemu ensures two kvm slots will not point to the same section of one ramblock.
>>>>>
>>>>> So, as long as kvm instance is not shared in two processes, and
>>>>> there's no bug in qemu, we can assure that HVA to GPA is 1:1.
>>>> Well, you leave this API for userspace, so you can't assume qemu is the
>>>> only user or any its behavior. If you had you should limit it in the API
>>>> level instead of open window for them.
>>>>
>>>>
>>>>> But even if there are two processes operating on the same kvm instance
>>>>> and manipulating on memory slots, adding an extra GPA along side current
>>>>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the
>>>>> right IOVA->GPA mapping, right?
>>>> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest
>>>> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then
>>>> log through GPA2. If userspace is trying to sync through GPA1, it will
>>>> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See
>>>> what has been done in log_write_hva() in vhost.c). The only way to do
>>>> that is to maintain an independent HVA to GPA mapping like what KVM or
>>>> vhost did.
>>>>
>>> why GPA1 and GPA2 should be both dirty?
>>> even they have the same HVA due to overlaping virtual address space in
>>> two processes, they still correspond to two physical pages.
>>> don't get what's your meaning:)
>> The point is not leave any corner case that is hard to debug or fix in
>> the future.
>>
>> Let's just start by a single process, the API allows userspace to maps
>> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent,
>> it's ok to sync just through GPA1. That means if you only log GPA2, it
>> won't work.
>>
> In that case, cannot log dirty according to HPA.
> because kvm cannot tell whether it's an valid case (the two GPAs are equivalent)
> or an invalid case (the two GPAs are not equivalent, but with the same
> HVA value).
>
> Right?


There's no need for any examination of whether it was 'valid' or not. It's as
simple as logging both GPA1 and GPA2. Then you won't need to care about any
corner case.

Thanks


>
> Thanks
> Yan
>
>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19  9:36                             ` Yan Zhao
@ 2019-09-19 10:08                               ` Jason Wang
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Wang @ 2019-09-19 10:08 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Tian, Kevin, 'Alex Williamson', Peter Xu, qemu-devel


On 2019/9/19 下午5:36, Yan Zhao wrote:
> On Thu, Sep 19, 2019 at 05:35:05PM +0800, Jason Wang wrote:
>> On 2019/9/19 下午2:32, Yan Zhao wrote:
>>> On Thu, Sep 19, 2019 at 02:29:54PM +0800, Yan Zhao wrote:
>>>> On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote:
>>>>> On 2019/9/19 下午2:17, Yan Zhao wrote:
>>>>>> On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
>>>>>>> On 2019/9/19 下午1:28, Yan Zhao wrote:
>>>>>>>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
>>>>>>>>> On 2019/9/18 下午4:37, Tian, Kevin wrote:
>>>>>>>>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>>>>>>>>> Sent: Wednesday, September 18, 2019 2:10 PM
>>>>>>>>>>>
>>>>>>>>>>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
>>>>>>>>>>> range
>>>>>>>>>>>>> could be mapped to several GPA ranges.
>>>>>>>>>>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
>>>>>>>>>>>>
>>>>>>>>>>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
>>>>>>>>>>> I don't remember the details e.g memory region alias? And neither kvm
>>>>>>>>>>> nor kvm API does forbid this if my memory is correct.
>>>>>>>>>>>
>>>>>>>>>> I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
>>>>>>>>>> provides an example of aliased layout. However, its aliasing is all
>>>>>>>>>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
>>>>>>>>>> unique location. Why would we hit the situation where multiple
>>>>>>>>>> write-able GPAs are mapped to the same HVA (i.e. same physical
>>>>>>>>>> memory location)?
>>>>>>>>> I don't know, just want to say current API does not forbid this. So we
>>>>>>>>> probably need to take care it.
>>>>>>>>>
>>>>>>>> yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr).
>>>>>>>> But
>>>>>>>> (1) there's only one kvm instance for each vm for each qemu process.
>>>>>>>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu
>>>>>>>> process is non-overlapping as it's obtained from mmmap().
>>>>>>>> (3) qemu ensures two kvm slots will not point to the same section of one ramblock.
>>>>>>>>
>>>>>>>> So, as long as kvm instance is not shared in two processes, and
>>>>>>>> there's no bug in qemu, we can assure that HVA to GPA is 1:1.
>>>>>>> Well, you leave this API for userspace, so you can't assume qemu is the
>>>>>>> only user or any its behavior. If you had you should limit it in the API
>>>>>>> level instead of open window for them.
>>>>>>>
>>>>>>>
>>>>>>>> But even if there are two processes operating on the same kvm instance
>>>>>>>> and manipulating on memory slots, adding an extra GPA along side current
>>>>>>>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the
>>>>>>>> right IOVA->GPA mapping, right?
>>>>>>> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest
>>>>>>> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then
>>>>>>> log through GPA2. If userspace is trying to sync through GPA1, it will
>>>>>>> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See
>>>>>>> what has been done in log_write_hva() in vhost.c). The only way to do
>>>>>>> that is to maintain an independent HVA to GPA mapping like what KVM or
>>>>>>> vhost did.
>>>>>>>
>>>>>> why GPA1 and GPA2 should be both dirty?
>>>>>> even they have the same HVA due to overlaping virtual address space in
>>>>>> two processes, they still correspond to two physical pages.
>>>>>> don't get what's your meaning :)
>>>>> The point is not leave any corner case that is hard to debug or fix in
>>>>> the future.
>>>>>
>>>>> Let's just start by a single process, the API allows userspace to maps
>>>>> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent,
>>>>> it's ok to sync just through GPA1. That means if you only log GPA2, it
>>>>> won't work.
>>>>>
>>>> In that case, cannot log dirty according to HPA.
>>> sorry, it should be "cannot log dirty according to HVA".
>>
>> I think we are discussing the choice between GPA and IOVA, not HVA?
>>
> Right. so why do we need to care about HVA to GPA mapping?
> as long as IOVA to GPA is 1:1, then it's fine.


The problem is whether userspace can try to sync from GPA2, whose HVA
is the same as GPA1's.

The maintainers were copied by Kevin; hope that helps to clarify things.

Thanks


> Thanks
> Yan
>
>> Thanks
>>
>>
>>>> because kvm cannot tell whether it's an valid case (the two GPAs are equivalent)
>>>> or an invalid case (the two GPAs are not equivalent, but with the same
>>>> HVA value).
>>>>
>>>> Right?
>>>>
>>>> Thanks
>>>> Yan
>>>>
>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>> Thanks
>>>>>> Yan
>>>>>>
>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Yan
>>>>>>>>
>>>>>>>>>> Is Qemu doing its own same-content memory
>>>>>>>>>> merging in GPA level, similar to KSM?
>>>>>>>>> AFAIK, it doesn't.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Kevin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19 10:06                         ` Jason Wang
@ 2019-09-19 10:16                           ` Yan Zhao
  2019-09-19 12:14                             ` Jason Wang
  0 siblings, 1 reply; 40+ messages in thread
From: Yan Zhao @ 2019-09-19 10:16 UTC (permalink / raw)
  To: Jason Wang; +Cc: Tian, Kevin, 'Alex Williamson', Peter Xu, qemu-devel

On Thu, Sep 19, 2019 at 06:06:52PM +0800, Jason Wang wrote:
> 
> On 2019/9/19 下午2:29, Yan Zhao wrote:
> > On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote:
> >> On 2019/9/19 下午2:17, Yan Zhao wrote:
> >>> On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
> >>>> On 2019/9/19 下午1:28, Yan Zhao wrote:
> >>>>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
> >>>>>> On 2019/9/18 下午4:37, Tian, Kevin wrote:
> >>>>>>>> From: Jason Wang [mailto:jasowang@redhat.com]
> >>>>>>>> Sent: Wednesday, September 18, 2019 2:10 PM
> >>>>>>>>
> >>>>>>>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
> >>>>>>>> range
> >>>>>>>>>> could be mapped to several GPA ranges.
> >>>>>>>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
> >>>>>>>>>
> >>>>>>>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
> >>>>>>>> I don't remember the details e.g memory region alias? And neither kvm
> >>>>>>>> nor kvm API does forbid this if my memory is correct.
> >>>>>>>>
> >>>>>>> I checkedhttps://qemu.weilnetz.de/doc/devel/memory.html, which
> >>>>>>> provides an example of aliased layout. However, its aliasing is all
> >>>>>>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
> >>>>>>> unique location. Why would we hit the situation where multiple
> >>>>>>> write-able GPAs are mapped to the same HVA (i.e. same physical
> >>>>>>> memory location)?
> >>>>>> I don't know, just want to say current API does not forbid this. So we
> >>>>>> probably need to take care it.
> >>>>>>
> >>>>> yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr).
> >>>>> But
> >>>>> (1) there's only one kvm instance for each vm for each qemu process.
> >>>>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu
> >>>>> process is non-overlapping as it's obtained from mmmap().
> >>>>> (3) qemu ensures two kvm slots will not point to the same section of one ramblock.
> >>>>>
> >>>>> So, as long as kvm instance is not shared in two processes, and
> >>>>> there's no bug in qemu, we can assure that HVA to GPA is 1:1.
> >>>> Well, you leave this API for userspace, so you can't assume qemu is the
> >>>> only user or any its behavior. If you had you should limit it in the API
> >>>> level instead of open window for them.
> >>>>
> >>>>
> >>>>> But even if there are two processes operating on the same kvm instance
> >>>>> and manipulating on memory slots, adding an extra GPA along side current
> >>>>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the
> >>>>> right IOVA->GPA mapping, right?
> >>>> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest
> >>>> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then
> >>>> log through GPA2. If userspace is trying to sync through GPA1, it will
> >>>> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See
> >>>> what has been done in log_write_hva() in vhost.c). The only way to do
> >>>> that is to maintain an independent HVA to GPA mapping like what KVM or
> >>>> vhost did.
> >>>>
> >>> why GPA1 and GPA2 should be both dirty?
> >>> even they have the same HVA due to overlaping virtual address space in
> >>> two processes, they still correspond to two physical pages.
> >>> don't get what's your meaning:)
> >> The point is not leave any corner case that is hard to debug or fix in
> >> the future.
> >>
> >> Let's just start by a single process, the API allows userspace to maps
> >> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent,
> >> it's ok to sync just through GPA1. That means if you only log GPA2, it
> >> won't work.
> >>
> > In that case, cannot log dirty according to HPA.
> > because kvm cannot tell whether it's an valid case (the two GPAs are equivalent)
> > or an invalid case (the two GPAs are not equivalent, but with the same
> > HVA value).
> >
> > Right?
> 
> 
> There no need any examination on whether it was 'valid' or not. It's as 
> simple as logging both GPA1 and GPA2. Then you won't need to care any 
> corner case.
>
But if GPA1 and GPA2 point to the same HVA, it means they point to the
same page. Then if you only log GPA2 and send GPA2 to the target, it
should still work, unless on the target side GPA1 and GPA2 do not point to
the same HVA?

Under what condition have you met this in reality?
Please kindly point it out :)



> Thanks
> 
> 
> >
> > Thanks
> > Yan
> >
> >


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19  7:16                       ` Tian, Kevin
  2019-09-19  9:37                         ` Jason Wang
@ 2019-09-19 11:14                         ` Paolo Bonzini
  2019-09-19 12:39                           ` Jason Wang
  2019-09-19 22:54                           ` Tian, Kevin
  1 sibling, 2 replies; 40+ messages in thread
From: Paolo Bonzini @ 2019-09-19 11:14 UTC (permalink / raw)
  To: Tian, Kevin, Jason Wang, Zhao, Yan Y
  Cc: 'Alex Williamson', qemu-devel

On 19/09/19 09:16, Tian, Kevin wrote:
>>> why GPA1 and GPA2 should be both dirty?
>>> even they have the same HVA due to overlaping virtual address space in
>>> two processes, they still correspond to two physical pages.
>>> don't get what's your meaning :)
>>
>> The point is not leave any corner case that is hard to debug or fix in
>> the future.
>>
>> Let's just start by a single process, the API allows userspace to maps
>> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent,
>> it's ok to sync just through GPA1. That means if you only log GPA2, it
>> won't work.
> 
> I noted KVM itself doesn't consider such situation (one HVA is mapped
> to multiple GPAs), when doing its dirty page tracking. If you look at
> kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which
> contains the dirty gfn and then set the dirty bit within that slot. It
> doesn't attempt to walk all memslots to find out any other GPA which
> may be mapped to the same HVA. 
> 
> So there must be some disconnect here. let's hear from Paolo first and
> understand the rationale behind such situation.

In general, userspace cannot assume that it's okay to sync just through
GPA1.  It must sync the host page if *either* GPA1 or GPA2 are marked dirty.

The situation really only arises in special cases.  For example,
0xfffe0000..0xffffffff and 0xe0000..0xfffff might be the same memory.
From "info mtree" before the guest boots:

    0000000000000000-ffffffffffffffff (prio -1, i/o): pci
      00000000000e0000-00000000000fffff (prio 1, i/o): alias isa-bios
@pc.bios 0000000000020000-000000000003ffff
      00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios

However, non-x86 machines may have other cases of aliased memory so it's
a case that you should cover.
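
To make the rule concrete, here is a minimal userspace-side sketch (the
alias structure and helper names are made up for illustration, not QEMU
code): a host page behind an aliased range has to be resynced if any of
its GPA aliases was reported dirty.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* One writable range that aliases the same backing RAM under two GPAs
 * (an isa-bios style alias).  Purely illustrative layout. */
struct gpa_alias {
    uint64_t gpa_a;   /* first alias                 */
    uint64_t gpa_b;   /* second alias                */
    uint64_t size;    /* length of the aliased range */
};

/* One bit per GPA page in a flat dirty bitmap (stand-in for the real
 * log_sync result). */
static bool gpa_test_dirty(const unsigned long *bitmap, uint64_t gpa)
{
    uint64_t pfn = gpa >> PAGE_SHIFT;

    return bitmap[pfn / (8 * sizeof(unsigned long))] &
           (1UL << (pfn % (8 * sizeof(unsigned long))));
}

/* The host page at offset 'off' of the alias must be resynced if EITHER
 * alias was logged dirty -- syncing through just one GPA is not safe. */
static bool alias_page_needs_sync(const struct gpa_alias *a,
                                  const unsigned long *dirty_bitmap,
                                  uint64_t off)
{
    return gpa_test_dirty(dirty_bitmap, a->gpa_a + off) ||
           gpa_test_dirty(dirty_bitmap, a->gpa_b + off);
}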

Paolo


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19 10:16                           ` Yan Zhao
@ 2019-09-19 12:14                             ` Jason Wang
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Wang @ 2019-09-19 12:14 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Tian, Kevin, 'Alex Williamson', Peter Xu, qemu-devel


On 2019/9/19 下午6:16, Yan Zhao wrote:
> On Thu, Sep 19, 2019 at 06:06:52PM +0800, Jason Wang wrote:
>> On 2019/9/19 下午2:29, Yan Zhao wrote:
>>> On Thu, Sep 19, 2019 at 02:32:03PM +0800, Jason Wang wrote:
>>>> On 2019/9/19 下午2:17, Yan Zhao wrote:
>>>>> On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
>>>>>> On 2019/9/19 下午1:28, Yan Zhao wrote:
>>>>>>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
>>>>>>>> On 2019/9/18 下午4:37, Tian, Kevin wrote:
>>>>>>>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>>>>>>>> Sent: Wednesday, September 18, 2019 2:10 PM
>>>>>>>>>>
>>>>>>>>>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One HVA
>>>>>>>>>> range
>>>>>>>>>>>> could be mapped to several GPA ranges.
>>>>>>>>>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
>>>>>>>>>>>
>>>>>>>>>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't realize it.
>>>>>>>>>> I don't remember the details e.g memory region alias? And neither kvm
>>>>>>>>>> nor kvm API does forbid this if my memory is correct.
>>>>>>>>>>
>>>>>>>>> I checkedhttps://qemu.weilnetz.de/doc/devel/memory.html, which
>>>>>>>>> provides an example of aliased layout. However, its aliasing is all
>>>>>>>>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
>>>>>>>>> unique location. Why would we hit the situation where multiple
>>>>>>>>> write-able GPAs are mapped to the same HVA (i.e. same physical
>>>>>>>>> memory location)?
>>>>>>>> I don't know, just want to say current API does not forbid this. So we
>>>>>>>> probably need to take care it.
>>>>>>>>
>>>>>>> yes, in KVM API level, it does not forbid two slots to have the same HVA(slot->userspace_addr).
>>>>>>> But
>>>>>>> (1) there's only one kvm instance for each vm for each qemu process.
>>>>>>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr) in one qemu
>>>>>>> process is non-overlapping as it's obtained from mmmap().
>>>>>>> (3) qemu ensures two kvm slots will not point to the same section of one ramblock.
>>>>>>>
>>>>>>> So, as long as kvm instance is not shared in two processes, and
>>>>>>> there's no bug in qemu, we can assure that HVA to GPA is 1:1.
>>>>>> Well, you leave this API for userspace, so you can't assume qemu is the
>>>>>> only user or any its behavior. If you had you should limit it in the API
>>>>>> level instead of open window for them.
>>>>>>
>>>>>>
>>>>>>> But even if there are two processes operating on the same kvm instance
>>>>>>> and manipulating on memory slots, adding an extra GPA along side current
>>>>>>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows the
>>>>>>> right IOVA->GPA mapping, right?
>>>>>> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2. Guest
>>>>>> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and then
>>>>>> log through GPA2. If userspace is trying to sync through GPA1, it will
>>>>>> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See
>>>>>> what has been done in log_write_hva() in vhost.c). The only way to do
>>>>>> that is to maintain an independent HVA to GPA mapping like what KVM or
>>>>>> vhost did.
>>>>>>
>>>>> why GPA1 and GPA2 should be both dirty?
>>>>> even they have the same HVA due to overlaping virtual address space in
>>>>> two processes, they still correspond to two physical pages.
>>>>> don't get what's your meaning:)
>>>> The point is not leave any corner case that is hard to debug or fix in
>>>> the future.
>>>>
>>>> Let's just start by a single process, the API allows userspace to maps
>>>> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent,
>>>> it's ok to sync just through GPA1. That means if you only log GPA2, it
>>>> won't work.
>>>>
>>> In that case, cannot log dirty according to HPA.
>>> because kvm cannot tell whether it's an valid case (the two GPAs are equivalent)
>>> or an invalid case (the two GPAs are not equivalent, but with the same
>>> HVA value).
>>>
>>> Right?
>>
>> There no need any examination on whether it was 'valid' or not. It's as
>> simple as logging both GPA1 and GPA2. Then you won't need to care any
>> corner case.
>>
> But, if GPA1 and GPA2 point to the same HVA, it means they point to the
> same page. Then if you only log GPA2, and send GPA2 to target,  it
> should still works, unless in the target side GPA1 and GPA2 do not point to
> the same HVA?


The problem is whether userspace can just sync GPA1 instead of both GPA1
and GPA2. If userspace syncs through GPA1 only, the dirty pages logged
under GPA2 would be lost. Paolo has pointed out that userspace cannot make
that assumption.
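
For reference, here is a rough sketch of the defensive logging described
above, in the spirit of log_write_hva() in vhost.c: a reverse HVA->GPA walk
that marks every matching GPA alias. The structures and helpers below are
simplified stand-ins for illustration, not the actual kernel code.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for a vhost memory region (GPA -> HVA). */
struct mem_region {
    uint64_t gpa;        /* guest physical start           */
    uint64_t hva;        /* host virtual start (userspace) */
    uint64_t size;
};

struct mem_table {
    struct mem_region *regions;
    size_t nregions;
};

/* Placeholder for setting bits in the real GPA dirty bitmap. */
static void log_dirty_gpa(uint64_t gpa, uint64_t len)
{
    printf("dirty: gpa=0x%llx len=0x%llx\n",
           (unsigned long long)gpa, (unsigned long long)len);
}

/*
 * Given a written HVA range, mark EVERY GPA alias that maps to it.
 * Stopping at the first match would only be safe if userspace were
 * guaranteed to sync all aliases itself; logging all of them leaves
 * no corner case behind.
 */
static void log_write_hva_sketch(const struct mem_table *mt,
                                 uint64_t hva, uint64_t len)
{
    for (size_t i = 0; i < mt->nregions; i++) {
        const struct mem_region *r = &mt->regions[i];
        uint64_t start = r->hva, end = r->hva + r->size;
        uint64_t lo, hi;

        if (hva >= end || hva + len <= start)
            continue;                        /* no overlap */

        lo = hva > start ? hva : start;
        hi = (hva + len) < end ? (hva + len) : end;

        /* translate the overlapping HVA chunk back to this GPA alias */
        log_dirty_gpa(r->gpa + (lo - start), hi - lo);
    }
}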


>
> In what condition you met it in reality?
> Please kindly point it out :)


It's not about reality, it's about possibility. Again, we don't want to 
leave any corner case that is hard to debug or fix in the future.

Thanks


>
>
>> Thanks
>>
>>
>>> Thanks
>>> Yan
>>>
>>>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19 11:14                         ` Paolo Bonzini
@ 2019-09-19 12:39                           ` Jason Wang
  2019-09-19 12:45                             ` Paolo Bonzini
  2019-09-19 22:54                           ` Tian, Kevin
  1 sibling, 1 reply; 40+ messages in thread
From: Jason Wang @ 2019-09-19 12:39 UTC (permalink / raw)
  To: Paolo Bonzini, Tian, Kevin, Zhao, Yan Y
  Cc: 'Alex Williamson', qemu-devel


On 2019/9/19 下午7:14, Paolo Bonzini wrote:
> On 19/09/19 09:16, Tian, Kevin wrote:
>>>> why GPA1 and GPA2 should be both dirty?
>>>> even they have the same HVA due to overlaping virtual address space in
>>>> two processes, they still correspond to two physical pages.
>>>> don't get what's your meaning :)
>>> The point is not leave any corner case that is hard to debug or fix in
>>> the future.
>>>
>>> Let's just start by a single process, the API allows userspace to maps
>>> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent,
>>> it's ok to sync just through GPA1. That means if you only log GPA2, it
>>> won't work.
>> I noted KVM itself doesn't consider such situation (one HVA is mapped
>> to multiple GPAs), when doing its dirty page tracking. If you look at
>> kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which
>> contains the dirty gfn and then set the dirty bit within that slot. It
>> doesn't attempt to walk all memslots to find out any other GPA which
>> may be mapped to the same HVA.
>>
>> So there must be some disconnect here. let's hear from Paolo first and
>> understand the rationale behind such situation.
> In general, userspace cannot assume that it's okay to sync just through
> GPA1.  It must sync the host page if *either* GPA1 or GPA2 are marked dirty.


Maybe we need to document this somewhere.


>
> The situation really only arises in special cases.  For example,
> 0xfffe0000..0xffffffff and 0xe0000..0xfffff might be the same memory.
>  From "info mtree" before the guest boots:
>
>      0000000000000000-ffffffffffffffff (prio -1, i/o): pci
>        00000000000e0000-00000000000fffff (prio 1, i/o): alias isa-bios
> @pc.bios 0000000000020000-000000000003ffff
>        00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios
>
> However, non-x86 machines may have other cases of aliased memory so it's
> a case that you should cover.
>
> Paolo


Is there any other issue that still needs to be covered, considering that
userspace needs to sync both GPAs?

Thanks



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19 12:39                           ` Jason Wang
@ 2019-09-19 12:45                             ` Paolo Bonzini
  0 siblings, 0 replies; 40+ messages in thread
From: Paolo Bonzini @ 2019-09-19 12:45 UTC (permalink / raw)
  To: Jason Wang, Tian, Kevin, Zhao, Yan Y
  Cc: 'Alex Williamson', qemu-devel

On 19/09/19 14:39, Jason Wang wrote:
>> In general, userspace cannot assume that it's okay to sync just through
>> GPA1.  It must sync the host page if *either* GPA1 or GPA2 are marked
>> dirty.
> 
> Maybe we need document this somewhere.

Well, it's implicit but it should be kind of obvious.  The dirty page
only tells you that the guest wrote to the GPA; HVAs are never mentioned
in the documentation.

Paolo

> Any other issue that still need to be covered consider userspace need to
> sync both GPAs?
> 
> Thanks
> 



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19  9:37                         ` Jason Wang
@ 2019-09-19 14:06                           ` Michael S. Tsirkin
  2019-09-20  1:15                             ` Jason Wang
  0 siblings, 1 reply; 40+ messages in thread
From: Michael S. Tsirkin @ 2019-09-19 14:06 UTC (permalink / raw)
  To: Jason Wang
  Cc: Paolo Bonzini, Tian, Kevin, Zhao, Yan Y,
	'Alex Williamson',
	qemu-devel

On Thu, Sep 19, 2019 at 05:37:48PM +0800, Jason Wang wrote:
> 
> On 2019/9/19 下午3:16, Tian, Kevin wrote:
> > +Paolo to help clarify here.
> > 
> > > From: Jason Wang [mailto:jasowang@redhat.com]
> > > Sent: Thursday, September 19, 2019 2:32 PM
> > > 
> > > 
> > > On 2019/9/19 下午2:17, Yan Zhao wrote:
> > > > On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
> > > > > On 2019/9/19 下午1:28, Yan Zhao wrote:
> > > > > > On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
> > > > > > > On 2019/9/18 下午4:37, Tian, Kevin wrote:
> > > > > > > > > From: Jason Wang [mailto:jasowang@redhat.com]
> > > > > > > > > Sent: Wednesday, September 18, 2019 2:10 PM
> > > > > > > > > 
> > > > > > > > > > > Note that the HVA to GPA mapping is not an 1:1 mapping. One
> > > HVA
> > > > > > > > > range
> > > > > > > > > > > could be mapped to several GPA ranges.
> > > > > > > > > > This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
> > > > > > > > > > 
> > > > > > > > > > btw under what condition HVA->GPA is not 1:1 mapping? I didn't
> > > realize it.
> > > > > > > > > I don't remember the details e.g memory region alias? And neither
> > > kvm
> > > > > > > > > nor kvm API does forbid this if my memory is correct.
> > > > > > > > > 
> > > > > > > > I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
> > > > > > > > provides an example of aliased layout. However, its aliasing is all
> > > > > > > > 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
> > > > > > > > unique location. Why would we hit the situation where multiple
> > > > > > > > write-able GPAs are mapped to the same HVA (i.e. same physical
> > > > > > > > memory location)?
> > > > > > > I don't know, just want to say current API does not forbid this. So we
> > > > > > > probably need to take care it.
> > > > > > > 
> > > > > > yes, in KVM API level, it does not forbid two slots to have the same
> > > HVA(slot->userspace_addr).
> > > > > > But
> > > > > > (1) there's only one kvm instance for each vm for each qemu process.
> > > > > > (2) all ramblock->host (corresponds to HVA and slot->userspace_addr)
> > > in one qemu
> > > > > > process is non-overlapping as it's obtained from mmmap().
> > > > > > (3) qemu ensures two kvm slots will not point to the same section of
> > > one ramblock.
> > > > > > So, as long as kvm instance is not shared in two processes, and
> > > > > > there's no bug in qemu, we can assure that HVA to GPA is 1:1.
> > > > > Well, you leave this API for userspace, so you can't assume qemu is the
> > > > > only user or any its behavior. If you had you should limit it in the API
> > > > > level instead of open window for them.
> > > > > 
> > > > > 
> > > > > > But even if there are two processes operating on the same kvm
> > > instance
> > > > > > and manipulating on memory slots, adding an extra GPA along side
> > > current
> > > > > > IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows
> > > the
> > > > > > right IOVA->GPA mapping, right?
> > > > > It looks fragile. Consider HVA was mapped to both GPA1 and GPA2.
> > > Guest
> > > > > maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and
> > > then
> > > > > log through GPA2. If userspace is trying to sync through GPA1, it will
> > > > > miss the dirty page. So for safety we need log both GPA1 and GPA2. (See
> > > > > what has been done in log_write_hva() in vhost.c). The only way to do
> > > > > that is to maintain an independent HVA to GPA mapping like what KVM
> > > or
> > > > > vhost did.
> > > > > 
> > > > why GPA1 and GPA2 should be both dirty?
> > > > even they have the same HVA due to overlaping virtual address space in
> > > > two processes, they still correspond to two physical pages.
> > > > don't get what's your meaning :)
> > > 
> > > The point is not leave any corner case that is hard to debug or fix in
> > > the future.
> > > 
> > > Let's just start by a single process, the API allows userspace to maps
> > > HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent,
> > > it's ok to sync just through GPA1. That means if you only log GPA2, it
> > > won't work.
> > > 
> > I noted KVM itself doesn't consider such situation (one HVA is mapped
> > to multiple GPAs), when doing its dirty page tracking. If you look at
> > kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which
> > contains the dirty gfn and then set the dirty bit within that slot. It
> > doesn't attempt to walk all memslots to find out any other GPA which
> > may be mapped to the same HVA.
> > 
> > So there must be some disconnect here. let's hear from Paolo first and
> > understand the rationale behind such situation.
> 
> 
> Neither did vhost when IOTLB is disabled. And cc Michael who points out this
> issue at the beginning.
> 
> Thanks
> 
> 
> > 
> > Thanks
> > Kevin

Yes, we fixed it with a kind of workaround; at the time I proposed
a new interface to fix it fully. I don't think we ever got around
to implementing it - right?



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-18  7:21           ` Tian, Kevin
@ 2019-09-19 17:20             ` Alex Williamson
  2019-09-19 22:40               ` Tian, Kevin
  0 siblings, 1 reply; 40+ messages in thread
From: Alex Williamson @ 2019-09-19 17:20 UTC (permalink / raw)
  To: Tian, Kevin; +Cc: Jason Wang, Zhao, Yan Y, qemu-devel

On Wed, 18 Sep 2019 07:21:05 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jason Wang [mailto:jasowang@redhat.com]
> > Sent: Wednesday, September 18, 2019 2:04 PM
> > 
> > On 2019/9/18 上午9:31, Tian, Kevin wrote:  
> > >> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > >> Sent: Tuesday, September 17, 2019 10:54 PM
> > >>
> > >> On Tue, 17 Sep 2019 08:48:36 +0000
> > >> "Tian, Kevin"<kevin.tian@intel.com>  wrote:
> > >>  
> > >>>> From: Jason Wang [mailto:jasowang@redhat.com]
> > >>>> Sent: Monday, September 16, 2019 4:33 PM
> > >>>>
> > >>>>
> > >>>> On 2019/9/16 上午9:51, Tian, Kevin wrote:  
> > >>>>> Hi, Jason
> > >>>>>
> > >>>>> We had a discussion about dirty page tracking in VFIO, when  
> > vIOMMU  
> > >>>>> is enabled:
> > >>>>>
> > >>>>> https://lists.nongnu.org/archive/html/qemu-devel/2019-  
> > >>>> 09/msg02690.html  
> > >>>>> It's actually a similar model as vhost - Qemu cannot interpose the  
> > fast-  
> > >>>> path  
> > >>>>> DMAs thus relies on the kernel part to track and report dirty page  
> > >>>> information.  
> > >>>>> Currently Qemu tracks dirty pages in GFN level, thus demanding a  
> > >>>> translation  
> > >>>>> from IOVA to GPA. Then the open in our discussion is where this  
> > >>>> translation  
> > >>>>> should happen. Doing the translation in kernel implies a device iotlb  
> > >>>> flavor,  
> > >>>>> which is what vhost implements today. It requires potentially large  
> > >>>> tracking  
> > >>>>> structures in the host kernel, but leveraging the existing log_sync  
> > flow  
> > >> in  
> > >>>> Qemu.  
> > >>>>> On the other hand, Qemu may perform log_sync for every removal  
> > of  
> > >>>> IOVA  
> > >>>>> mapping and then do the translation itself, then avoiding the GPA  
> > >>>> awareness  
> > >>>>> in the kernel side. It needs some change to current Qemu log-sync  
> > >> flow,  
> > >>>> and  
> > >>>>> may bring more overhead if IOVA is frequently unmapped.
> > >>>>>
> > >>>>> So we'd like to hear about your opinions, especially about how you  
> > >> came  
> > >>>>> down to the current iotlb approach for vhost.  
> > >>>> We don't consider too much in the point when introducing vhost. And
> > >>>> before IOTLB, vhost has already know GPA through its mem table
> > >>>> (GPA->HVA). So it's nature and easier to track dirty pages at GPA level
> > >>>> then it won't any changes in the existing ABI.  
> > >>> This is the same situation as VFIO.  
> > >> It is?  VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA.  In
> > >> some cases IOVA is GPA, but not all.  
> > > Well, I thought vhost has a similar design, that the index of its mem table
> > > is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is on.
> > > But I may be wrong here. Jason, can you help clarify? I saw two
> > > interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for GPA)
> > > and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or  
> > together?  
> > >  
> > 
> > Actually, vhost maintains two interval trees, mem table GPA->HVA, and
> > device IOTLB IOVA->HVA. Device IOTLB is only used when vIOMMU is
> > enabled, and in that case mem table is used only when vhost need to
> > track dirty pages (do reverse lookup of memtable to get HVA->GPA). So in
> > conclusion, for datapath, they are used exclusively, but they need
> > cowork for logging dirty pages when device IOTLB is enabled.
> >   
> 
> OK. Then it's different from current VFIO design, which maintains only
> one tree which is indexed by either GPA or IOVA exclusively, upon 
> whether vIOMMU is in use. 

Nit, the VFIO tree is only ever indexed by IOVA.  The MAP_DMA ioctl is
only ever performed with an IOVA.  Userspace decides how that IOVA maps
to GPA; VFIO only needs to know how the IOVA maps to HPA via the HVA.
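
For readers less familiar with the ABI, the call referred to above looks
roughly like the sketch below: userspace hands VFIO an IOVA and a process
virtual address (HVA), and no GPA appears anywhere in the interface. Error
handling is trimmed and the container fd is assumed to be already set up;
this is an illustrative sketch, not QEMU's code.

#include <linux/vfio.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/*
 * Map one IOVA range for device DMA.  'vaddr' is an HVA in the calling
 * process; how that HVA relates to a GPA is entirely the caller's
 * business (QEMU's, in the vIOMMU case) -- VFIO never sees the GPA.
 */
static int vfio_map_iova(int container_fd, uint64_t iova,
                         void *vaddr, uint64_t size)
{
    struct vfio_iommu_type1_dma_map map;

    memset(&map, 0, sizeof(map));
    map.argsz = sizeof(map);
    map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
    map.vaddr = (uintptr_t)vaddr;   /* HVA */
    map.iova  = iova;               /* what the device will use */
    map.size  = size;

    return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
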
Thanks,

Alex


^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19 17:20             ` Alex Williamson
@ 2019-09-19 22:40               ` Tian, Kevin
  0 siblings, 0 replies; 40+ messages in thread
From: Tian, Kevin @ 2019-09-19 22:40 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Jason Wang, Zhao, Yan Y, qemu-devel

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, September 20, 2019 1:21 AM
> 
> On Wed, 18 Sep 2019 07:21:05 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Jason Wang [mailto:jasowang@redhat.com]
> > > Sent: Wednesday, September 18, 2019 2:04 PM
> > >
> > > On 2019/9/18 上午9:31, Tian, Kevin wrote:
> > > >> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > >> Sent: Tuesday, September 17, 2019 10:54 PM
> > > >>
> > > >> On Tue, 17 Sep 2019 08:48:36 +0000
> > > >> "Tian, Kevin"<kevin.tian@intel.com>  wrote:
> > > >>
> > > >>>> From: Jason Wang [mailto:jasowang@redhat.com]
> > > >>>> Sent: Monday, September 16, 2019 4:33 PM
> > > >>>>
> > > >>>>
> > > >>>> On 2019/9/16 上午9:51, Tian, Kevin wrote:
> > > >>>>> Hi, Jason
> > > >>>>>
> > > >>>>> We had a discussion about dirty page tracking in VFIO, when
> > > vIOMMU
> > > >>>>> is enabled:
> > > >>>>>
> > > >>>>> https://lists.nongnu.org/archive/html/qemu-devel/2019-
> > > >>>> 09/msg02690.html
> > > >>>>> It's actually a similar model as vhost - Qemu cannot interpose the
> > > fast-
> > > >>>> path
> > > >>>>> DMAs thus relies on the kernel part to track and report dirty page
> > > >>>> information.
> > > >>>>> Currently Qemu tracks dirty pages in GFN level, thus demanding a
> > > >>>> translation
> > > >>>>> from IOVA to GPA. Then the open in our discussion is where this
> > > >>>> translation
> > > >>>>> should happen. Doing the translation in kernel implies a device
> iotlb
> > > >>>> flavor,
> > > >>>>> which is what vhost implements today. It requires potentially
> large
> > > >>>> tracking
> > > >>>>> structures in the host kernel, but leveraging the existing log_sync
> > > flow
> > > >> in
> > > >>>> Qemu.
> > > >>>>> On the other hand, Qemu may perform log_sync for every
> removal
> > > of
> > > >>>> IOVA
> > > >>>>> mapping and then do the translation itself, then avoiding the GPA
> > > >>>> awareness
> > > >>>>> in the kernel side. It needs some change to current Qemu log-
> sync
> > > >> flow,
> > > >>>> and
> > > >>>>> may bring more overhead if IOVA is frequently unmapped.
> > > >>>>>
> > > >>>>> So we'd like to hear about your opinions, especially about how
> you
> > > >> came
> > > >>>>> down to the current iotlb approach for vhost.
> > > >>>> We don't consider too much in the point when introducing vhost.
> And
> > > >>>> before IOTLB, vhost has already know GPA through its mem table
> > > >>>> (GPA->HVA). So it's nature and easier to track dirty pages at GPA
> level
> > > >>>> then it won't any changes in the existing ABI.
> > > >>> This is the same situation as VFIO.
> > > >> It is?  VFIO doesn't know GPAs, it only knows HVA, HPA, and IOVA.
> In
> > > >> some cases IOVA is GPA, but not all.
> > > > Well, I thought vhost has a similar design, that the index of its mem
> table
> > > > is GPA when vIOMMU is off and then becomes IOVA when vIOMMU is
> on.
> > > > But I may be wrong here. Jason, can you help clarify? I saw two
> > > > interfaces which poke the mem table: VHOST_SET_MEM_TABLE (for
> GPA)
> > > > and VHOST_IOTLB_UPDATE (for IOVA). Are they used exclusively or
> > > together?
> > > >
> > >
> > > Actually, vhost maintains two interval trees, mem table GPA->HVA, and
> > > device IOTLB IOVA->HVA. Device IOTLB is only used when vIOMMU is
> > > enabled, and in that case mem table is used only when vhost need to
> > > track dirty pages (do reverse lookup of memtable to get HVA->GPA). So
> in
> > > conclusion, for datapath, they are used exclusively, but they need
> > > cowork for logging dirty pages when device IOTLB is enabled.
> > >
> >
> > OK. Then it's different from current VFIO design, which maintains only
> > one tree which is indexed by either GPA or IOVA exclusively, upon
> > whether vIOMMU is in use.
> 
> Nit, the VFIO tree is only ever indexed by IOVA.  The MAP_DMA ioctl is
> only ever performed with an IOVA.  Userspace decides how that IOVA
> maps
> to GPA, VFIO only needs to know how the IOVA maps to HPA via the HVA.
> Thanks,
> 

I was only referring to its actual meaning from a usage p.o.v., not the
parameter name (which is always called iova) in vfio.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19 11:14                         ` Paolo Bonzini
  2019-09-19 12:39                           ` Jason Wang
@ 2019-09-19 22:54                           ` Tian, Kevin
  2019-09-20  1:18                             ` Jason Wang
  1 sibling, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2019-09-19 22:54 UTC (permalink / raw)
  To: Paolo Bonzini, Jason Wang, Zhao, Yan Y
  Cc: 'Alex Williamson', qemu-devel, mst

> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> Sent: Thursday, September 19, 2019 7:14 PM
> 
> On 19/09/19 09:16, Tian, Kevin wrote:
> >>> why GPA1 and GPA2 should be both dirty?
> >>> even they have the same HVA due to overlaping virtual address space
> in
> >>> two processes, they still correspond to two physical pages.
> >>> don't get what's your meaning :)
> >>
> >> The point is not leave any corner case that is hard to debug or fix in
> >> the future.
> >>
> >> Let's just start by a single process, the API allows userspace to maps
> >> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are
> equivalent,
> >> it's ok to sync just through GPA1. That means if you only log GPA2, it
> >> won't work.
> >
> > I noted KVM itself doesn't consider such situation (one HVA is mapped
> > to multiple GPAs), when doing its dirty page tracking. If you look at
> > kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which
> > contains the dirty gfn and then set the dirty bit within that slot. It
> > doesn't attempt to walk all memslots to find out any other GPA which
> > may be mapped to the same HVA.
> >
> > So there must be some disconnect here. let's hear from Paolo first and
> > understand the rationale behind such situation.
> 
> In general, userspace cannot assume that it's okay to sync just through
> GPA1.  It must sync the host page if *either* GPA1 or GPA2 are marked
> dirty.

Agree. In this case the kernel only needs to track whether GPA1 or
GPA2 is dirtied by guest operations. The reason why vhost has to
set both GPA1 and GPA2 is due to its own design - it maintains
IOVA->HVA and GPA->HVA mappings, thus given an IOVA you have
to do a reverse lookup in the GPA->HVA memTable, which may give multiple
possible GPAs. But in concept, if vhost could maintain an IOVA->GPA
mapping, then it would be straightforward to set the right GPA every time
an IOVA is tracked.
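
As a conceptual sketch of what I mean (nothing like this exists in vhost
today; the table and helpers below are invented for illustration), an
explicit IOVA->GPA table would let the logger resolve exactly one GPA per
tracked IOVA instead of fanning out through an HVA reverse lookup:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Invented IOVA -> GPA table; a real implementation would likely use
 * an interval tree, like the existing IOVA -> HVA one. */
struct iova_gpa_entry {
    uint64_t iova;
    uint64_t gpa;
    uint64_t size;
};

static struct iova_gpa_entry iova_gpa_table[16];
static int iova_gpa_entries;

static bool iova_to_gpa(uint64_t iova, uint64_t *gpa)
{
    for (int i = 0; i < iova_gpa_entries; i++) {
        struct iova_gpa_entry *e = &iova_gpa_table[i];

        if (iova >= e->iova && iova < e->iova + e->size) {
            *gpa = e->gpa + (iova - e->iova);
            return true;
        }
    }
    return false;
}

/* Log a DMA write seen at 'iova': one lookup yields exactly the GPA the
 * guest mapped, with no aliasing ambiguity. */
static void log_write_iova(uint64_t iova, uint64_t len)
{
    uint64_t gpa;

    if (iova_to_gpa(iova, &gpa))
        printf("dirty: gpa=0x%llx len=0x%llx\n",
               (unsigned long long)gpa, (unsigned long long)len);
}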

> 
> The situation really only arises in special cases.  For example,
> 0xfffe0000..0xffffffff and 0xe0000..0xfffff might be the same memory.
> From "info mtree" before the guest boots:
> 
>     0000000000000000-ffffffffffffffff (prio -1, i/o): pci
>       00000000000e0000-00000000000fffff (prio 1, i/o): alias isa-bios
> @pc.bios 0000000000020000-000000000003ffff
>       00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios
> 
> However, non-x86 machines may have other cases of aliased memory so
> it's
> a case that you should cover.
> 

The above example is read-only, thus won't be touched in the logdirty path.
But now I agree that a specific architecture may define two
writable GPA ranges with one as an alias of the other, as long as
such a case is explicitly documented so the guest OS won't treat them as
separate memory pages.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19 14:06                           ` Michael S. Tsirkin
@ 2019-09-20  1:15                             ` Jason Wang
  2019-09-20 10:02                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2019-09-20  1:15 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Paolo Bonzini, Tian, Kevin, Zhao, Yan Y,
	'Alex Williamson',
	qemu-devel


On 2019/9/19 下午10:06, Michael S. Tsirkin wrote:
> On Thu, Sep 19, 2019 at 05:37:48PM +0800, Jason Wang wrote:
>> On 2019/9/19 下午3:16, Tian, Kevin wrote:
>>> +Paolo to help clarify here.
>>>
>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>> Sent: Thursday, September 19, 2019 2:32 PM
>>>>
>>>>
>>>> On 2019/9/19 下午2:17, Yan Zhao wrote:
>>>>> On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
>>>>>> On 2019/9/19 下午1:28, Yan Zhao wrote:
>>>>>>> On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
>>>>>>>> On 2019/9/18 下午4:37, Tian, Kevin wrote:
>>>>>>>>>> From: Jason Wang [mailto:jasowang@redhat.com]
>>>>>>>>>> Sent: Wednesday, September 18, 2019 2:10 PM
>>>>>>>>>>
>>>>>>>>>>>> Note that the HVA to GPA mapping is not an 1:1 mapping. One
>>>> HVA
>>>>>>>>>> range
>>>>>>>>>>>> could be mapped to several GPA ranges.
>>>>>>>>>>> This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
>>>>>>>>>>>
>>>>>>>>>>> btw under what condition HVA->GPA is not 1:1 mapping? I didn't
>>>> realize it.
>>>>>>>>>> I don't remember the details e.g memory region alias? And neither
>>>> kvm
>>>>>>>>>> nor kvm API does forbid this if my memory is correct.
>>>>>>>>>>
>>>>>>>>> I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
>>>>>>>>> provides an example of aliased layout. However, its aliasing is all
>>>>>>>>> 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
>>>>>>>>> unique location. Why would we hit the situation where multiple
>>>>>>>>> write-able GPAs are mapped to the same HVA (i.e. same physical
>>>>>>>>> memory location)?
>>>>>>>> I don't know, just want to say current API does not forbid this. So we
>>>>>>>> probably need to take care it.
>>>>>>>>
>>>>>>> yes, in KVM API level, it does not forbid two slots to have the same
>>>> HVA(slot->userspace_addr).
>>>>>>> But
>>>>>>> (1) there's only one kvm instance for each vm for each qemu process.
>>>>>>> (2) all ramblock->host (corresponds to HVA and slot->userspace_addr)
>>>> in one qemu
>>>>>>> process is non-overlapping as it's obtained from mmmap().
>>>>>>> (3) qemu ensures two kvm slots will not point to the same section of
>>>> one ramblock.
>>>>>>> So, as long as kvm instance is not shared in two processes, and
>>>>>>> there's no bug in qemu, we can assure that HVA to GPA is 1:1.
>>>>>> Well, you leave this API for userspace, so you can't assume qemu is the
>>>>>> only user or any its behavior. If you had you should limit it in the API
>>>>>> level instead of open window for them.
>>>>>>
>>>>>>
>>>>>>> But even if there are two processes operating on the same kvm
>>>> instance
>>>>>>> and manipulating on memory slots, adding an extra GPA along side
>>>> current
>>>>>>> IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows
>>>> the
>>>>>>> right IOVA->GPA mapping, right?
>>>>>> It looks fragile. Consider HVA was mapped to both GPA1 and GPA2.
>>>> Guest
>>>>>> maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and
>>>> then
>>>>>> log through GPA2. If userspace is trying to sync through GPA1, it will
>>>>>> miss the dirty page. So for safety we need log both GPA1 and GPA2. (See
>>>>>> what has been done in log_write_hva() in vhost.c). The only way to do
>>>>>> that is to maintain an independent HVA to GPA mapping like what KVM
>>>> or
>>>>>> vhost did.
>>>>>>
>>>>> why GPA1 and GPA2 should be both dirty?
>>>>> even they have the same HVA due to overlaping virtual address space in
>>>>> two processes, they still correspond to two physical pages.
>>>>> don't get what's your meaning :)
>>>> The point is not leave any corner case that is hard to debug or fix in
>>>> the future.
>>>>
>>>> Let's just start by a single process, the API allows userspace to maps
>>>> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent,
>>>> it's ok to sync just through GPA1. That means if you only log GPA2, it
>>>> won't work.
>>>>
>>> I noted KVM itself doesn't consider such situation (one HVA is mapped
>>> to multiple GPAs), when doing its dirty page tracking. If you look at
>>> kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which
>>> contains the dirty gfn and then set the dirty bit within that slot. It
>>> doesn't attempt to walk all memslots to find out any other GPA which
>>> may be mapped to the same HVA.
>>>
>>> So there must be some disconnect here. let's hear from Paolo first and
>>> understand the rationale behind such situation.
>>
>> Neither did vhost when IOTLB is disabled. And cc Michael who points out this
>> issue at the beginning.
>>
>> Thanks
>>
>>
>>> Thanks
>>> Kevin
> Yes, we fixed with a kind of a work around, at the time I proposed
> a new interace to fix it fully. I don't think we ever got around
> to implementing it - right?


Paolo said userspace just needs to sync through all GPAs, so my
understanding is that the workaround is OK, albeit redundant, and so is the
API you proposed. Anything I missed?

Thanks



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-19 22:54                           ` Tian, Kevin
@ 2019-09-20  1:18                             ` Jason Wang
  2019-09-24  2:02                               ` Tian, Kevin
  0 siblings, 1 reply; 40+ messages in thread
From: Jason Wang @ 2019-09-20  1:18 UTC (permalink / raw)
  To: Tian, Kevin, Paolo Bonzini, Zhao, Yan Y
  Cc: 'Alex Williamson', qemu-devel, mst


On 2019/9/20 上午6:54, Tian, Kevin wrote:
>> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
>> Sent: Thursday, September 19, 2019 7:14 PM
>>
>> On 19/09/19 09:16, Tian, Kevin wrote:
>>>>> why GPA1 and GPA2 should be both dirty?
>>>>> even they have the same HVA due to overlaping virtual address space
>> in
>>>>> two processes, they still correspond to two physical pages.
>>>>> don't get what's your meaning :)
>>>> The point is not leave any corner case that is hard to debug or fix in
>>>> the future.
>>>>
>>>> Let's just start by a single process, the API allows userspace to maps
>>>> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are
>> equivalent,
>>>> it's ok to sync just through GPA1. That means if you only log GPA2, it
>>>> won't work.
>>> I noted KVM itself doesn't consider such situation (one HVA is mapped
>>> to multiple GPAs), when doing its dirty page tracking. If you look at
>>> kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which
>>> contains the dirty gfn and then set the dirty bit within that slot. It
>>> doesn't attempt to walk all memslots to find out any other GPA which
>>> may be mapped to the same HVA.
>>>
>>> So there must be some disconnect here. let's hear from Paolo first and
>>> understand the rationale behind such situation.
>> In general, userspace cannot assume that it's okay to sync just through
>> GPA1.  It must sync the host page if *either* GPA1 or GPA2 are marked
>> dirty.
> Agree. In this case the kernel only needs to track whether GPA1 or
> GPA2 is dirtied by guest operations.


Not necessarily guest operations.


>   The reason why vhost has to
> set both GPA1 and GPA2 is due to its own design - it maintains
> IOVA->HVA and GPA->HVA mappings thus given a IOVA you have
> to reverse lookup GPA->HVA memTable which gives multiple possible
> GPAs.


So if userspace needs to track both GPA1 and GPA2, vhost can just stop
when it finds one HVA->GPA mapping there.


>   But in concept if vhost can maintain a IOVA->GPA mapping,
> then it is straightforward to set the right GPA every time when a IOVA
> is tracked.


That means the translation is done twice by software, IOVA->GPA and then
GPA->HVA, for each packet.

Thanks


>
>> The situation really only arises in special cases.  For example,
>> 0xfffe0000..0xffffffff and 0xe0000..0xfffff might be the same memory.
>>  From "info mtree" before the guest boots:
>>
>>      0000000000000000-ffffffffffffffff (prio -1, i/o): pci
>>        00000000000e0000-00000000000fffff (prio 1, i/o): alias isa-bios
>> @pc.bios 0000000000020000-000000000003ffff
>>        00000000fffc0000-00000000ffffffff (prio 0, rom): pc.bios
>>
>> However, non-x86 machines may have other cases of aliased memory so
>> it's
>> a case that you should cover.
>>
> Above example is read-only, thus won't be touched in logdirty path.
> But now I agree that a specific architecture may define two
> writable GPA ranges with one as the alias to the other, as long as
> such case is explicitly documented so guest OS won't treat them as
> separate memory pages.
>
> Thanks
> Kevin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-20  1:15                             ` Jason Wang
@ 2019-09-20 10:02                               ` Michael S. Tsirkin
  0 siblings, 0 replies; 40+ messages in thread
From: Michael S. Tsirkin @ 2019-09-20 10:02 UTC (permalink / raw)
  To: Jason Wang
  Cc: Paolo Bonzini, Tian, Kevin, Zhao, Yan Y,
	'Alex Williamson',
	qemu-devel

On Fri, Sep 20, 2019 at 09:15:40AM +0800, Jason Wang wrote:
> 
> On 2019/9/19 下午10:06, Michael S. Tsirkin wrote:
> > On Thu, Sep 19, 2019 at 05:37:48PM +0800, Jason Wang wrote:
> > > On 2019/9/19 下午3:16, Tian, Kevin wrote:
> > > > +Paolo to help clarify here.
> > > > 
> > > > > From: Jason Wang [mailto:jasowang@redhat.com]
> > > > > Sent: Thursday, September 19, 2019 2:32 PM
> > > > > 
> > > > > 
> > > > > On 2019/9/19 下午2:17, Yan Zhao wrote:
> > > > > > On Thu, Sep 19, 2019 at 02:09:53PM +0800, Jason Wang wrote:
> > > > > > > On 2019/9/19 下午1:28, Yan Zhao wrote:
> > > > > > > > On Thu, Sep 19, 2019 at 09:05:12AM +0800, Jason Wang wrote:
> > > > > > > > > On 2019/9/18 下午4:37, Tian, Kevin wrote:
> > > > > > > > > > > From: Jason Wang [mailto:jasowang@redhat.com]
> > > > > > > > > > > Sent: Wednesday, September 18, 2019 2:10 PM
> > > > > > > > > > > 
> > > > > > > > > > > > > Note that the HVA to GPA mapping is not an 1:1 mapping. One
> > > > > HVA
> > > > > > > > > > > range
> > > > > > > > > > > > > could be mapped to several GPA ranges.
> > > > > > > > > > > > This is fine. Currently vfio_dma maintains IOVA->HVA mapping.
> > > > > > > > > > > > 
> > > > > > > > > > > > btw under what condition HVA->GPA is not 1:1 mapping? I didn't
> > > > > realize it.
> > > > > > > > > > > I don't remember the details e.g memory region alias? And neither
> > > > > kvm
> > > > > > > > > > > nor kvm API does forbid this if my memory is correct.
> > > > > > > > > > > 
> > > > > > > > > > I checked https://qemu.weilnetz.de/doc/devel/memory.html, which
> > > > > > > > > > provides an example of aliased layout. However, its aliasing is all
> > > > > > > > > > 1:1, instead of N:1. From guest p.o.v every writable GPA implies an
> > > > > > > > > > unique location. Why would we hit the situation where multiple
> > > > > > > > > > write-able GPAs are mapped to the same HVA (i.e. same physical
> > > > > > > > > > memory location)?
> > > > > > > > > I don't know, just want to say current API does not forbid this. So we
> > > > > > > > > probably need to take care it.
> > > > > > > > > 
> > > > > > > > yes, in KVM API level, it does not forbid two slots to have the same
> > > > > HVA(slot->userspace_addr).
> > > > > > > > But
> > > > > > > > (1) there's only one kvm instance for each vm for each qemu process.
> > > > > > > > (2) all ramblock->host (corresponds to HVA and slot->userspace_addr)
> > > > > in one qemu
> > > > > > > > process is non-overlapping as it's obtained from mmmap().
> > > > > > > > (3) qemu ensures two kvm slots will not point to the same section of
> > > > > one ramblock.
> > > > > > > > So, as long as kvm instance is not shared in two processes, and
> > > > > > > > there's no bug in qemu, we can assure that HVA to GPA is 1:1.
> > > > > > > Well, you leave this API for userspace, so you can't assume qemu is the
> > > > > > > only user or any its behavior. If you had you should limit it in the API
> > > > > > > level instead of open window for them.
> > > > > > > 
> > > > > > > 
> > > > > > > > But even if there are two processes operating on the same kvm
> > > > > instance
> > > > > > > > and manipulating on memory slots, adding an extra GPA along side
> > > > > current
> > > > > > > > IOVA & HVA to ioctl VFIO_IOMMU_MAP_DMA can still let driver knows
> > > > > the
> > > > > > > > right IOVA->GPA mapping, right?
> > > > > > > It looks fragile. Consider HVA was mapped to both GPA1 and GPA2.
> > > > > Guest
> > > > > > > maps IOVA to GPA2, so we have IOVA GPA2 HVA in the new ioctl and
> > > > > then
> > > > > > > log through GPA2. If userspace is trying to sync through GPA1, it will
> > > > > > > miss the dirty page. So for safety we need log both GPA1 and GPA2. (See
> > > > > > > what has been done in log_write_hva() in vhost.c). The only way to do
> > > > > > > that is to maintain an independent HVA to GPA mapping like what KVM
> > > > > or
> > > > > > > vhost did.
> > > > > > > 
> > > > > > why GPA1 and GPA2 should be both dirty?
> > > > > > even they have the same HVA due to overlaping virtual address space in
> > > > > > two processes, they still correspond to two physical pages.
> > > > > > don't get what's your meaning :)
> > > > > The point is not leave any corner case that is hard to debug or fix in
> > > > > the future.
> > > > > 
> > > > > Let's just start by a single process, the API allows userspace to maps
> > > > > HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are equivalent,
> > > > > it's ok to sync just through GPA1. That means if you only log GPA2, it
> > > > > won't work.
> > > > > 
> > > > I noted KVM itself doesn't consider such situation (one HVA is mapped
> > > > to multiple GPAs), when doing its dirty page tracking. If you look at
> > > > kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which
> > > > contains the dirty gfn and then set the dirty bit within that slot. It
> > > > doesn't attempt to walk all memslots to find out any other GPA which
> > > > may be mapped to the same HVA.
> > > > 
> > > > So there must be some disconnect here. let's hear from Paolo first and
> > > > understand the rationale behind such situation.
> > > 
> > > Neither did vhost when IOTLB is disabled. And cc Michael who points out this
> > > issue at the beginning.
> > > 
> > > Thanks
> > > 
> > > 
> > > > Thanks
> > > > Kevin
> > Yes, we fixed with a kind of a work around, at the time I proposed
> > a new interace to fix it fully. I don't think we ever got around
> > to implementing it - right?
> 
> 
> Paolo said userspace just need to sync through all GPAs, so my understanding
> is that work around is ok by redundant, so did the API you proposed.
> Anything I miss?
> 
> Thanks

I just feel an extra lookup is awkward. We don't benchmark
the speed during migration right now but it's something
we might care about down the road.

HTH

-- 
MST


^ permalink raw reply	[flat|nested] 40+ messages in thread

* RE: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-20  1:18                             ` Jason Wang
@ 2019-09-24  2:02                               ` Tian, Kevin
  2019-09-25  3:46                                 ` Jason Wang
  0 siblings, 1 reply; 40+ messages in thread
From: Tian, Kevin @ 2019-09-24  2:02 UTC (permalink / raw)
  To: Jason Wang, Paolo Bonzini, Zhao, Yan Y
  Cc: Adalbert Lazar, 'Alex Williamson', tamas, qemu-devel, mst

> From: Jason Wang [mailto:jasowang@redhat.com]
> Sent: Friday, September 20, 2019 9:19 AM
> 
> On 2019/9/20 上午6:54, Tian, Kevin wrote:
> >> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
> >> Sent: Thursday, September 19, 2019 7:14 PM
> >>
> >> On 19/09/19 09:16, Tian, Kevin wrote:
> >>>>> why GPA1 and GPA2 should be both dirty?
> >>>>> even they have the same HVA due to overlaping virtual address
> space
> >> in
> >>>>> two processes, they still correspond to two physical pages.
> >>>>> don't get what's your meaning :)
> >>>> The point is not leave any corner case that is hard to debug or fix in
> >>>> the future.
> >>>>
> >>>> Let's just start by a single process, the API allows userspace to maps
> >>>> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are
> >> equivalent,
> >>>> it's ok to sync just through GPA1. That means if you only log GPA2, it
> >>>> won't work.
> >>> I noted KVM itself doesn't consider such situation (one HVA is mapped
> >>> to multiple GPAs), when doing its dirty page tracking. If you look at
> >>> kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which
> >>> contains the dirty gfn and then set the dirty bit within that slot. It
> >>> doesn't attempt to walk all memslots to find out any other GPA which
> >>> may be mapped to the same HVA.
> >>>
> >>> So there must be some disconnect here. let's hear from Paolo first and
> >>> understand the rationale behind such situation.
> >> In general, userspace cannot assume that it's okay to sync just through
> >> GPA1.  It must sync the host page if *either* GPA1 or GPA2 are marked
> >> dirty.
> > Agree. In this case the kernel only needs to track whether GPA1 or
> > GPA2 is dirtied by guest operations.
> 
> 
> Not necessarily guest operations.
> 
> 
> >   The reason why vhost has to
> > set both GPA1 and GPA2 is due to its own design - it maintains
> > IOVA->HVA and GPA->HVA mappings thus given a IOVA you have
> > to reverse lookup GPA->HVA memTable which gives multiple possible
> > GPAs.
> 
> 
> So if userspace need to track both GPA1 and GPA2, vhost can just stop
> when it found a one HVA->GPA mapping there.
> 
> 
> >   But in concept if vhost can maintain a IOVA->GPA mapping,
> > then it is straightforward to set the right GPA every time when a IOVA
> > is tracked.
> 
> 
> That means, the translation is done twice by software, IOVA->GPA and
> GPA->HVA for each packet.
> 
> Thanks
> 

Yes, it's not necessary if we care only about the content of the dirty GPA,
as in live migration. In that case, just setting the first GPA in the loop
is sufficient, as you pointed out. However, there is one corner case I'm
not sure about. What about a usage (e.g. VM introspection) which cares only
about the guest access pattern, i.e. which GPA is dirtied, instead of poking
its content? Neither setting the first GPA nor setting all the aliasing GPAs
can provide accurate info if no explicit IOVA->GPA mapping is maintained
inside vhost. But I cannot tell whether maintaining such accuracy for aliased
GPAs is really necessary. +VM introspection guys in case they have some opinions.

Thanks
Kevin


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [Qemu-devel] vhost, iova, and dirty page tracking
  2019-09-24  2:02                               ` Tian, Kevin
@ 2019-09-25  3:46                                 ` Jason Wang
  0 siblings, 0 replies; 40+ messages in thread
From: Jason Wang @ 2019-09-25  3:46 UTC (permalink / raw)
  To: Tian, Kevin, Paolo Bonzini, Zhao, Yan Y
  Cc: Adalbert Lazar, 'Alex Williamson', tamas, qemu-devel, mst


On 2019/9/24 上午10:02, Tian, Kevin wrote:
>> From: Jason Wang [mailto:jasowang@redhat.com]
>> Sent: Friday, September 20, 2019 9:19 AM
>>
>> On 2019/9/20 上午6:54, Tian, Kevin wrote:
>>>> From: Paolo Bonzini [mailto:pbonzini@redhat.com]
>>>> Sent: Thursday, September 19, 2019 7:14 PM
>>>>
>>>> On 19/09/19 09:16, Tian, Kevin wrote:
>>>>>>> why GPA1 and GPA2 should be both dirty?
>>>>>>> even they have the same HVA due to overlaping virtual address
>> space
>>>> in
>>>>>>> two processes, they still correspond to two physical pages.
>>>>>>> don't get what's your meaning :)
>>>>>> The point is not leave any corner case that is hard to debug or fix in
>>>>>> the future.
>>>>>>
>>>>>> Let's just start by a single process, the API allows userspace to maps
>>>>>> HVA to both GPA1 and GPA2. Since it knows GPA1 and GPA2 are
>>>> equivalent,
>>>>>> it's ok to sync just through GPA1. That means if you only log GPA2, it
>>>>>> won't work.
>>>>> I noted KVM itself doesn't consider such situation (one HVA is mapped
>>>>> to multiple GPAs), when doing its dirty page tracking. If you look at
>>>>> kvm_vcpu_mark_page_dirty, it simply finds the unique memslot which
>>>>> contains the dirty gfn and then set the dirty bit within that slot. It
>>>>> doesn't attempt to walk all memslots to find out any other GPA which
>>>>> may be mapped to the same HVA.
>>>>>
>>>>> So there must be some disconnect here. let's hear from Paolo first and
>>>>> understand the rationale behind such situation.
>>>> In general, userspace cannot assume that it's okay to sync just through
>>>> GPA1.  It must sync the host page if *either* GPA1 or GPA2 are marked
>>>> dirty.
>>> Agree. In this case the kernel only needs to track whether GPA1 or
>>> GPA2 is dirtied by guest operations.
>>
>> Not necessarily guest operations.
>>
>>
>>>    The reason why vhost has to
>>> set both GPA1 and GPA2 is due to its own design - it maintains
>>> IOVA->HVA and GPA->HVA mappings thus given a IOVA you have
>>> to reverse lookup GPA->HVA memTable which gives multiple possible
>>> GPAs.
>>
>> So if userspace need to track both GPA1 and GPA2, vhost can just stop
>> when it found a one HVA->GPA mapping there.
>>
>>
>>>    But in concept if vhost can maintain a IOVA->GPA mapping,
>>> then it is straightforward to set the right GPA every time when a IOVA
>>> is tracked.
>>
>> That means, the translation is done twice by software, IOVA->GPA and
>> GPA->HVA for each packet.
>>
>> Thanks
>>
> yes, it's not necessary if we care about only the content of the dirty GPA,
> as seen in live migration. In that case, just setting the first GPA in the loop
> is sufficient as you pointed out. However there is one corner case which I'm
> not sure. What about an usage (e.g. VM introspection) which cares only
> about the guest access pattern i.e. which GPA is dirtied instead of poking
> its content? Neither setting the first GPA nor setting all the aliasing GPAs
> can provide the accurate info, if no explicit IOVA->GPA mapping is maintained
> inside vhost. But I cannot tell whether maintaining such accuracy for aliasing
> GPAs is really necessary. +VM introspection guys if they have some opinions.


Interesting. For vhost, the vIOMMU can actually pass IOVA->GPA, and vhost
can keep it and just do the GPA->HVA translation in the map command. So
it can have both an IOVA->GPA and an IOVA->HVA mapping.
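
Purely as a hypothetical illustration of that idea -- nothing like this
exists in the vhost UAPI today, and the struct layout and field names
below are invented -- the device-IOTLB update could carry the GPA alongside
the HVA, so the backend ends up holding IOVA->GPA for dirty logging and
IOVA->HVA for the datapath:

#include <stdint.h>

/*
 * HYPOTHETICAL device-IOTLB update (invented layout, not the vhost
 * UAPI): besides the IOVA and the HVA (uaddr) used by the datapath,
 * it also carries the GPA the guest mapped, so the backend can keep
 * IOVA->GPA for dirty logging and IOVA->HVA for data copies.
 */
struct iotlb_update_with_gpa {
    uint64_t iova;   /* address the device uses                */
    uint64_t size;
    uint64_t uaddr;  /* HVA backing the range (datapath)       */
    uint64_t gpa;    /* guest physical address (dirty logging) */
    uint8_t  perm;   /* read/write permission bits             */
};

/* Given a logged write at 'iova' inside this entry, the GPA to mark
 * dirty follows directly -- no GPA->HVA reverse walk is needed. */
static uint64_t dirty_gpa_for(const struct iotlb_update_with_gpa *e,
                              uint64_t iova)
{
    return e->gpa + (iova - e->iova);
}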

Thanks


>
> Thanks
> Kevin
>


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2019-09-25  3:47 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
2019-09-16  1:51 [Qemu-devel] vhost, iova, and dirty page tracking Tian, Kevin
2019-09-16  8:33 ` Jason Wang
2019-09-17  8:48   ` Tian, Kevin
2019-09-17 10:36     ` Jason Wang
2019-09-18  1:44       ` Tian, Kevin
2019-09-18  6:10         ` Jason Wang
2019-09-18  7:41           ` Tian, Kevin
2019-09-18  8:37           ` Tian, Kevin
2019-09-19  1:05             ` Jason Wang
2019-09-19  5:28               ` Yan Zhao
2019-09-19  6:09                 ` Jason Wang
2019-09-19  6:17                   ` Yan Zhao
2019-09-19  6:32                     ` Jason Wang
2019-09-19  6:29                       ` Yan Zhao
2019-09-19  6:32                         ` Yan Zhao
2019-09-19  9:35                           ` Jason Wang
2019-09-19  9:36                             ` Yan Zhao
2019-09-19 10:08                               ` Jason Wang
2019-09-19 10:06                         ` Jason Wang
2019-09-19 10:16                           ` Yan Zhao
2019-09-19 12:14                             ` Jason Wang
2019-09-19  7:16                       ` Tian, Kevin
2019-09-19  9:37                         ` Jason Wang
2019-09-19 14:06                           ` Michael S. Tsirkin
2019-09-20  1:15                             ` Jason Wang
2019-09-20 10:02                               ` Michael S. Tsirkin
2019-09-19 11:14                         ` Paolo Bonzini
2019-09-19 12:39                           ` Jason Wang
2019-09-19 12:45                             ` Paolo Bonzini
2019-09-19 22:54                           ` Tian, Kevin
2019-09-20  1:18                             ` Jason Wang
2019-09-24  2:02                               ` Tian, Kevin
2019-09-25  3:46                                 ` Jason Wang
2019-09-17 14:54     ` Alex Williamson
2019-09-18  1:31       ` Tian, Kevin
2019-09-18  6:03         ` Jason Wang
2019-09-18  7:21           ` Tian, Kevin
2019-09-19 17:20             ` Alex Williamson
2019-09-19 22:40               ` Tian, Kevin
     [not found]       ` <AADFC41AFE54684AB9EE6CBC0274A5D19D57AFB7@SHSMSX104.ccr.corp.intel.com>
2019-09-18  2:15         ` Tian, Kevin
