kvm.vger.kernel.org archive mirror
* iommufd dirty page logging overview
@ 2022-03-16 23:29 Thanos Makatos
  2022-03-16 23:50 ` Jason Gunthorpe
  2022-03-17 12:39 ` Joao Martins
  0 siblings, 2 replies; 12+ messages in thread
From: Thanos Makatos @ 2022-03-16 23:29 UTC (permalink / raw)
  To: kvm
  Cc: Joao Martins, John Levon, john.g.johnson, alex.williamson,
	Stefan Hajnoczi, Jason Gunthorpe, kevin.tian, Eric Auger,
	David Gibson, yi.l.liu

We're interested in adopting the new migration v2 interface and the new dirty page logging for /dev/iommufd in an out-of-process device emulation protocol [1]. Although it's purely userspace, we do want to stay close to the new API(s) being proposed for many reasons, mainly to re-use the QEMU implementation. The migration-related changes are relatively straightforward; I'm more interested in the dirty page logging. I've started reading the relevant email threads and my impression so far is that the details are still being decided? I don't see any commits related to dirty page logging in Yi's repo (https://github.com/luxis1999/iommufd) (at least not in the commit messages). I see that Joao has done some work using the existing dirty bitmaps (https://github.com/jpemartins/linux/commits/iommufd). Is there a rough idea of what the new dirty page logging will look like? Is this already explained in the email threads and I missed it?

[1] https://lore.kernel.org/all/a9b696ca38ee2329e371c28bcaa2921cac2a48a2.1641584316.git.john.g.johnson@oracle.com/


* Re: iommufd dirty page logging overview
  2022-03-16 23:29 iommufd dirty page logging overview Thanos Makatos
@ 2022-03-16 23:50 ` Jason Gunthorpe
  2022-03-18  9:23   ` Tian, Kevin
  2022-03-17 12:39 ` Joao Martins
  1 sibling, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2022-03-16 23:50 UTC (permalink / raw)
  To: Thanos Makatos
  Cc: kvm, Joao Martins, John Levon, john.g.johnson, alex.williamson,
	Stefan Hajnoczi, kevin.tian, Eric Auger, David Gibson, yi.l.liu

On Wed, Mar 16, 2022 at 11:29:42PM +0000, Thanos Makatos wrote:

> We're interested in adopting the new migration v2 interface and the
> new dirty page logging for /dev/iommufd in an out-of-process device
> emulation protocol [1]. Although it's purely userspace, we do want
> to stay close to the new API(s) being proposed for many reasons,
> mainly to re-use the QEMU implementation. The migration-related
> changes are relatively straightforward, I'm more interested in the
> dirty page logging. I've started reading the relevant email threads
> and my impression so far is that the details are still being
> decided? 

Yes

Joao has made the most progress so far

> there a rough idea of how the new dirty page logging will look like?
> Is this already explained in the email threads an I missed it?

I'm hoping to get something to show in the next few weeks, but what
I've talked about previously is to have two things:

1) Control and reporting of dirty tracking via the system IOMMU
   through the iommu_domain interface exposed by iommufd

2) Control and reporting of dirty tracking via a VFIO migration
   capable device's internal tracking through a VFIO_DEVICE_FEATURE
   interface similar to the v2 migration interface

The two APIs would be semantically very similar but target different
HW blocks. Userspace would be in charge of deciding which dirty tracker
to use and how to configure it.

I'm expecting that semantically everything would look like the current
dirty tracking draft, where you start/stop tracking then read&clear a
bitmap for a range of IOVA and integrate that with the dirty data
from KVM.
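
As a very rough sketch of the userspace flow I have in mind (all
function names below are made up purely for illustration, they are not
the actual ioctls):

    /* start dirty tracking on the chosen tracker (iommu_domain or device) */
    dirty_tracking_start(tracker);

    /* precopy: repeatedly read & clear the bitmap for an IOVA range and
     * fold it into the same migration bitmap the KVM dirty log feeds */
    while (precopy_running) {
            dirty_bitmap_read_clear(tracker, iova, length, bitmap);
            merge_into_migration_bitmap(bitmap, iova, length);
    }

    /* stop copy: quiesce tracking, then do a final read & clear */
    dirty_tracking_stop(tracker);
    dirty_bitmap_read_clear(tracker, iova, length, bitmap);
    merge_into_migration_bitmap(bitmap, iova, length);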

Jason



* Re: iommufd dirty page logging overview
  2022-03-16 23:29 iommufd dirty page logging overview Thanos Makatos
  2022-03-16 23:50 ` Jason Gunthorpe
@ 2022-03-17 12:39 ` Joao Martins
  1 sibling, 0 replies; 12+ messages in thread
From: Joao Martins @ 2022-03-17 12:39 UTC (permalink / raw)
  To: Thanos Makatos
  Cc: John Levon, john.g.johnson, alex.williamson, Stefan Hajnoczi,
	Jason Gunthorpe, kevin.tian, Eric Auger, David Gibson, yi.l.liu,
	kvm

On 3/16/22 23:29, Thanos Makatos wrote:
> We're interested in adopting the new migration v2 interface and the new dirty
> page logging for /dev/iommufd in an out-of-process device emulation protocol
> [1]. Although it's purely userspace, we do want to stay close to the new API(s)
> being proposed for many reasons, mainly to re-use the QEMU implementation. The
> migration-related changes are relatively straightforward, I'm more interested
> in the dirty page logging. I've started reading the relevant email threads and
> my impression so far is that the details are still being decided? I don't see
> any commits related to dirty page logging in Yi's repo
> (https://github.com/luxis1999/iommufd) (at least not in the commit messages). I
> see that Joao has done some work using the existing dirty bitmaps
> (https://github.com/jpemartins/linux/commits/iommufd).

So far this branch covers the vfio-compat side of iommufd (with an AMD
IOMMU implementation), solely because it was the easiest place to start given
the existing userspace (qemu). There's also a qemu counterpart with an emulated
AMD IOMMU implementation (should you lack the hardware). I'll be updating those
branches as things evolve, i.e. once I have an initial version of the iommufd
native API and support for more IOMMUs. Whether the vfio-compat part remains is TBD.

TBH, much as we have been discussing on the list -- and as Jason reiterated
yesterday -- I too don't expect the API semantics to diverge from the current
VFIO system-IOMMU tracking. Userspace-facing dirty reporting eventually gets a
bitmap, with one bit representing a <page-size>; the bitmap covers a range with
a "base" iova and a <size> (a subset of a previously DMA-mapped range) that
*matches* the size of the bitmap. You then start/stop tracking and read the
dirty data, and lastly DMA unmaps also fetch the dirty data (if requested). The
device dirty tracking (via PCI) ought to model the target PF/VF vendor
interface, but that is not iommufd.
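
To make the bitmap layout concrete, bit i covers the page at
base + i * <page-size>; a minimal sketch of how userspace would test a
bit (names and types invented for illustration, not a committed ABI):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * The bitmap covers [base, base + size) at page_size granularity,
     * so bit i corresponds to the page at base + i * page_size and
     * the bitmap needs (size / page_size) bits in total.
     */
    static bool iova_is_dirty(const uint64_t *bitmap, uint64_t base,
                              uint64_t page_size, uint64_t iova)
    {
            uint64_t bit = (iova - base) / page_size;

            return (bitmap[bit / 64] >> (bit % 64)) & 1;
    }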

The new things IOMMU-wise I expect are more about what the current API doesn't
capture. These are somewhat unrelated to the tracking control / reporting itself
and more about the IO page table mappings, i.e. changing the domain's
<page-size> in place/dynamically to increase the granularity of the dirty
tracking.

[*] Interestingly, arm64 SMMUv3.x seems to have a notion of 'stalling'
transactions (not sure if all kinds are supported) and letting the CPU retry
them as if the endpoint had just requested ... without depending on endpoint
PRI support.

> Is there a rough idea of
> how the new dirty page logging will look like? Is this already explained in the
> email threads an I missed it?
>
Given that you came across the repo, I suppose you went through all the threads :)


* RE: iommufd dirty page logging overview
  2022-03-16 23:50 ` Jason Gunthorpe
@ 2022-03-18  9:23   ` Tian, Kevin
  2022-03-18 12:41     ` Jason Gunthorpe
  0 siblings, 1 reply; 12+ messages in thread
From: Tian, Kevin @ 2022-03-18  9:23 UTC (permalink / raw)
  To: Jason Gunthorpe, Thanos Makatos
  Cc: kvm, Martins, Joao, John Levon, john.g.johnson, alex.williamson,
	Stefan Hajnoczi, Eric Auger, David Gibson, Liu, Yi L

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, March 17, 2022 7:51 AM
> 
> > there a rough idea of how the new dirty page logging will look like?
> > Is this already explained in the email threads an I missed it?
> 
> I'm hoping to get something to show in the next few weeks, but what
> I've talked about previously is to have two things:
> 
> 1) Control and reporting of dirty tracking via the system IOMMU
>    through the iommu_domain interface exposed by iommufd
> 
> 2) Control and reporting of dirty tracking via a VFIO migration
>    capable device's internal tracking through a VFIO_DEVICE_FEATURE
>    interface similar to the v2 migration interface
> 
> The two APIs would be semantically very similar but target different
> HW blocks. Userspace would be in charge to decide which dirty tracker
> to use and how to configure it.
> 

For the 2nd option I suppose userspace is expected to retrieve
dirty bits via VFIO_DEVICE_FEATURE before every iommufd
unmap operation in the precopy phase, just as we need to return
the dirty bitmap to userspace in the iommufd unmap interface in
the 1st option. Correct?

Is there any value in having iommufd pull the dirty bitmap from
the vfio driver, so that userspace can just stick to a unified
iommufd interface for dirty pages no matter whether they are tracked
by the system IOMMU or by device IP? Sorry if this has been discussed
in previous threads which I haven't fully checked.

Thanks
Kevin


* Re: iommufd dirty page logging overview
  2022-03-18  9:23   ` Tian, Kevin
@ 2022-03-18 12:41     ` Jason Gunthorpe
  2022-03-18 15:06       ` Alex Williamson
                         ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 12:41 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Thanos Makatos, kvm, Martins, Joao, John Levon, john.g.johnson,
	alex.williamson, Stefan Hajnoczi, Eric Auger, David Gibson, Liu,
	Yi L

On Fri, Mar 18, 2022 at 09:23:49AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Thursday, March 17, 2022 7:51 AM
> > 
> > > there a rough idea of how the new dirty page logging will look like?
> > > Is this already explained in the email threads an I missed it?
> > 
> > I'm hoping to get something to show in the next few weeks, but what
> > I've talked about previously is to have two things:
> > 
> > 1) Control and reporting of dirty tracking via the system IOMMU
> >    through the iommu_domain interface exposed by iommufd
> > 
> > 2) Control and reporting of dirty tracking via a VFIO migration
> >    capable device's internal tracking through a VFIO_DEVICE_FEATURE
> >    interface similar to the v2 migration interface
> > 
> > The two APIs would be semantically very similar but target different
> > HW blocks. Userspace would be in charge to decide which dirty tracker
> > to use and how to configure it.
> > 
> 
> for the 2nd option I suppose userspace is expected to retrieve
> dirty bits via VFIO_DEVICE_FEATURE before every iommufd 
> unmap operation in precopy phase, just like why we need return
> the dirty bitmap to userspace in iommufd unmap interface in
> the 1st option. Correct?

It would have to be after unmap, not before

> Is there any value of having iommufd pull dirty bitmap from
> vfio driver then the userspace can just stick to a unified
> iommufd interface for dirty pages no matter they are tracked
> by system IOMMU or device IP? Sorry if this has been discussed
> in previous threads which I haven't fully checked.

It is something to discuss; this is sort of what the current vfio
interface imagines.

But to do it we need to build a whole bunch of infrastructure to
register and control these things and add new ioctls to vfio to
support this. I'm not sure we get a sufficient benefit to be
worthwhile; in fact it is probably a net loss, as we lose the ability
for userspace to pull the dirty bits from multiple device trackers in
parallel with threading.
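
(To make the threading point concrete: with a per-device interface each
tracker can be drained on its own thread, roughly as below --
device_dirty_read_clear() and the struct are stand-ins for illustration,
not a real ABI.)

    #include <pthread.h>
    #include <stdint.h>

    struct tracker {
            int device_fd;          /* device with its own dirty tracker */
            uint64_t iova, length;
            uint64_t *bitmap;
    };

    /* pthread_create() one of these per device; they all report in
     * parallel instead of being serialized behind a single call */
    static void *drain_tracker(void *arg)
    {
            struct tracker *t = arg;

            device_dirty_read_clear(t->device_fd, t->iova, t->length,
                                    t->bitmap);
            return NULL;
    }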

Jason


* Re: iommufd dirty page logging overview
  2022-03-18 12:41     ` Jason Gunthorpe
@ 2022-03-18 15:06       ` Alex Williamson
  2022-03-18 15:55         ` Jason Gunthorpe
  2022-03-19  7:54       ` Tian, Kevin
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Alex Williamson @ 2022-03-18 15:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Thanos Makatos, kvm, Martins, Joao, John Levon,
	john.g.johnson, Stefan Hajnoczi, Eric Auger, David Gibson, Liu,
	Yi L

On Fri, 18 Mar 2022 09:41:08 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Fri, Mar 18, 2022 at 09:23:49AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, March 17, 2022 7:51 AM
> > >   
> > > > there a rough idea of how the new dirty page logging will look like?
> > > > Is this already explained in the email threads an I missed it?  
> > > 
> > > I'm hoping to get something to show in the next few weeks, but what
> > > I've talked about previously is to have two things:
> > > 
> > > 1) Control and reporting of dirty tracking via the system IOMMU
> > >    through the iommu_domain interface exposed by iommufd
> > > 
> > > 2) Control and reporting of dirty tracking via a VFIO migration
> > >    capable device's internal tracking through a VFIO_DEVICE_FEATURE
> > >    interface similar to the v2 migration interface
> > > 
> > > The two APIs would be semantically very similar but target different
> > > HW blocks. Userspace would be in charge to decide which dirty tracker
> > > to use and how to configure it.
> > >   
> > 
> > for the 2nd option I suppose userspace is expected to retrieve
> > dirty bits via VFIO_DEVICE_FEATURE before every iommufd 
> > unmap operation in precopy phase, just like why we need return
> > the dirty bitmap to userspace in iommufd unmap interface in
> > the 1st option. Correct?  
> 
> It would have to be after unmap, not before
> 
> > Is there any value of having iommufd pull dirty bitmap from
> > vfio driver then the userspace can just stick to a unified
> > iommufd interface for dirty pages no matter they are tracked
> > by system IOMMU or device IP? Sorry if this has been discussed
> > in previous threads which I haven't fully checked.  
> 
> It is something to discuss, this is sort of what the current vfio
> interface imagines

Yes.

> But to do it we need to build a whole bunch of infrastructure to
> register and control these things and add new ioctls to vfio to
> support this. I'm not sure we get a sufficient benifit to be
> worthwhile, infact it is probably a net loss as we loose the ability
> for userspace to pull the dirty bits from multiple device trackers in
> parallel with threading.

It seems like the new ioctls and such would be to configure the 2nd
option; the current assumption is that the iommu collects all dirty bits.

There are advantages to each: the 2nd option gives the user more
visibility and more options to thread, but it also possibly duplicates
significant data.

The unmap scenario above is also not quite as cohesive if the user
needs to poll devices for dirty pages in the unmapped range after
performing the unmap.  It might make sense if the iommufd could
generate the merged bitmap on unmap, as the threading optimization
probably has less value in that case.  Thanks,

Alex



* Re: iommufd dirty page logging overview
  2022-03-18 15:06       ` Alex Williamson
@ 2022-03-18 15:55         ` Jason Gunthorpe
  0 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2022-03-18 15:55 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Tian, Kevin, Thanos Makatos, kvm, Martins, Joao, John Levon,
	john.g.johnson, Stefan Hajnoczi, Eric Auger, David Gibson, Liu,
	Yi L

On Fri, Mar 18, 2022 at 09:06:36AM -0600, Alex Williamson wrote:

> There are advantages to each, the 2nd option gives the user more
> visibility, more options to thread, but it also possibly duplicates
> significant data.

The coming mlx5 tracker won't require kernel storage at all, so I
think this is something to tackle if/when someone comes along with a
device that uses the CPU to somehow track dirties (probably via an
mdev that is already tracking DMA?)

One thought is to let vfio coordinate a single allocation of a dirty
bitmap xarray among drivers.

Even in the worst case of duplicated bitmaps the memory usage is not
fatally terrible: it is about 32MB per 1TB of guest memory.
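
(Back of the envelope, assuming 4KB pages: 1TB / 4KB = 256M pages, and
at one bit per page that is 256M bits = 32MB per copy of the bitmap.)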

> The unmap scenario above is also not quite as cohesive if the user
> needs to poll devices for dirty pages in the unmapped range after
> performing the unmap.  It might make sense if the iommufd could
> generate the merged bitmap on unmap as the threading optimization
> probably has less value in that case.

I don't think of it this way. The device tracker has no idea about
munmap/mmap; it just tracks IOVA dirties.

Which is a problem because any time we alter the IOVA to PFN map we
need to read the device dirties and correlate them back to the actual
CPU pages that were dirtied.

Unmap is one case, but nested paging invalidation is another, much
nastier problem. How exactly that can work is a bit of a mystery to me,
as the ultimate IOVA to PFN mapping is rather fuzzy/racy from the view
of the hypervisor.

So, I wouldn't invest effort in a special kernel API to link
unmap and leave invalidate unsolved. Just keeping them separated seems
to make more sense, and userspace knows better what it is doing. E.g.
vIOMMU cases need to synchronize the dirty data, but other things like
memory unplug don't.

From another perspective, what we want is for the system iommu to be
as lean and fast as possible because in, say, 10 years it will be the
dominant way to do this job. This is another reason I'm reluctant to
co-mingle it with device trackers in a way that might limit it.

Though overall I think threading is the key argument. Given this is
time critical for stop copy, and all trackers can report fully in
parallel, we should strive to allow userspace to thread them.

Jason


* RE: iommufd dirty page logging overview
  2022-03-18 12:41     ` Jason Gunthorpe
  2022-03-18 15:06       ` Alex Williamson
@ 2022-03-19  7:54       ` Tian, Kevin
  2022-03-19  8:14       ` Tian, Kevin
  2022-03-20  3:34       ` Tian, Kevin
  3 siblings, 0 replies; 12+ messages in thread
From: Tian, Kevin @ 2022-03-19  7:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thanos Makatos, kvm, Martins, Joao, John Levon, john.g.johnson,
	alex.williamson, Stefan Hajnoczi, Eric Auger, David Gibson, Liu,
	Yi L

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, March 18, 2022 8:41 PM
> 
> On Fri, Mar 18, 2022 at 09:23:49AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Thursday, March 17, 2022 7:51 AM
> > >
> > > > there a rough idea of how the new dirty page logging will look like?
> > > > Is this already explained in the email threads an I missed it?
> > >
> > > I'm hoping to get something to show in the next few weeks, but what
> > > I've talked about previously is to have two things:
> > >
> > > 1) Control and reporting of dirty tracking via the system IOMMU
> > >    through the iommu_domain interface exposed by iommufd
> > >
> > > 2) Control and reporting of dirty tracking via a VFIO migration
> > >    capable device's internal tracking through a VFIO_DEVICE_FEATURE
> > >    interface similar to the v2 migration interface
> > >
> > > The two APIs would be semantically very similar but target different
> > > HW blocks. Userspace would be in charge to decide which dirty tracker
> > > to use and how to configure it.
> > >
> >
> > for the 2nd option I suppose userspace is expected to retrieve
> > dirty bits via VFIO_DEVICE_FEATURE before every iommufd
> > unmap operation in precopy phase, just like why we need return
> > the dirty bitmap to userspace in iommufd unmap interface in
> > the 1st option. Correct?
> 
> It would have to be after unmap, not before
> 

Why? After unmap, a dirty GPA page in the unmapped range is
meaningless to userspace since there is no backing PFN for that
GPA.

Thanks
Kevin


* RE: iommufd dirty page logging overview
  2022-03-18 12:41     ` Jason Gunthorpe
  2022-03-18 15:06       ` Alex Williamson
  2022-03-19  7:54       ` Tian, Kevin
@ 2022-03-19  8:14       ` Tian, Kevin
  2022-03-20  3:34       ` Tian, Kevin
  3 siblings, 0 replies; 12+ messages in thread
From: Tian, Kevin @ 2022-03-19  8:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thanos Makatos, kvm, Martins, Joao, John Levon, john.g.johnson,
	alex.williamson, Stefan Hajnoczi, Eric Auger, David Gibson, Liu,
	Yi L

> From: Tian, Kevin
> Sent: Saturday, March 19, 2022 3:55 PM
> 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, March 18, 2022 8:41 PM
> >
> > On Fri, Mar 18, 2022 at 09:23:49AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Thursday, March 17, 2022 7:51 AM
> > > >
> > > > > there a rough idea of how the new dirty page logging will look like?
> > > > > Is this already explained in the email threads an I missed it?
> > > >
> > > > I'm hoping to get something to show in the next few weeks, but what
> > > > I've talked about previously is to have two things:
> > > >
> > > > 1) Control and reporting of dirty tracking via the system IOMMU
> > > >    through the iommu_domain interface exposed by iommufd
> > > >
> > > > 2) Control and reporting of dirty tracking via a VFIO migration
> > > >    capable device's internal tracking through a VFIO_DEVICE_FEATURE
> > > >    interface similar to the v2 migration interface
> > > >
> > > > The two APIs would be semantically very similar but target different
> > > > HW blocks. Userspace would be in charge to decide which dirty tracker
> > > > to use and how to configure it.
> > > >
> > >
> > > for the 2nd option I suppose userspace is expected to retrieve
> > > dirty bits via VFIO_DEVICE_FEATURE before every iommufd
> > > unmap operation in precopy phase, just like why we need return
> > > the dirty bitmap to userspace in iommufd unmap interface in
> > > the 1st option. Correct?
> >
> > It would have to be after unmap, not before
> >
> 
> why? after unmap a dirty GPA page in the unmapped range is
> meaningless to userspace since there is no backing PFN for that
> GPA.
> 

Let me make it more specific by taking vIOMMU as an example:
no nesting, i.e. Qemu generates a merged mapping for GIOVA->HPA
via iommufd.

The iommufd unmap is triggered when emulating a virtual iotlb
invalidation request, *after* the guest iommu driver has cleared the
guest I/O page table for the specified GIOVA range.

The dirty bits recorded by the device are keyed to the dma addresses
programmed by the guest, i.e. GIOVAs.

Now if qemu pulls dirty bits from the vfio device after the iommufd
unmap, how would qemu even know the corresponding PFN/VA for the dirty
GFNs, given that the guest I/O mapping has been cleared?

This might not be a problem for dpdk, where the mapping is managed
by the application itself and thus that knowledge is not lost after the
iommufd unmap. But concept-wise I feel pulling dirty bits before
destroying the related mappings makes more sense, as translating dirty
bits to the underlying PFNs is kind of a use of the mapping.

Thanks
Kevin


* RE: iommufd dirty page logging overview
  2022-03-18 12:41     ` Jason Gunthorpe
                         ` (2 preceding siblings ...)
  2022-03-19  8:14       ` Tian, Kevin
@ 2022-03-20  3:34       ` Tian, Kevin
  2022-03-21 13:30         ` Jason Gunthorpe
  3 siblings, 1 reply; 12+ messages in thread
From: Tian, Kevin @ 2022-03-20  3:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thanos Makatos, kvm, Martins, Joao, John Levon, john.g.johnson,
	alex.williamson, Stefan Hajnoczi, Eric Auger, David Gibson, Liu,
	Yi L

> From: Tian, Kevin
> Sent: Saturday, March 19, 2022 4:15 PM
> 
> 
> Let me make it more specific by taking vIOMMU as an example.
> No nesting i.e. Qemu generates a merged mapping for GIOVA->HPA
> via iommufd.
> 
> iommufd unmap is caused when emulating virtual iotlb invalidation
> request, *after* the guest iommu driver clears the guest I/O page
> table for the specified GIOVA range.
> 
> The dirty bits recorded by the device is around the dma addresses
> programmed by the guest, i.e. GIOVA.
> 
> Now if qemu pulls dirty bits from vfio device after iommufd unmap,
> how would qemu even know the corresponding PFN/VA for dirty
> GFNs given the guest I/O mapping has been cleared?
> 

Thinking about it more, the real problem is not a *before* vs. *after*
thing. :/ If Qemu itself doesn't maintain a virtual iotlb (large enough
to duplicate all valid mappings in the guest I/O page table) and ensure
that the cached mappings for the unmapped range are not zapped before
the dirty bitmap for that range is digested, then dirty tracking is
simply broken in this scenario no matter which approach is used and
whether the bitmap is retrieved before or after the iommufd unmap:
the guest mappings for dirtied GIOVAs in the unmapped range have
already disappeared at that point, so the path to find GIOVA->GPA->HVA
is gone.

I roughly recall that a gap in the Qemu viotlb was discussed when the
dirty bitmap was added to vfio unmap. At that time Qemu's viotlb was
like a normal iotlb, i.e. it only cached mappings created by walking
the guest page table when emulating DMA from non-vfio devices in Qemu.
That is definitely inadequate for the aforementioned purpose.

But I don't know whether this gap has been fixed by now.

There is no such concern with dpdk or a VM w/o vIOMMU, since the
iova address space is managed by host userspace, which has intrinsic
knowledge of IOVA<->HVA even after the iommufd unmap.

This is also fine with hardware nesting. The hardware ensures all
stage-1 activity converges on the dirty bits of the stage-2 IOPTEs,
so userspace can ignore stage-1 and just collect the dirty bitmap
associated with stage-2.

Thanks
Kevin


* Re: iommufd dirty page logging overview
  2022-03-20  3:34       ` Tian, Kevin
@ 2022-03-21 13:30         ` Jason Gunthorpe
  2022-03-22  2:40           ` Tian, Kevin
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2022-03-21 13:30 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Thanos Makatos, kvm, Martins, Joao, John Levon, john.g.johnson,
	alex.williamson, Stefan Hajnoczi, Eric Auger, David Gibson, Liu,
	Yi L

On Sun, Mar 20, 2022 at 03:34:30AM +0000, Tian, Kevin wrote:

> Thinking more the real problem is not related to *before* vs. *after*
> thing. :/ If Qemu itself doesn't maintain a virtual iotlb

It has to be after because only unmap guarantees that DMA is
completely stopped.

qemu must ensure it doesn't change the user VA to GPA mapping between
the unmap and fetching the device dirties, and that it doesn't install
something else into that IOVA.

Yes the physical PFNs can be shuffled around by the kernel due to the
lost page pin, but the logical dirty is really attached to qemu's
process VA (HVA), not the physical PFN.

It has to do this in all cases, device tracker or not: when it
unmaps the IOVA it must know what HVA it put there and translate the
dirties to that bitmap.
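
In pseudo-code the per-range flow on the qemu side would be roughly the
below (helper names invented; the only point is the ordering and that
the IOVA->HVA record outlives the unmap):

    hva = iova_to_hva_lookup(iova);       /* qemu's own bookkeeping */

    iommufd_unmap(iova, length);          /* DMA is now fully stopped */
    device_dirty_read_clear(device_fd, iova, length, bitmap);

    /* translate the IOVA dirties onto the HVA-based migration bitmap */
    migration_bitmap_mark_dirty(hva, length, bitmap);

    iova_to_hva_forget(iova);             /* only now drop the record */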

> given guest mappings for dirtied GIOVAs in the unmapped range
> already disappear at that point thus the path to find GIOVA->GPA->HVA
> is just broken.

qemu has to keep track of how IOVAs translate to HVAs - maybe we could
have the kernel return the HVA during unmap as well (it already stores
it), but this has some complications..

Fundamentally from a qemu perspective it is translating everything to
UVA because UVA is what the live migration machinery uses.

But these are all qemu problems and don't really help inform the
kernel API..

Jason


* RE: iommufd dirty page logging overview
  2022-03-21 13:30         ` Jason Gunthorpe
@ 2022-03-22  2:40           ` Tian, Kevin
  0 siblings, 0 replies; 12+ messages in thread
From: Tian, Kevin @ 2022-03-22  2:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thanos Makatos, kvm, Martins, Joao, John Levon, john.g.johnson,
	alex.williamson, Stefan Hajnoczi, Eric Auger, David Gibson, Liu,
	Yi L

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, March 21, 2022 9:31 PM
> 
> On Sun, Mar 20, 2022 at 03:34:30AM +0000, Tian, Kevin wrote:
> 
> > Thinking more the real problem is not related to *before* vs. *after*
> > thing. :/ If Qemu itself doesn't maintain a virtual iotlb
> 
> It has to be after because only unmap guarentees that DMA is
> completely stopped.

In concept, yes.

In reality there is probably no sw-visible difference. A sane driver doing
an unmap is expected to first stop the source (i.e. the device) from using
the unmapped buffer and then clear the iommu mapping. Once the former is
completed, the dirty bitmap of a given range won't change between before
and after the unmap.

In the case of a driver bug which fails to stop the device use in the first
place, losing the dirty bits across the unmap doesn't sound like a problem,
as the user cannot expect deterministic behavior in such a scenario anyway.

But I didn't intend to advocate 'before', as there is no value in doing so
and 'after' is conceptually correct per your explanation.

> 
> qemu must ensure it doesn't change the user VA to GPA mapping between
> unmap and device fetch dirty, or install something else into that
> IOVA.
> 
> Yes the physical PFNs can be shuffled around by the kernel due to the
> lost page pin, but the logical dirty is really attached to qemu's
> process VA (HVA), not the physical PFN.
> 
> It has to do this in all cases regardless of device or not - when it
> unmaps the IOVA it must know what HVA it put there and translate the
> dirties to that bitmap.
> 
> > given guest mappings for dirtied GIOVAs in the unmapped range
> > already disappear at that point thus the path to find GIOVA->GPA->HVA
> > is just broken.
> 
> qemu has to keep track of how IOVAs translate to HVAs - maybe we could
> have the kernel return the HVA during unmap as well, it already stores
> it, but this has some complications..

Qemu has that information. The key, as you said, is that Qemu shouldn't
destroy it before the dirty bitmap is translated.

> 
> Fundamentally from a qemu perspective it is translating everything to
> UVA because UVA is what the live migration machinery uses.
> 
> But this is all qemu problems and doesn't really help inform the
> kernel API..
> 

Yes, and this is the merit of hw nesting and IOMMU dirty bits. Otherwise
Qemu has to bear the burden of maintaining a copy of the guest page table
besides the shadow one maintained in the kernel.

Thanks
Kevin


Thread overview: 12+ messages
2022-03-16 23:29 iommufd dirty page logging overview Thanos Makatos
2022-03-16 23:50 ` Jason Gunthorpe
2022-03-18  9:23   ` Tian, Kevin
2022-03-18 12:41     ` Jason Gunthorpe
2022-03-18 15:06       ` Alex Williamson
2022-03-18 15:55         ` Jason Gunthorpe
2022-03-19  7:54       ` Tian, Kevin
2022-03-19  8:14       ` Tian, Kevin
2022-03-20  3:34       ` Tian, Kevin
2022-03-21 13:30         ` Jason Gunthorpe
2022-03-22  2:40           ` Tian, Kevin
2022-03-17 12:39 ` Joao Martins
