Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver

From: Joao Martins <joao.m.martins@oracle.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	"mgurtovoy@nvidia.com" <mgurtovoy@nvidia.com>,
	Linuxarm <linuxarm@huawei.com>,
	liulongfang <liulongfang@huawei.com>,
	"Zengtao (B)" <prime.zeng@hisilicon.com>,
	yuzenghui <yuzenghui@huawei.com>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	"Wangzhou (B)" <wangzhou1@hisilicon.com>
Subject: Re: [RFC v2 0/4] vfio/hisilicon: add acc live migration driver
Date: Fri, 11 Feb 2022 17:28:22 +0000	[thread overview]
Message-ID: <0cd1d33a-39b1-9104-a4f8-ff2338699a41@oracle.com> (raw)
In-Reply-To: <20220204230750.GR1786498@nvidia.com>

On 2/4/22 23:07, Jason Gunthorpe wrote:
> On Fri, Feb 04, 2022 at 07:53:12PM +0000, Joao Martins wrote:
>> On 2/3/22 15:18, Jason Gunthorpe wrote:
>>> On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
>>>> On 2/2/22 17:03, Jason Gunthorpe wrote:
>>>>> how to integrate that with the iommufd work, which I hope will allow
>>>>> that series, and the other IOMMU drivers that can support this to be
>>>>> merged..
>>>>
>>>> The iommu-fd thread wasn't particularly obvious on how dirty tracking is done
>>>> there, but TBH I am not up to speed on iommu-fd yet so I missed something
>>>> obvious for sure. When you say 'integrate that with the iommufd' can you
>>>> expand on that?
>>>
>>> The general idea is that iommufd is the place to put all the iommu
>>> driver uAPI for consumption by userspace. The IOMMU feature of dirty
>>> tracking would belong there.
>>>
>>> So, some kind of API needs to be designed to meet the needs of the
>>> IOMMU drivers.
>>>
>> /me nods
>>
>> I am gonna assume below is the most up-to-date to iommufd (as you pointed
>> out in another thread IIRC):
>>
>>   https://github.com/jgunthorpe/linux iommufd
>>
>> Let me know if it's not :)
> 
> The iommufd part is pretty good, but there is hacky patch to hook it
> into vfio that isn't there, if you want to actually try it.
> 
OK -- Thanks for the heads up.

>>> But, as you say, it looks unnatural and inefficient when the domain
>>> itself is storing the dirty bits inside the IOPTE.
>>
>> How much is this already represented as the io-pgtable in IOMMU internal kAPI
>> (if we exclude the UAPI portion of iommufd for now) ? FWIW, that is today
>> used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :(
> 
> Which are you looking at? AFACIT there is no diry page support in
> iommu_ops ?
> 
There is not, indeed. But how you manage IO page tables felt *somewhat
hinting* to io-pgtable -- but maybe it's a stretch. The dirty
page support builds on top of that, as ARM IOMMUs use iommu io-pgtable.

In the series in the link, it pretty much boils down to
essentially 3 iommu_ops introduced:

* support_dirty_log(domain)
  - Checks whether a given IOMMU domain supports dirty tracking
* switch_dirty_log(domain, enable, iova, size, prot)
  - Enables/Disables IOMMU dirty tracking on an IOVA range
  (The range I suppose it's because ARM supports range tracking, contrary
   to Intel and AMD which affect all the protection domain pg tables)
* sync_dirty_log(domain, iova, size, &bitmap, ...)

The IO pgtables, have an interesting set of APIs strictly for page table
manipulation, and they are sort of optional (so far ARM and AMD). Those
include, map, unmap, iova_to_pfn translation. So, the dirty log supports
for ARM builds on that and adds split, merge (for splitting and collapsing
IO pagtables), and finally sync_dirty_log() which is where we get the
bitmap of dirtied pages *updated*.

But well, at the end of the day for an IOMMU driver the domain ops are
the important stuff, maybe IO pgtable framework isn't as critical
(Intel for example, doesn't use that at all).

>> then potentially VMM/process can more efficiently scan the dirtied
>> set? But if some layer needs to somehow mediate between the vendor
>> IOPTE representation and an UAPI IOPTE representation, to be able to
>> make that delegation to userspace ... then maybe both might be
>> inefficient?  I didn't see how iommu-fd would abstract the IOPTEs
>> lookup as far as I glanced through the code, perhaps that's another
>> ioctl().
> 
> It is based around the same model as VFIO container - map/unmap of
> user address space into the IOPTEs and the user space doesn't see
> anything resembling a 'pte' - at least for kernel owned IO page
> tables.
> 
Ah! I misinterpreted what you were saying.

> User space page tables will not be abstracted and the userspace must
> know the direct HW format of the IOMMU they is being used.
> 
That's countering earlier sentence? Because hw format (for me at least)
means PTE and protection domain config format too. And if iommufd
abstracts the HW format, modelling after the IOMMU domain and its ops,
then it's abstracting userspace from those details e.g. it works over
IOVAs, but not over its vendor representation of how that IOVA is set up.

I am probably being dense.

>> But what strikes /specifically/ on the dirty bit feature is that it looks
>> simpler with the current VFIO, the heavy lifting seems to be
>> mostly on the IOMMU vendor. The proposed API above for VFIO looking at
>> the container (small changes), and IOMMU vendor would do most of it:
> 
> It is basically the same, almost certainly the user API in iommufd
> will be some 'get dirty bits' and 'unmap and give me the dirty bits'
> just like vfio has.
> 

The 'unmap and give dirty bits' looks to be something TBD even in a VFIO
migration flow. Considering that we pause vCPUs, and we quiesce the
VFs, I guess we are mostly supposed to have a guarantee that no DMA is
supposed to be happening (excluding P2P)? So perhaps the unmap is
unneeded after quiescing the VF. Which probably is important
migration-wise considering the unmap is really what's expensive (e.g.
needs IOTLB flush) and that would add up between stop copy and the time it
would resume on the destination, if we need to wait to unmap the whole IOVA
from source.

> The tricky details are around how do you manage this when the system
> may have multiple things invovled capable, or not, of actualy doing
> dirty tracking.
> 
Yeap.

>> At the same time, what particularly scares me perf-wise (for the
>> device being migrated) ... is the fact that we need to dynamically
>> split and collapse page tables to increase the granularity of which
>> we track. In the above interface it splits/collapses when you turn
>> on/off the dirty tracking (respectively). That's *probably* where we
>> need more flexibility, not sure.
> 
> For sure that is a particularly big adventure in the iommu driver..
> 
Yeah.

Me playing around was with VFIO IOMMU hugepages disabled, and this
was something I am planning on working when I get back to this.

>> Do you have thoughts on what such device-dirty interface could look like?
>> (Perhaps too early to poke while the FSM/UAPI is being worked out)
> 
> I've been thinking the same general read-and-clear of a dirty
> bitmap. It matches nicely the the KVM interface.
> 

A bitmap I think it is a good start.

We have a bitmap based interface in KVM, but there's also a recent ring
interface for dirty tracking, which has some probably more determinism than
a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables
and breaking its entries on-demand IIRC, whereas Intel resembles something
closer to a 512 entries 'ring' with VMX PML, which tells what has been
dirtied.

>> I was wondering if container has a dirty scan/sync callback funnelled
>> by a vendor IOMMU ops implemented (as Shameerali patches proposed), 
> 
> Yes, this is almost certainly how the in-kernel parts will look
> 
/me nods

>> and vfio vendor driver provides one per device. 
> 
> But this is less clear..
> 
>> Or propagate the dirty tracking API to vendor vfio driver[*]. 
>> [*] considering the device may choose where to place its tracking storage, and
>> which scheme (bitmap, ring, etc) it might be.
> 
> This has been my thinking, yes
> 
/me nods

>> The reporting of the dirtying, though, looks hazzy to achieve if you
>> try to make it uniform even to userspace. Perhaps with iommu-fd
>> you're thinking to mmap() the dirty region back to userspace, or an
>> iommu-fd ioctl() updates the PTEs, while letting the kernel clear
>> the dirty status via the mmap() object. And that would be the common
>> API regardless of dirty-hw scheme. Anyway, just thinking out loud.
> 
> My general thinking has be that iommufd would control only the system
> IOMMU hardware. The FD interface directly exposes the iommu_domain as
> a manipulable object, so I'd imagine making userspace have a simple
> 1:1 connection to the iommu_ops of a single iommu_domain.
> 
> Doing this avoids all the weirdo questions about what do you do if
> there is non-uniformity in the iommu_domain's.
> 
> Keeping with that theme the vfio_device would provide a similar
> interface, on its own device FD.
> 
> I don't know if mmap should be involed here, the dirty bitmaps are not
> so big, I suspect a simple get_user_pages_fast() would be entirely OK.
> 
Considering that is 32MB of a bitmap per TB maybe it is cheap.

>>> VFIO proposed to squash everything
>>> into the container code, but I've been mulling about having iommufd
>>> only do system iommu and push the PCI device internal tracking over to
>>> VFIO.
>>>
>>
>> Seems to me that the juicy part falls mostly in IOMMU vendor code, I am
>> not sure yet how much one can we 'offload' to a generic layer, at least
>> compared with this other proposal.
> 
> Yes, I expect there is very little generic code here if we go this
> way. The generic layer is just marshalling the ioctl(s) to the iommu
> drivers. Certainly not providing storage or anything/
> 
Gotcha. Perhaps I need to sort out how this would work with iommufd.

But I guess we can hash out how the iommu ops would look like?

We seem to be on the same page, at least that's the feeling I get.

>> Give me some time (few days only, as I gotta sort some things) and I'll
>> respond here as follow up with link to a branch with the WIP/PoC patches.
> 
> Great!
>
Here it is. "A few days" turn into a week sorry :/

https://github.com/jpemartins/qemu  amd-iommu-hdsup-wip
https://github.com/jpemartins/linux  amd-vfio-iommu-hdsup-wip

Note, it is an early PoC. I still need to get the split/collapse thing going,
and fix the FIXMEs there, and have a second good look at the iommu page tables.

The Qemu link above has one patch to get device-dirty bitmaps via HMP to be able to
measure the rate by hw dirties and another patch for emulated amd iommu support too,
should you lack the hardware to play (I usually do on both, more for people to be
able to coverage test it). I sadly cannot do the end-to-end migration test.

>> 3) Dirty bit is sticky, hardware never clears it. Reading the access/dirty
>> bit is cheap, clearing them is 'expensive' because one needs to flush
>> IOTLB as the IOMMU hardware may cache the bits in the IOTLB as a result
>> of an address-translation/io-page-walk. Even though the IOMMU uses interlocked
>> operations to actually update the Access/Dirty bit in concurrency with
>> the CPU. The AMD manuals are a tad misleading as they talk about marking
>> non-present, but that would be catastrophic for migration as it would
>> mean a DMA target abort for the PCI device, unless I missed something obvious.
>> In any case, this means that the dirty bit *clearing* needs to be
>> batched as much as possible, to amortize the cost of flushing the IOTLB.
>> This is the same for Intel *IIUC*.
> 
> You have to mark it as non-present to do the final read out if
> something unmaps while the tracker is on - eg emulating a viommu or
> something. Then you mark non-present, flush the iotlb and read back
> the dirty bit.
> 
You would be surprised that AMD IOMMUs have also an accelerated vIOMMU
too :) without needing VMM intervention (that's also not supported
in Linux).

But the mark as NP for viommu is something I haven't investigated.

> Otherwise AFIAK, you flush the IOTLB to get the latest dirty bits and
> then read and clear them.
> 
It's the other way around AIUI. The dirty bits are sticky, so you flush
the IOTLB after clearing as means to notify the IOMMU to set the dirty bits
again on the next memory transaction (or ATS translation).

I am not entirely sure we need to unmap + mark non-present for non-viommu
That would actually mean something is not properly quiscieing the VF DMA.
Maybe we should .. to gate whether if we should actually continue with LM
if something kept doing DMA when it shouldn't have.

Also, considering the dirty-bitmap check happens on very-very long IOVA extents --
some measurements over how long the scans need to be done. Mostly
validated how much a much NIC was dirtying with traditional networking tools.

>> 4) Adjust the granularity of pagetables in place:
>> [This item wasn't done, but it is generic to any IOMMU because it
>> is mostly the ability to split existing IO pages in place.]
> 
> This seems like it would be some interesting amount of driver work,
> but yes it could be a generic new iommu_domina op.
> 
I am slightly at odds that .split and .collapse at .switch() are enough.
But, with iommu if we are working on top of an IOMMU domain object and
.split and .collapse are iommu_ops perhaps that looks to be enough
flexibility to give userspace the ability to decide what it wants to
split, if it starts eargerly/warming-up tracking dirty pages.

The split and collapsing is something I wanted to work on next, to get
to a stage closer to that of an RFC on the AMD side.

>> 4.b) Optionally starting dirtying earlier (at provisioning) and let
>> userspace dynamically split pages. This is to hopefully minimize the
>> IOTLB miss we induce ourselves in item 4.a) if we were to do eagerly.
>> So dirty tracking would be enabled at creation of the protection domain
>> after the vfio container is set up, and we would use pages dirtied
>> as a indication of what needs to be splited. Problem is for IO page
>> sizes bigger than 1G, which might unnecessarily lead to marking too
>> much as dirty early on; but at least it's better than transferring the
>> whole set.
> 
> I'm not sure running with dirty tracking permanently on would be good
> for guest performance either.
> 
Hmmm, judging how the IOMMU works I am not sure this is particularly
affecting DMA performance (not sure yet about RDMA, it's something I
curious to see how it gets to perform with 4K IOPTEs, and with dirty
tracking always enabled). Considering how the bits are sticky, and
unless CPU clears it, it's short of a nop? Unless of course the checking
for A^D during an atomic memory transaction is expensive. Needs some
performance testing nonetheless.

The IOTLB flushes pre/after clearing, tough, would visibly hurt DMA
performance :(

I forgot to mention, but the early enablement of IOMMU dirty tracking
was also meant to fully know since guest creation what needs to be
sent to the destination. Otherwise, wouldn't we need to send the whole
pinned set to destination, if we only start tracking dirty pages during
migration?

> I'd suspect you'd be better to have a warm up period where you track
> dirtys and split down pages.
> 
Yeah, but right now the splitting is done when switching on and off
dirty tracking, for the entirety of page tables.

And well at least on AMD, the flag is attached to a protection domain.
For SMMUv3. it appears that it's per IOVA range if I read the original code
correctly (which is interesting contrast to AMD/Intel)

Also, this is probably a differentiator for iommufd, if we were to provide
split and collapse semantics to IOMMU domain objects that userspace can use.
That would get more freedom, to switch dirty-tracking, and then do the warm
up thingie and piggy back on what it wants to split before migration.
perhaps the switch() should get some flag to pick where to split, I guess.

> It is interesting, this is a possible reason why device dirty tracking
> might actually perfom better because it can operate at a different
> granularity from the system iommu without disrupting the guest DMA
> performance.

I agree -- and the other advantage is that you don't depend on platform
support for dirty tracking, which is largely nonexistent.

Albeit I personally like to have the IOMMU support because it also
frees endpoints to do rather expensive dirty tracking, which is a rather
even more exclusive feature to have. It's trading off max performance for
LM commodity over, where DMA performance can be less (momentarily).

	Joao