From: Jason Gunthorpe <jgg@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>,
	linux-doc@vger.kernel.org, Cornelia Huck <cohuck@redhat.com>,
	kvm@vger.kernel.org, Kirti Wankhede <kwankhede@nvidia.com>,
	Max Gurtovoy <mgurtovoy@nvidia.com>,
	Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>,
	Yishai Hadas <yishaih@nvidia.com>
Subject: Re: [PATCH RFC v2] vfio: Documentation for the migration region
Date: Tue, 30 Nov 2021 23:14:07 -0400	[thread overview]
Message-ID: <20211201031407.GG4670@nvidia.com> (raw)
In-Reply-To: <20211130153541.131c9729.alex.williamson@redhat.com>

On Tue, Nov 30, 2021 at 03:35:41PM -0700, Alex Williamson wrote:

> > From what HNS said the device driver would have to trap every MMIO to
> > implement NDMA as it must prevent touches to the physical HW MMIO to
> > maintain the NDMA state.
> > 
> > The issue is that the HW migration registers can stop processing the
> > queue and thus enter NDMA but a MMIO touch can resume queue
> > processing, so NDMA cannot be sustained.
> > 
> > Trapping every MMIO would have a huge negative performance impact.  So
> > it doesn't make sense to do so for a device that is not intended to be
> > used in any situation where NDMA is required.
> 
> But migration is a cooperative activity with userspace.  If necessary
> we can impose a requirement that mmap access to regions (other than the
> migration region itself) are dropped when we're in the NDMA or !RUNNING
> device_state.  

It is always NDMA|RUNNING, so we can't fully drop access to
MMIO. Userspace would have to transfer from direct MMIO to
trapping. With enough new kernel infrastructure and qemu support it
could be done.

Even so, we can't trap accesses through the IOMMU, so such a scheme
would still require removing IOMMU access to the device. Given that the
basic qemu mitigation for no NDMA support is to eliminate P2P cases by
removing the IOMMU mappings, this doesn't seem to advance anything and
only creates complexity.

At least I'm not going to insist that hns do all kinds of work like
this for an edge case they don't care about as a precondition to
getting a migration driver.

> There's no reason that mediation while in the NDMA state needs to
> impose any performance penalty against the default RUNNING state. 

Eh? Mitigation of no NDMA support would have to mediate the MMIO on a
performance-critical doorbell path; there is no escaping a performance
hit. I'm not sure what you mean.

> > > Some discussion of this requirement would be useful in the doc,
> > > otherwise it seems easier to deprecate the v1 migration region
> > > sub-type, and increment to a v2 where NDMA is a required feature.  
> > 
> > I can add some words at the bottom, but since NDMA is a completely
> > transparent optional feature I don't see any reason to have a v2.
> 
> It's hardly transparent, aiui userspace is going to need to impose a
> variety of loosely defined restrictions for devices without NDMA
> support.  It would be far easier if we could declare NDMA support to be
> a requirement.

It would make userspace a bit simpler at the cost of excluding or
complicating devices like hns for a use case they don't care about.

On the other hand, the simple solution in qemu, when there is no
universal NDMA, is to simply not include any MMIO ranges in the
IOMMU.

> As I think Connie also had trouble with, combining device_state with
> IOMMU migration features and VMM state, without any preceding context
> and visual cues makes the section confusing.  I did gain context as I
> read further though the doc, but I also had the advantage of being
> rather familiar with the topic.  Maybe a table format would help to
> segment the responsibilities?

I moved the context to the bottom exactly because Connie said it was
confusing at the start. :)

This is an RST document so I'm not keen to make huge formatting
adventures for minimal readability gain.

I view this as something that probably needs to be read a few times,
along with the code and header files, for someone brand new to
understand. I'm OK with that; it is roughly consistent with kernel docs
of this level.

What I would like is if userspace focused readers can get their
important bits of information with less work.

> > It is existing behavior of qemu - which is why we documented it.
> 
> QEMU resets devices as part of initializing the VM, but I don't see
> that QEMU specifically resets a device in order to transition it to
> the RESUMING device_state. 

We instrumented the kernel and monitored qemu, it showed up on the
resume traces.

> > Either qemu shouldn't do it as devices must fully self-reset, or we
> > should have it part of the canonical flow and devices may as well
> > expect it. It is useful because post VFIO_DEVICE_RESET all DMA is
> > quiet, no outstanding PRIs exist, etc etc.
> 
> It's valid for QEMU to reset the device any time it wants, saying that
> it cannot perform a reset before transitioning to the RESUMING state is
> absurd.  Userspace can do redundant things for their own convenience.

I didn't say cannot, I said it shouldn't.

Since qemu is the only implementation, it would be easy for drivers to
come to rely on the implicit reset it seems to do; either way, this
seems an important point that should be written down.

I don't have a particular requirement to have the reset, but it does
seem like a good idea. If you feel strongly, then let's say the
opposite: that the driver must enter RESUMING with no preconditions,
doing an internal reset if required.

> We don't currently specify any precondition for a device to enter the
> RESUMING state.  The driver can of course nak the state change with an
> errno, or hard nak it with an errno and ERROR device_state, which would
> require userspace to make use of VFIO_DEVICE_RESET.

I don't think we should be relying on every driver doing something
totally different on the standard path. That is only going to hurt
interoperability.

> > > As with the previous flows, it seems like there's a ton of implicit
> > > knowledge here.  Why are we documenting these here rather than in the
> > > uAPI header?  
> > 
> > Because this is 300 lines already and is too complicated/long to
> > properly live in a uapi header.
> 
> Minimally we need to resolve that this document must be consistent with
> the uAPI.  I'm not sure that's entirely the case in this draft.

Can you point to something please? I can't work with "I'm not sure".

IMO the header file doesn't really say much and can be read in a way
that is consistent with this more specific document.

> >  - qemu doesn't support P2P cases due to the NDMA topic
> 
> Or rather QEMU does support p2p cases regardless of the NDMA topic.

I mean support in a way that is actually usable, as without NDMA it
corrupts the VM when it migrates it.

> >  - simple devices like HNS will work, but not robustly in the face of
> >    a hostile VM and multiple VFIO devices.
> 
> So what's the goal here, are we trying to make the one currently
> implemented and unsupported userspace be the gold standard to which
> drivers should base their implementation?  

I have no idea anymore. You asked for docs and a complete picture as a
precondition for merging a driver. Here it is.

What do you want?

> We've tried to define a specification that's more flexible than a
> single implementation and by these standards we seem to be flipping
> that implementation back into the specification.

What specification!?! All we have is a couple of lines in a header file
that is nowhere near detailed enough for multi-driver
interoperability with userspace. You have no idea how much effort has
been expended to get this far based on the few breadcrumbs that were
left, and we have access to the team that made the only other
implementation!

*flexible* is not a specification.

> Userspace can attempt RESUMING -> RUNNING regardless of what we specify,
> so a driver needs to be prepared for such an attempted state change
> either way.  So what's the advantage to telling a driver author that
> they can expect a given behavior?

The above didn't tell a driver author to expect a certain behavior, it
tells userspace what to do.

> It doesn't make much sense to me to glue two separate userspace
> operations together to say these must be done in this sequence, back to
> back.  If we want the device to be reset in order to enter RESUMING, the
> driver should simply reset the device as necessary during the state
> transition.  The outward effect to the user is to specify that device
> internal state may not be retained on transition from RUNNING ->
> RESUMING.

Maybe, and I'm happy if you want to specify this instead. It just
doesn't match what we observe qemu to be doing.

> > Do you have an alternative language? This is quite complicated, I
> > advise people to refer to mlx5's implementation.
> 
> I agree with Connie on this, if the reader of the documentation needs
> to look at a specific driver implementation to understand the
> reasoning, the documentation has failed.  

Let's agree on some objective here: this is not trying to be fully
comprehensive, or fully standalone. It is intended to drive agreement,
be informative to userspace, and be supplemental to the actual code.

> If it can be worked out by looking at the device_state write
> function of the mlx5 driver, then surely a sentence or two for each
> priority item can be added here.

Please give me a suggestion then, because I don't know what will help
here.

> Part of the problem is that the nomenclature is unclear, we're listing
> bit combinations, but not the changed bit(s) and we need to infer the
> state.

Each line lists the new state, the changed bits are thus any bits that
make up the new state.

If you look at how mlx5 is constructed, each if has a 'did it change'
test followed by 'what state is it in now'.

So the document is read as listing the order in which the driver enters
the new states. I clarified it as "must process the new device_state
bits in a priority order".

> flips in the presence of an existing state.  I'm not able to obviously
> map the listing above to the latest posted version of the mlx5 driver.

One of the things we've done is align mlx5 more clearly with this. For
instance, it no longer has a mixture of new state and old state in the
if statements; it always tests the new state, so the tests logically
follow what is written here.

Stripping away the excess the expressions now look like this:

 !(state & VFIO_DEVICE_STATE_RUNNING)
 ((state & (VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING)) == VFIO_DEVICE_STATE_SAVING)
 (state & VFIO_DEVICE_STATE_RESUMING)

Which mirror what is written here.

> > > > +  As Peer to Peer DMA is a MMIO touch like any other, it is important that
> > > > +  userspace suspend these accesses before entering any device_state where MMIO
> > > > +  is not permitted, such as !RUNNING. This can be accomplished with the NDMA
> > > > +  state. Userspace may also choose to remove MMIO mappings from the IOMMU if the
> > > > +  device does not support NDMA and rely on that to guarantee quiet MMIO.  
> > > 
> > > Seems that would have its own set of consequences.  
> > 
> > Sure, userspace has to make choices here.
> 
> It seems a bit loaded to suggest an alternative choice if it's not
> practical or equivalent.  Maybe it's largely the phrasing, I read
> "remove MMIO mappings" as to drop them dynamically, when I think we've
> discussed that userspace might actually preclude these mappings for
> non-NDMA devices such that p2p DMA cannot exist, ever.

I mean the latter. How about "never install MMIO mappings" ?

> > Overall it must work in this basic way, and devices have freedom about
> > what internal state they can/will log. There is just a clear division
> > that every internal state in the first step is either immutable or
> > logged, and that the second step is a delta over the first.
> 
> I agree that it's a reasonable approach, though as I read the proposed
> text, there's no mention of immutable state and no reason a driver
> would implement a dirty log for immutable state, therefore we seem to
> be suggesting such data for the stop-and-copy phase when it would
> actually be preferable to include it in pre-copy.

I'd say that is a detail we don't need to discuss/define; it has no
userspace-visible consequence.

> I think the fact that a user is not required to run the pre-copy
> phase until completion is also noteworthy.

This text doesn't try to detail how the migration window works, that
is a different large task. The intention is that the migration window
must be fully drained to be successful.

I added this for some clarity: "The entire migration data, up to each
end of stream, must be transported from the saving to the resuming
side."

> > Yishai has a patch already to add NDMA to mlx5, it will come in the
> > next iteration once we can agree on this document. qemu will follow
> > sometime later.
> 
> So it's not really a TBD, it's resolved in a uAPI update that will be
> included with the next revision?  Thanks,

There is a patch, yes; the TBD here is to include a few words about how
to detect NDMA.

Jason
