Re: [PATCH V7 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P

From: Jason Gunthorpe <jgg@nvidia.com>
To: "Tian, Kevin" <kevin.tian@intel.com>
Cc: Yishai Hadas <yishaih@nvidia.com>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	"bhelgaas@google.com" <bhelgaas@google.com>,
	"saeedm@nvidia.com" <saeedm@nvidia.com>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"kuba@kernel.org" <kuba@kernel.org>,
	"leonro@nvidia.com" <leonro@nvidia.com>,
	"kwankhede@nvidia.com" <kwankhede@nvidia.com>,
	"mgurtovoy@nvidia.com" <mgurtovoy@nvidia.com>,
	"maorg@nvidia.com" <maorg@nvidia.com>,
	"Raj, Ashok" <ashok.raj@intel.com>,
	"shameerali.kolothum.thodi@huawei.com" 
	<shameerali.kolothum.thodi@huawei.com>
Subject: Re: [PATCH V7 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
Date: Tue, 15 Feb 2022 11:56:02 -0400	[thread overview]
Message-ID: <20220215155602.GB1046125@nvidia.com> (raw)
In-Reply-To: <BN9PR11MB5276D169554630B8345DB7598C349@BN9PR11MB5276.namprd11.prod.outlook.com>

On Tue, Feb 15, 2022 at 10:18:11AM +0000, Tian, Kevin wrote:
> > From: Yishai Hadas <yishaih@nvidia.com>
> > Sent: Tuesday, February 8, 2022 1:22 AM
> > 
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > 
> > The RUNNING_P2P state is designed to support multiple devices in the same
> > VM that are doing P2P transactions between themselves. When in
> > RUNNING_P2P
> > the device must be able to accept incoming P2P transactions but should not
> > generate outgoing transactions.
> 
> outgoing 'P2P' transactions.

Yes

> > As an optional extension to the mandatory states it is defined as
> > inbetween STOP and RUNNING:
> >    STOP -> RUNNING_P2P -> RUNNING -> RUNNING_P2P -> STOP
> > 
> > For drivers that are unable to support RUNNING_P2P the core code silently
> > merges RUNNING_P2P and RUNNING together. Drivers that support this will
> 
> It would be clearer if following message could be also reflected here:
> 
>   + * The optional states cannot be used with SET_STATE if the device does not
>   + * support them. The user can discover if these states are supported by using
>   + * VFIO_DEVICE_FEATURE_MIGRATION. 
> 
> Otherwise the original context reads like RUNNING_P2P can be used as
> end state even if the underlying driver doesn't support it then makes me
> wonder what is the point of the new capability bit.

You've read it right. Lets just add a simple "Unless driver support is
present the new state cannot be used in SET_STATE"

> >  	*next_fsm = vfio_from_fsm_table[cur_fsm][new_fsm];
> > +	while ((state_flags_table[*next_fsm] & device->migration_flags) !=
> > +			state_flags_table[*next_fsm])
> > +		*next_fsm = vfio_from_fsm_table[*next_fsm][new_fsm];
> > +
> 
> A comment highlighting the silent merging of unsupported states would
> be informative here.

	/*
	 * Arcs touching optional and unsupported states are skipped over. The
	 * driver will instead  see an arc from the original state to the next
	 * logical state, as per the above comment.
	 */

> Defining RUNNING_P2P in above way implies that RUNNING_P2P inherits 
> all behaviors in RUNNING except blocking outbound P2P:
> 	* generate interrupts and DMAs
> 	* respond to MMIO
> 	* all vfio regions are functional
> 	* device may advance its internal state
> 	* drain and block outstanding P2P requests

Correct.

The device must be able to recieve and process any MMIO P2P
transaction during this state.

We discussed and left interrupts as allowed behavior.

> I think this is not the intended behavior when NDMA was being discussed
> in previous threads, as above definition suggests the user could continue
> to submit new requests after outstanding P2P requests are completed given
> all vfio regions are functional when the device is in RUNNING_P2P.

It is the desired behavior. The device must internally stop generating
DMA from new work, it cannot rely on external things not poking it
with MMIO, because the whole point of the state is that MMIO P2P is
still allowed to happen.

What gets confusing is that in normal cases I wouldn't expect any P2P
activity to trigger a new work submission.

Probably, since many devices can't implement this, we will end up with
devices providing a weaker version where they do RUNNING_P2P but this
relies on the VM operating the device "sanely" without programming P2P
work submission. It is similar to your notion that migration requires
guest co-operation in the vPRI case.

I don't like it, and better devices really should avoid requiring
guest co-operation, but it seems like where things are going.

> Though just a naming thing, possibly what we really require is a STOPPING_P2P
> state which indicates the device is moving to the STOP (or STOPPED)
> state.

No, I've deliberately avoided STOP because this isn't anything like
STOP. It is RUNNING with one restriction.

> In this state the device is functional but vfio regions are not so the user still
> needs to restrict device access:

The device is not functional in STOP. STOP means the device does not
provide working MMIO. Ie mlx5 devices will discard all writes and
read all 0's when in STOP.

The point of RUNNING_P2P is to allow the device to continue to recieve
all MMIO while halting generation of MMIO to other devices.

> In virtualization this means Qemu must stop vCPU first before entering
> STOPPING_P2P for a device.

This is already the case. RUNNING/STOP here does not refer to the
vCPU, it refers to this device.

> Back to your earlier suggestion on reusing RUNNING_P2P to cover vPRI 
> usage via a new capability bit [1]:
> 
>     "A cap like "running_p2p returns an event fd, doesn't finish until the
>     VCPU does stuff, and stops pri as well as p2p" might be all that is
>     required here (and not an actual new state)"
> 
> vPRI requires a RUNNING semantics. A new capability bit can change 
> the behaviors listed above for STOPPING_P2P to below:
> 	* both P2P and vPRI requests should be drained and blocked;
> 	* all vfio regions are functional (with a RUNNING behavior) so
> 	  vCPUs can continue running to help drain vPRI requests;
> 	* an eventfd is returned for the user to poll-wait the completion
> 	  of state transition;

vPRI draining is not STOP either. If the device is expected to provide
working MMIO it is not STOP by definition.

> One additional requirement in driver side is to dynamically mediate the 
> fast path and queue any new request which may trigger vPRI or P2P
> before moving out of RUNNING_P2P. If moving to STOP_COPY, then
> queued requests will also be included as device state to be replayed
> in the resuming path.

This could make sense. I don't know how you dynamically mediate
though, or how you will trap ENQCMD..

> Does above sound a reasonable understanding of this FSM mechanism? 

Other than mis-using the STOP label, it is close yes.

> > + * The optional states cannot be used with SET_STATE if the device does not
> > + * support them. The user can disocver if these states are supported by
> 
> 'disocver' -> 'discover'

Yep, thanks

Jason