RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state

From: "Tian, Kevin" <kevin.tian@intel.com>
To: Jason Gunthorpe <jgg@nvidia.com>,
	Alex Williamson <alex.williamson@redhat.com>
Cc: "cohuck@redhat.com" <cohuck@redhat.com>,
	"corbet@lwn.net" <corbet@lwn.net>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"farman@linux.ibm.com" <farman@linux.ibm.com>,
	"mjrosato@linux.ibm.com" <mjrosato@linux.ibm.com>,
	"pasic@linux.ibm.com" <pasic@linux.ibm.com>
Subject: RE: [RFC PATCH] vfio: Update/Clarify migration uAPI, add NDMA state
Date: Mon, 10 Jan 2022 07:55:16 +0000	[thread overview]
Message-ID: <BN9PR11MB52767CB9E4C30143065549C98C509@BN9PR11MB5276.namprd11.prod.outlook.com> (raw)
In-Reply-To: <20220106212057.GM2328285@nvidia.com>

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, January 7, 2022 5:21 AM
> 
> We were also thinking to retain STOP. SAVING -> STOP could possibly
> serve as the abort path to avoid a double action, and some of the use
> cases you ID'd below are achievable if STOP remains.

what is the exact difference between a null state {} (implying !RUNNING)
and STOP in this context?

If they are different, who (user or driver) should conduct and when do 
we expect transitioning the device into a null state?

> 
> > We have 20 possible transitions.  I've marked those available via the
> > "odd ascii art" diagram as (a), that's 7 transitions.  We could
> > essentially remove the NULL state as unreachable as there seems little
> > value in the 2 transitions marked (a)* if we look only at migration,
> > that would bring us down to 5 of 12 possible transitions.  We need to
> > give userspace an abort path though, so we minimally need the 2
> > transitions marked (b) (7/12).
> 
> > So now we can discuss the remaining 5 transitions:
> >
> > {SAVING} -> {RESUMING}
> > 	If not supported, user can achieve this via:
> > 		{SAVING}->{RUNNING}->{RESUMING}
> > 		{SAVING}-RESET->{RUNNING}->{RESUMING}
> 
> This can be:
> 
> SAVING -> STOP -> RESUMING

From Alex's original description the default device state is RUNNING.
This supposed to be the initial state on the dest machine for the
device assigned to Qemu before Qemu resumes the device state.
Then how do we eliminate the RUNNING state in above flow? Who
makes STOP as the initial state on the dest node?

> > drivers follow the previously provided pseudo algorithm with the
> > requirement that they cannot pass through an invalid state, we need to
> > formally address whether the NULL state is invalid or just not
> > reachable by the user.
> 
> What is a NULL state?

Hah, seems I'm not the only one having this confusion. 😊

> 
> We have defined (from memory, forgive me I don't have access to
> Yishai's latest code at the moment) 8 formal FSM states:
> 
>  RUNNING
>  PRECOPY
>  PRECOPY_NDMA
>  STOP_COPY
>  STOP
>  RESUMING
>  RESUMING_NDMA
>  ERROR (perhaps MUST_RESET)

ERROR->SHUTDOWN? Usually a shutdown state implies reset required...

> 
> > But I think you've identified two classes of DMA, MSI and everything
> > else.  The device can assume that an MSI is special and not included in
> > NDMA, but it can't know whether other arbitrary DMAs are p2p or memory.
> > If we define that the minimum requirement for multi-device migration is
> > to quiesce p2p DMA, ex. by not allowing it at all, then NDMA is
> > actually significantly more restrictive while it's enabled.
> 
> You are right, but in any practical physical device NDMA will be
> implemented by halting all DMAs, not just p2p ones.
> 
> I don't mind what we label this, so long as we understand that halting
> all DMA is a valid device implementation.
> 
> Actually, having reflected on this now, one of the things on my list
> to fix in iommufd, is that mdevs can get access to P2P pages at all.
> 
> This is currently buggy as-is because they cannot DMA map these
> things, touch them with the CPU and kmap, or do, really, anything with
> them.

Can you elaborate why mdev cannot access p2p pages?

> 
> So we should be blocking mdev's from accessing P2P, and in that case a
> mdev driver can quite rightly say it doesn't support P2P at all and
> safely NOP NDMA if NDMA is defined to only impact P2P transactions.
> 
> Perhaps that answers the question for the S390 drivers as well.
> 
> > Should a device in the ERROR state continue operation or be in a
> > quiesced state?  It seems obvious to me that since the ERROR state is
> > essentially undefined, the device should cease operations and be
> > quiesced by the driver.  If the device is continuing to operate in the
> > previous state, why would the driver place the device in the ERROR
> > state?  It should have returned an errno and left the device in the
> > previous state.
> 
> What we found while implementing is the use of ERROR arises when the
> driver has been forced to do multiple actions and is unable to fully
> unwind the state. So the device_state is not fully the original state
> or fully the target state. Thus it is ERROR.
> 
> The additional requirement that the driver do another action to
> quiesce the device only introduces the possiblity for triple failure.
> 
> Since it doesn't bring any value to userspace, I prefer we not define
> things in this complicated way.

So ERROR is really about unrecoverable failures. If recoverable suppose
errno should have been returned and the device is still in the original
state. Is this understanding correct?

btw which errno indicates to the user that the device is back to the
original state or in the ERROR state? or want the user to always check
the device state upon any transition error?

> 
> > > I prefer a model where the device is allowed to keep doing whatever it
> > > was doing before it hit the error. You are pushing for a model where
> > > upon error we must force the device to stop.
> >
> > If the device continues operating in the previous mode then it
> > shouldn't enter the ERROR state, it should return errno and re-reading
> > device_state should indicate it's in the previous state.
> 
> Continues operating in the new/previous state is an 'upper bound' on
> what it is allowed to do. For instance if we are going from RUNNING ->
> SAVING mlx5 must transit through 'RUNNING|NDMA' as part of its
> internal design.
> 
> If it then fails it can't continue to pretend it is RUNNING when it is
> doing RUNNING|NDMA and a double failure means we can't restore
> RUNNING.
> 
> Allowing ERROR to continue any behavior allowed by RUNNING allows the
> device to be left in RUNNING|NDMA and eliminates the possiblity of
> triple failure in a transition to 'STOP'.
> 
> Indeed we can simplify the driver by removing failure recoveries for
> cases that have a double fault and simply go to ERROR. This is not so
> viable if ERROR itself requires an action to enter it as we get back
> to having to deal with double and triple faults.

Or have a way for the user to specify the policy when entering ERROR,
e.g. asking the migration driver to conduct an internal reset if undefined
behavior in this state is a concern and the user doesn't want to further
diagnose the device context?

Thanks
Kevin