linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Tian, Kevin" <kevin.tian@intel.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Yishai Hadas <yishaih@nvidia.com>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	"bhelgaas@google.com" <bhelgaas@google.com>,
	"saeedm@nvidia.com" <saeedm@nvidia.com>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"kuba@kernel.org" <kuba@kernel.org>,
	"leonro@nvidia.com" <leonro@nvidia.com>,
	"kwankhede@nvidia.com" <kwankhede@nvidia.com>,
	"mgurtovoy@nvidia.com" <mgurtovoy@nvidia.com>,
	"maorg@nvidia.com" <maorg@nvidia.com>,
	"Raj, Ashok" <ashok.raj@intel.com>,
	"shameerali.kolothum.thodi@huawei.com" 
	<shameerali.kolothum.thodi@huawei.com>
Subject: RE: [PATCH V7 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P
Date: Wed, 16 Feb 2022 02:52:55 +0000	[thread overview]
Message-ID: <BN9PR11MB5276BED0FC008B974B8750058C359@BN9PR11MB5276.namprd11.prod.outlook.com> (raw)
In-Reply-To: <20220215155602.GB1046125@nvidia.com>

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, February 15, 2022 11:56 PM
> 
> > Defining RUNNING_P2P in above way implies that RUNNING_P2P inherits
> > all behaviors in RUNNING except blocking outbound P2P:
> > 	* generate interrupts and DMAs
> > 	* respond to MMIO
> > 	* all vfio regions are functional
> > 	* device may advance its internal state
> > 	* drain and block outstanding P2P requests
> 
> Correct.
> 
> The device must be able to recieve and process any MMIO P2P
> transaction during this state.
> 
> We discussed and left interrupts as allowed behavior.
> 
> > I think this is not the intended behavior when NDMA was being discussed
> > in previous threads, as above definition suggests the user could continue
> > to submit new requests after outstanding P2P requests are completed
> given
> > all vfio regions are functional when the device is in RUNNING_P2P.
> 
> It is the desired behavior. The device must internally stop generating
> DMA from new work, it cannot rely on external things not poking it
> with MMIO, because the whole point of the state is that MMIO P2P is
> still allowed to happen.
> 
> What gets confusing is that in normal cases I wouldn't expect any P2P
> activity to trigger a new work submission.
> 
> Probably, since many devices can't implement this, we will end up with
> devices providing a weaker version where they do RUNNING_P2P but this
> relies on the VM operating the device "sanely" without programming P2P
> work submission. It is similar to your notion that migration requires
> guest co-operation in the vPRI case.
> 
> I don't like it, and better devices really should avoid requiring
> guest co-operation, but it seems like where things are going.

Make sense to me now. 

btw can disabling PCI bus master be a general means for devices which
don't have a way of blocking P2P to implement RUNNING_P2P? 

> 
> > Though just a naming thing, possibly what we really require is a
> STOPPING_P2P
> > state which indicates the device is moving to the STOP (or STOPPED)
> > state.
> 
> No, I've deliberately avoided STOP because this isn't anything like
> STOP. It is RUNNING with one restriction.

With above explanation I'm fine with it.

> 
> > In this state the device is functional but vfio regions are not so the user still
> > needs to restrict device access:
> 
> The device is not functional in STOP. STOP means the device does not
> provide working MMIO. Ie mlx5 devices will discard all writes and
> read all 0's when in STOP.

btw I used 'STOPPING' to indicate a transitional state between RUNNING
and STOP thus its definition could be defined separately from STOP. But 
it doesn't matter now.

> 
> The point of RUNNING_P2P is to allow the device to continue to recieve
> all MMIO while halting generation of MMIO to other devices.
> 
> > In virtualization this means Qemu must stop vCPU first before entering
> > STOPPING_P2P for a device.
> 
> This is already the case. RUNNING/STOP here does not refer to the
> vCPU, it refers to this device.

I know that point. Originally I thought that having 'RUNNING' in RUNNING_P2P
implies that vCPU doesn't need to be stopped first given all vfio regions are
functional. But now I think the rationale is clear. If guest-operation exists
then vCPU can be active when entering RUNNING_P2P since the guest will
guarantee no P2P submission (via vCPU or via P2P). Otherwise vCPU must be 
stopped first to block potential P2P work submissions as a brute-force operation.

> 
> > Back to your earlier suggestion on reusing RUNNING_P2P to cover vPRI
> > usage via a new capability bit [1]:
> >
> >     "A cap like "running_p2p returns an event fd, doesn't finish until the
> >     VCPU does stuff, and stops pri as well as p2p" might be all that is
> >     required here (and not an actual new state)"
> >
> > vPRI requires a RUNNING semantics. A new capability bit can change
> > the behaviors listed above for STOPPING_P2P to below:
> > 	* both P2P and vPRI requests should be drained and blocked;
> > 	* all vfio regions are functional (with a RUNNING behavior) so
> > 	  vCPUs can continue running to help drain vPRI requests;
> > 	* an eventfd is returned for the user to poll-wait the completion
> > 	  of state transition;
> 
> vPRI draining is not STOP either. If the device is expected to provide
> working MMIO it is not STOP by definition.
> 
> > One additional requirement in driver side is to dynamically mediate the
> > fast path and queue any new request which may trigger vPRI or P2P
> > before moving out of RUNNING_P2P. If moving to STOP_COPY, then
> > queued requests will also be included as device state to be replayed
> > in the resuming path.
> 
> This could make sense. I don't know how you dynamically mediate
> though, or how you will trap ENQCMD..

Qemu can ask KVM to temporarily clear EPT mapping of the cmd portal 
to enable mediation on src and then restore the mapping before resuming
vCPU on dest. In our internal POC the cmd portal address is hard coded
in Qemu which is not good. Possibly we need a general mechanism so
migration driver which supports vPRI and extended RUNNING_P2P behavior
can report to the user a list of pages which must be accessed via read()/
write() instead of mmap when the device is in RUNNING_P2P and vCPUs
are active. Based on that information Qemu can zap related EPT mappings 
before moving the device to RUNNING_P2P.

> 
> > Does above sound a reasonable understanding of this FSM mechanism?
> 
> Other than mis-using the STOP label, it is close yes.
> 

Thanks
Kevin

  reply	other threads:[~2022-02-16  2:53 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-07 17:22 [PATCH V7 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 01/15] PCI/IOV: Add pci_iov_vf_id() to get VF index Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 02/15] net/mlx5: Reuse exported virtfn index function call Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 03/15] net/mlx5: Disable SRIOV before PF removal Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 04/15] PCI/IOV: Add pci_iov_get_pf_drvdata() to allow VF reaching the drvdata of a PF Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 05/15] net/mlx5: Expose APIs to get/put the mlx5 core device Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 06/15] net/mlx5: Introduce migration bits and structures Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 07/15] vfio: Have the core code decode the VFIO_DEVICE_FEATURE ioctl Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 08/15] vfio: Define device migration protocol v2 Yishai Hadas
2022-02-09  0:07   ` Alex Williamson
2022-02-09  2:36     ` Jason Gunthorpe
2022-02-15 10:41       ` Tian, Kevin
2022-02-15 16:04         ` Jason Gunthorpe
2022-02-15 23:32           ` Alex Williamson
2022-02-16  1:17             ` Jason Gunthorpe
2022-02-16  3:17           ` Tian, Kevin
2022-02-16 12:14             ` Jason Gunthorpe
2022-02-17  2:29               ` Tian, Kevin
2022-02-15 10:58       ` Tian, Kevin
2022-02-15 13:13         ` Jason Gunthorpe
2022-02-15  8:04   ` Tian, Kevin
2022-02-15 15:33     ` Jason Gunthorpe
2022-02-16  3:04       ` Tian, Kevin
2022-02-07 17:22 ` [PATCH V7 mlx5-next 09/15] vfio: Extend the device migration protocol with RUNNING_P2P Yishai Hadas
2022-02-15 10:18   ` Tian, Kevin
2022-02-15 15:56     ` Jason Gunthorpe
2022-02-16  2:52       ` Tian, Kevin [this message]
2022-02-16 12:11         ` Jason Gunthorpe
2022-02-07 17:22 ` [PATCH V7 mlx5-next 10/15] vfio: Remove migration protocol v1 documentation Yishai Hadas
2022-02-11 11:03   ` Cornelia Huck
2022-02-07 17:22 ` [PATCH V7 mlx5-next 11/15] vfio/mlx5: Expose migration commands over mlx5 device Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 12/15] vfio/mlx5: Implement vfio_pci driver for mlx5 devices Yishai Hadas
2022-02-09  0:07   ` Alex Williamson
2022-02-07 17:22 ` [PATCH V7 mlx5-next 13/15] vfio/pci: Expose vfio_pci_core_aer_err_detected() Yishai Hadas
2022-02-07 17:22 ` [PATCH V7 mlx5-next 14/15] vfio/mlx5: Use its own PCI reset_done error handler Yishai Hadas
2022-02-09  0:08   ` Alex Williamson
2022-02-09  2:39     ` Jason Gunthorpe
2022-02-10 16:48       ` Alex Williamson
2022-02-10 17:27         ` Jason Gunthorpe
2022-02-07 17:22 ` [PATCH V7 mlx5-next 15/15] vfio: Extend the device migration protocol with PRE_COPY Yishai Hadas
2022-02-17 17:15   ` Alex Williamson
2022-02-18  0:03     ` Jason Gunthorpe
2022-02-18  8:01   ` Tian, Kevin
2022-02-18 14:06     ` Jason Gunthorpe
2022-02-22  1:43       ` Tian, Kevin
2022-02-22 15:50         ` Jason Gunthorpe
2022-02-23  0:40           ` Tian, Kevin
2022-02-23  0:44             ` Jason Gunthorpe
2022-02-23  1:46               ` Tian, Kevin
2022-02-18  8:11 ` [PATCH V7 mlx5-next 00/15] Add mlx5 live migration driver and v2 migration protocol Tarun Gupta (SW-GPU)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=BN9PR11MB5276BED0FC008B974B8750058C359@BN9PR11MB5276.namprd11.prod.outlook.com \
    --to=kevin.tian@intel.com \
    --cc=alex.williamson@redhat.com \
    --cc=ashok.raj@intel.com \
    --cc=bhelgaas@google.com \
    --cc=jgg@nvidia.com \
    --cc=kuba@kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=kwankhede@nvidia.com \
    --cc=leonro@nvidia.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=maorg@nvidia.com \
    --cc=mgurtovoy@nvidia.com \
    --cc=netdev@vger.kernel.org \
    --cc=saeedm@nvidia.com \
    --cc=shameerali.kolothum.thodi@huawei.com \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).