From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "Neo Jia" <cjia@nvidia.com>,
	"Juan Quintela" <quintela@redhat.com>,
	"Cornelia Huck" <cohuck@redhat.com>,
	qemu-devel@nongnu.org, "Kirti Wankhede" <kwankhede@nvidia.com>,
	"Philippe Mathieu-Daudé" <philmd@redhat.com>
Subject: Re: [RFC PATCH for-QEMU-5.2] vfio: Make migration support experimental
Date: Tue, 10 Nov 2020 09:10:37 +0000
Message-ID: <20201110091037.GA3108@work-vm>
In-Reply-To: <20201109132950.6c2dfe02@w520.home>

* Alex Williamson (alex.williamson@redhat.com) wrote:
> On Mon, 9 Nov 2020 19:44:17 +0000
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> > * Alex Williamson (alex.williamson@redhat.com) wrote:
> > > Per the proposed documentation for vfio device migration:
> > > 
> > >   Dirty pages are tracked when the device is in the stop-and-copy
> > >   phase, because if pages are marked dirty during the pre-copy phase
> > >   and their content is transferred from source to destination, there
> > >   is no way to know which pages are newly dirtied after the point
> > >   they were copied, until the device stops.  To avoid repeated copies
> > >   of the same content, pinned pages are marked dirty only during the
> > >   stop-and-copy phase.
> > > 
> > > Essentially, since we don't have hardware dirty page tracking for
> > > assigned devices at this point, we consider any page that is pinned
> > > by an mdev vendor driver or pinned and mapped through the IOMMU to
> > > be perpetually dirty.  In the worst case, this may result in all of
> > > guest memory being considered dirty during every iteration of live
> > > migration.  The current vfio implementation of migration has chosen
> > > to mask device dirtied pages until the final stages of migration in
> > > order to avoid this worst case scenario.
> > > 
> > > Allowing the device to implement a policy decision to prioritize
> > > reduced migration data like this jeopardizes QEMU's overall ability
> > > to implement any degree of service level guarantees during migration.
> > > For example, any estimates towards achieving acceptable downtime
> > > margins cannot be trusted when such a device is present.  The vfio
> > > device should participate in dirty page tracking to the best of its
> > > ability throughout migration, even if that means the dirty footprint
> > > of the device impedes migration progress, allowing both QEMU and
> > > higher level management tools to decide whether to continue the
> > > migration or abort due to failure to achieve the desired behavior.  
> > 
> > I don't feel particularly badly about the decision to squash it in
> > during the stop-and-copy phase; for devices where the pinned memory
> > is large, I don't think doing it during the main phase makes much sense;
> > especially if you then have to deal with tracking changes in pinning.
> 
> 
> AFAIK the kernel support for tracking changes in page pinning already
> exists; it's largely the vfio device code in QEMU that decides when to
> start exposing the device's dirty footprint to QEMU.  I'm a bit
> surprised by this answer though; we don't really know what the device
> memory footprint is.  It might be large, it might be nothing, but by
> not participating in dirty page tracking until the VM is stopped, we
> can't know what the footprint is or how it will affect downtime.  Is it
> really the place of a QEMU device driver to impose this sort of policy?

If it could actually track changes then I'd agree we shouldn't impose
any policy; but if it's just marking the whole area as dirty, we're going
to need a bodge somewhere, and this bodge doesn't look any worse than the
others to me.

> 
> > Having said that, I agree with marking it as experimental because,
> > for the same reason, I'm dubious about how useful it will be: I worry
> > that the downtime will be so large as to make it pointless.
> 
> 
> TBH I think that's the wrong reason to mark it experimental.  There's
> clearly demand for vfio device migration and even if the practical use
> cases are initially small, they will expand over time and hardware will
> get better.  My objection is that the current behavior masks the
> hardware and device limitations, leading to unrealistic expectations.
> If the user expects minimal downtime, configures convergence to account
> for that, QEMU thinks it can achieve it, and then the device marks
> everything dirty, that's not supportable. 

Yes, agreed.

> OTOH if the vfio device
> participates in dirty tracking through pre-copy, then the practical use
> cases will sort themselves out: migrations will either be aborted
> because downtime tolerances cannot be achieved, or downtimes will be
> configured to match reality.  Thanks,

Without a way to prioritise the unpinned memory during that period,
we're going to be repeatedly sending the pinned memory, which is going
to lead to much larger bandwidth usage than required; so that's going in
completely the wrong direction, and it's also wrong from the user's
point of view.

Dave

> 
> Alex
> 
> > Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > 
> > > Link: https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg00807.html
> > > Cc: Kirti Wankhede <kwankhede@nvidia.com>
> > > Cc: Neo Jia <cjia@nvidia.com>
> > > Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
> > > Cc: Juan Quintela <quintela@redhat.com>
> > > Cc: Philippe Mathieu-Daudé <philmd@redhat.com>
> > > Cc: Cornelia Huck <cohuck@redhat.com>
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > > 
> > > Given that our discussion in the link above seems to be going in
> > > circles, I'm afraid it seems necessary both to have a contingency
> > > plan and to raise the visibility of the current behavior to
> > > determine whether others agree that this is a sufficiently
> > > troubling behavior to consider migration support experimental
> > > at this stage.  Please voice your opinion or contribute patches
> > > to resolve this before QEMU 5.2.  Thanks,
> > > 
> > > Alex
> > > 
> > >  hw/vfio/migration.c           |    2 +-
> > >  hw/vfio/pci.c                 |    2 ++
> > >  include/hw/vfio/vfio-common.h |    1 +
> > >  3 files changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > > index 3ce285ea395d..cd44d465a50b 100644
> > > --- a/hw/vfio/migration.c
> > > +++ b/hw/vfio/migration.c
> > > @@ -882,7 +882,7 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
> > >      Error *local_err = NULL;
> > >      int ret = -ENOTSUP;
> > >  
> > > -    if (!container->dirty_pages_supported) {
> > > +    if (!vbasedev->enable_migration || !container->dirty_pages_supported) {
> > >          goto add_blocker;
> > >      }
> > >  
> > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > index 58c0ce8971e3..1349b900e513 100644
> > > --- a/hw/vfio/pci.c
> > > +++ b/hw/vfio/pci.c
> > > @@ -3194,6 +3194,8 @@ static Property vfio_pci_dev_properties[] = {
> > >                      VFIO_FEATURE_ENABLE_REQ_BIT, true),
> > >      DEFINE_PROP_BIT("x-igd-opregion", VFIOPCIDevice, features,
> > >                      VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
> > > +    DEFINE_PROP_BOOL("x-enable-migration", VFIOPCIDevice,
> > > +                     vbasedev.enable_migration, false),
> > >      DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
> > >      DEFINE_PROP_BOOL("x-balloon-allowed", VFIOPCIDevice,
> > >                       vbasedev.ram_block_discard_allowed, false),
> > > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> > > index baeb4dcff102..2119872c8af1 100644
> > > --- a/include/hw/vfio/vfio-common.h
> > > +++ b/include/hw/vfio/vfio-common.h
> > > @@ -123,6 +123,7 @@ typedef struct VFIODevice {
> > >      bool needs_reset;
> > >      bool no_mmap;
> > >      bool ram_block_discard_allowed;
> > > +    bool enable_migration;
> > >      VFIODeviceOps *ops;
> > >      unsigned int num_irqs;
> > >      unsigned int num_regions;
> > >   
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
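
For context, the property added by this patch makes vfio migration
strictly opt-in: unless the experimental knob is turned on,
vfio_migration_probe() still registers a migration blocker.  A minimal
invocation sketch follows; the binary name, PCI address and elided
options are illustrative rather than taken from the thread:

  # the host PCI address below is illustrative; substitute the device
  # actually being assigned
  qemu-system-x86_64 \
      -device vfio-pci,host=0000:65:00.0,x-enable-migration=on \
      ...

Leaving x-enable-migration at its default of false, or running against
a container without dirty page tracking support, keeps the blocker in
place.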


