Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported

From: Alex Williamson <alex.williamson@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Juan Quintela <quintela@redhat.com>,
	Avihai Horon <avihaih@nvidia.com>,
	qemu-devel@nongnu.org, "Michael S . Tsirkin" <mst@redhat.com>,
	Cornelia Huck <cohuck@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	Yishai Hadas <yishaih@nvidia.com>, Mark Bloch <mbloch@nvidia.com>,
	Maor Gottlieb <maorg@nvidia.com>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Tarun Gupta <targupta@nvidia.com>
Subject: Re: [PATCH 4/9] vfio/migration: Skip pre-copy if dirty page tracking is not supported
Date: Tue, 17 May 2022 10:00:45 -0600	[thread overview]
Message-ID: <20220517100045.27a696c9.alex.williamson@redhat.com> (raw)
In-Reply-To: <20220516230832.GP1343366@nvidia.com>

On Mon, 16 May 2022 20:08:32 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, May 16, 2022 at 02:22:00PM -0600, Alex Williamson wrote:
> > On Mon, 16 May 2022 13:22:14 +0200
> > Juan Quintela <quintela@redhat.com> wrote:
> >   
> > > Avihai Horon <avihaih@nvidia.com> wrote:  
> > > > Currently, if IOMMU of a VFIO container doesn't support dirty page
> > > > tracking, migration is blocked completely. This is because a DMA-able
> > > > VFIO device can dirty RAM pages without updating QEMU about it, thus
> > > > breaking the migration.
> > > >
> > > > However, this doesn't mean that migration can't be done at all. If
> > > > migration pre-copy phase is skipped, the VFIO device doesn't have a
> > > > chance to dirty RAM pages that have been migrated already, thus
> > > > eliminating the problem previously mentioned.
> > > >
> > > > Hence, in such case allow migration but skip pre-copy phase.
> > > >
> > > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>    
> > > 
> > > I don't know (TM).
> > > Several issues:
> > > - Patch is ugly as hell (ok, that depends on taste)
> > > - It changes migration_iteration_run() instead of directly
> > >   migration_thread.
> > > - There is already another case where we skip the sending of RAM
> > >   (localhost migration with shared memory)
> > > 
> > > In migration/ram.c:
> > > 
> > > static int ram_find_and_save_block(RAMState *rs, bool last_stage)
> > > {
> > >     PageSearchStatus pss;
> > >     int pages = 0;
> > >     bool again, found;
> > > 
> > >     /* No dirty page as there is zero RAM */
> > >     if (!ram_bytes_total()) {
> > >         return pages;
> > >     }
> > > 
> > > This is the other place where we _don't_ send any RAM at all.
> > > 
> > > I don't have a great idea about how to make things clear at a higher
> > > level, I have to think about this.  
> > 
> > It seems like if we have devices dictating what type of migrations can
> > be performed then there probably needs to be a switch to restrict use of
> > such devices just as we have the -only-migratable switch now to prevent
> > attaching devices that don't support migration.  I'd guess that we need
> > the switch to opt-in to allowing such devices to maintain
> > compatibility.  There's probably a whole pile of qapi things missing to
> > expose this to management tools as well.  Thanks,  
> 
> This is really intended to be a NOP from where things are now, as if
> you use mlx5 live migration without a patch like this then it causes a
> botched pre-copy since everything just ends up permanently dirty.
> 
> If it makes more sense we could abort the pre-copy too - in the end
> there will be dirty tracking so I don't know if I'd invest in a big
> adventure to fully define non-dirty tracking migration.

How is pre-copy currently "botched" without a patch like this?  If it's
simply that the pre-copy doesn't converge and the downtime constraints
don't allow the VM to enter stop-and-copy, that's the expected behavior
AIUI, and supports backwards compatibility with existing SLAs.

I'm assuming that by setting this new skip_precopy flag that we're
forcing the VM to move to stop-and-copy, regardless of any other SLA
constraints placed on the migration.  It seems like a better solution
would be to expose to management tools that the VM contains a device
that does not support the pre-copy phase so that downtime expectations
can be adjusted.  Thanks,

Alex