Re: [PATCH v6 09/10] hisi_acc_vfio_pci: Add support for VFIO live migration

From: Alex Williamson <alex.williamson@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>,
	"cohuck@redhat.com" <cohuck@redhat.com>,
	"mgurtovoy@nvidia.com" <mgurtovoy@nvidia.com>,
	"yishaih@nvidia.com" <yishaih@nvidia.com>,
	Linuxarm <linuxarm@huawei.com>,
	liulongfang <liulongfang@huawei.com>,
	"Zengtao (B)" <prime.zeng@hisilicon.com>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	"Wangzhou (B)" <wangzhou1@hisilicon.com>
Subject: Re: [PATCH v6 09/10] hisi_acc_vfio_pci: Add support for VFIO live migration
Date: Tue, 1 Mar 2022 12:30:47 -0700
Message-ID: <20220301123047.1171c730.alex.williamson@redhat.com>
In-Reply-To: <20220301131528.GW219866@nvidia.com>

On Tue, 1 Mar 2022 09:15:28 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Feb 28, 2022 at 09:41:10PM -0700, Alex Williamson wrote:
> 
> > > + * returning readable. ENOMSG may not be returned in STOP_COPY. Support
> > > + * for this ioctl is required when VFIO_MIGRATION_PRE_COPY is set.  
> > 
> > This entire ioctl on the data_fd seems a bit strange given the previous
> > fuss about how difficult it is for a driver to estimate their migration
> > data size.  Now drivers are forced to provide those estimates even if
> > they only intend to use PRE_COPY as an early compatibility test?  
> 
> Well, yes. PRE_COPY is designed to be general, not just to serve for
> compatibility. Qemu needs data to understand how the internal dirty
> accumulation in the device is progressing. So everything has to
> provide estimates, and at least for acc this is trivial.

But we're not really even living up to that expectation of dirty bytes
with acc afaict.  We're giving QEMU some initial data it can have
early, but it looks like the latest proposal hard codes dirty-bytes to
zero, so QEMU doesn't gain any insight into dirty accumulation, nor
does it know that the field is invalid.

Wouldn't it make more sense if initial-bytes started at QM_MATCH_SIZE
and dirty-bytes was always sizeof(vf_data) - QM_MATCH_SIZE?  ie. QEMU
would know that it has sizeof(vf_data) - QM_MATCH_SIZE remaining even
while it's getting ENOMSG after reading QM_MATCH_SIZE bytes of data.
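
As a rough illustration of the accounting suggested above, here is a
minimal userspace sketch.  The sizes, the struct, and the helper name
are all hypothetical and are not taken from the acc driver; it also
assumes initial-bytes counts down as QEMU reads the match data, which
is one reading of "started at QM_MATCH_SIZE".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sizes for illustration; the real ones live in the driver. */
#define SKETCH_QM_MATCH_SIZE	32u	/* stands in for QM_MATCH_SIZE */
#define SKETCH_VF_DATA_SIZE	512u	/* stands in for sizeof(vf_data) */

/* Hypothetical report layout: the two fields under discussion. */
struct sketch_precopy_info {
	uint64_t initial_bytes;
	uint64_t dirty_bytes;
};

/*
 * pos is the number of bytes already read from the data_fd while in
 * PRE_COPY.  Only the match data is readable there, so pos can never
 * exceed the match size.
 */
static struct sketch_precopy_info sketch_precopy_report(uint64_t pos)
{
	struct sketch_precopy_info info;

	assert(pos <= SKETCH_QM_MATCH_SIZE);

	/* Match data still to be read in PRE_COPY. */
	info.initial_bytes = SKETCH_QM_MATCH_SIZE - pos;
	/* Everything else is deferred to STOP_COPY and never shrinks here. */
	info.dirty_bytes = SKETCH_VF_DATA_SIZE - SKETCH_QM_MATCH_SIZE;

	return info;
}
```

With these numbers, dirty-bytes stays at 480 even after all 32 bytes of
match data have been read and reads start returning ENOMSG.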

> > Obviously it's trivial for the acc driver that doesn't support dirty
> > tracking and only has a fixed size migration structure, but it seems to
> > contradict your earlier statements.   
> 
> mlx5 knows exactly this data size once it completes entering
> STOP_COPY, it has a migf->total_size just like acc, so no problem to
> generate this ioctl. We just don't have a use case for it and qemu
> would never call it, so trying not to add dead things to the kernel.
> 
> Are you talking about the prior discussion about getting this data
> before reaching STOP_COPY?

That would be the ideal scenario.  But if knowing this information once
we're in STOP_COPY continues to be useful, which I infer from the ioctl
remaining operational there, then why don't we switch to an ioctl that
just reports bytes-remaining at that point, rather than contorting a
STOP_COPY data representation into initial-bytes and dirty-bytes?  If
that's not useful yet and you don't want to add dead kernel code, then
let's define that this ioctl is only available in the PRE_COPY* states
and returns -errno in the STOP_COPY state.

> > For instance, can mlx5 implement a PRE_COPY solely for compatibility
> > testing or is it blocked by an inability to provide data estimates
> > for this ioctl?  
> 
> I expect it can, it works very similar to acc. It just doesn't match
> where we are planning for compatibility. mlx5 has a more dynamic
> compatibility requirement, it needs to be exposed to orchestration not
> hidden in pre_copy. acc looks like it is static, so 'have acc' is
> enough info for orchestration.
> 
> > Now if we propose that this ioctl is useful during the STOP_COPY phase,
> > how does a non-PRE_COPY driver opt-in to that beneficial use case?    
> 
> Just implement it - userspace will learn if the driver supports it on
> the first ioctl = ENOTTY means no support.
> 
> > Do we later add a different, optional ioctl for non-PRE_COPY and
> > then require userspace to support two different methods of getting
> > remaining data estimates for a device in STOP_COPY?  
> 
> I wouldn't add a new ioctl unless we discover a new requirement when
> an implementation is made.

So let's define that this ioctl is only valid in PRE_COPY* states and
return -errno in STOP_COPY so that we have consistency between all
devices in STOP_COPY and let's also define if there's actually anything
userspace can infer about remaining STOP_COPY data size while in
PRE_COPY* via this ioctl.  For example, is dirty-bytes zero or the
remaining data structure size?
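
A minimal sketch of the state gating, assuming a simplified state enum
and helper name that are purely illustrative.  -EINVAL is only a
placeholder here, since which errno to return is still undecided.

```c
#include <errno.h>

/* Hypothetical, simplified stand-in for the device migration states. */
enum sketch_mig_state {
	SKETCH_RUNNING,
	SKETCH_PRE_COPY,
	SKETCH_PRE_COPY_P2P,
	SKETCH_STOP_COPY,
};

/*
 * Returns 0 if the estimate ioctl may be serviced in this state, or a
 * negative errno otherwise.  Only the PRE_COPY* states are accepted, so
 * every device behaves the same way once it reaches STOP_COPY.
 */
static int sketch_precopy_info_allowed(enum sketch_mig_state state)
{
	switch (state) {
	case SKETCH_PRE_COPY:
	case SKETCH_PRE_COPY_P2P:
		return 0;
	default:
		return -EINVAL;	/* placeholder errno, an assumption */
	}
}
```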

...
> > I'm sure that raises questions about how we correlate a
> > PRE_COPY* session to a STOP_COPY session though, but this PRE_COPY*
> > specific but ongoing usage in STOP_COPY ioctl seems ad-hoc.  
> 
> I do not think it is "pre_copy specific" - the ioctl returns the
> estimated length of the data_fd, this is always a valid concept.

Yes, some sort of remaining-bytes is always a valid concept.  Splitting
that into initial-bytes and dirty-bytes doesn't make any sense at
STOP_COPY though.  If there's no use case for this ioctl in STOP_COPY
and the partitioning of data exposed in PRE_COPY* mode is fundamentally
specific to pre-copy support, why not disable the ioctl with the
intention to replace it with a common one specific to STOP_COPY once
there's a userspace use case?  Thanks,

Alex