All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alex Williamson <alex.williamson@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-crypto@vger.kernel.org" <linux-crypto@vger.kernel.org>,
	"cohuck@redhat.com" <cohuck@redhat.com>,
	"mgurtovoy@nvidia.com" <mgurtovoy@nvidia.com>,
	"yishaih@nvidia.com" <yishaih@nvidia.com>,
	Linuxarm <linuxarm@huawei.com>,
	liulongfang <liulongfang@huawei.com>,
	"Zengtao (B)" <prime.zeng@hisilicon.com>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	"Wangzhou (B)" <wangzhou1@hisilicon.com>
Subject: Re: [PATCH v6 09/10] hisi_acc_vfio_pci: Add support for VFIO live migration
Date: Tue, 1 Mar 2022 12:30:47 -0700	[thread overview]
Message-ID: <20220301123047.1171c730.alex.williamson@redhat.com> (raw)
In-Reply-To: <20220301131528.GW219866@nvidia.com>

On Tue, 1 Mar 2022 09:15:28 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Feb 28, 2022 at 09:41:10PM -0700, Alex Williamson wrote:
> 
> > > + * returning readable. ENOMSG may not be returned in STOP_COPY. Support
> > > + * for this ioctl is required when VFIO_MIGRATION_PRE_COPY is set.  
> > 
> > This entire ioctl on the data_fd seems a bit strange given the previous
> > fuss about how difficult it is for a driver to estimate their migration
> > data size.  Now drivers are forced to provide those estimates even if
> > they only intend to use PRE_COPY as an early compatibility test?  
> 
> Well, yes. PRE_COPY is designed to be general, not just to serve for
> compatability. Qemu needs data to understand how the internal dirty
> accumulation in the device is progressing. So everything has to
> provide estimates, and at least for acc this is trivial.

But we're not really even living up to that expectation of dirty bytes
with acc afaict.  We're giving QEMU some initial data it can have
early, but it looks like the latest proposal hard codes dirty-bytes to
zero, so QEMU doesn't gain any insight into dirty accumulation, nor
does it know that the field is invalid.

Wouldn't it make more sense if initial-bytes started at QM_MATCH_SIZE
and dirty-bytes was always sizeof(vf_data) - QM_MATCH_SIZE?  ie. QEMU
would know that it has sizeof(vf_data) - QM_MATCH_SIZE remaining even
while it's getting ENOMSG after reading QM_MATCH_SIZE bytes of data.

> > Obviously it's trivial for the acc driver that doesn't support dirty
> > tracking and only has a fixed size migration structure, but it seems to
> > contradict your earlier statements.   
> 
> mlx5 knows exactly this data size once it completes entering
> STOP_COPY, it has a migf->total_size just like acc, so no problem to
> generate this ioctl. We just don't have a use case for it and qemu
> would never call it, so trying not to add dead things to the kernel.
> 
> Are you are talking about the prior discussion about getting this data
> before reaching STOP_COPY?

That would be the ideal scenario, but again if knowing this information
once we're in STOP_COPY continues to be useful, which I infer by the
ongoing operation of this ioctl, then why don't we switch to an ioctl
that just reports bytes-remaining at that point rather than trying to
fit the square peg in the round hole to contort a STOP_COPY data
representation into initial-bytes and dirty-bytes?  If that's not
useful yet and you don't want to add dead kernel code, then let's
define that this ioctl is only available in the PRE_COPY* states and
returns -errno in the STOP_COPY state.

> > For instance, can mlx5 implement a PRE_COPY solely for compatibility
> > testing or is it blocked by an inability to provide data estimates
> > for this ioctl?  
> 
> I expect it can, it works very similar to acc. It just doesn't match
> where we are planning for compatability. mlx5 has a more dynamic
> compatability requirement, it needs to be exposed to orchestration not
> hidden in pre_copy. acc looks like it is static, so 'have acc' is
> enough info for orchestration.
> 
> > Now if we propose that this ioctl is useful during the STOP_COPY phase,
> > how does a non-PRE_COPY driver opt-in to that beneficial use case?    
> 
> Just implement it - userspace will learn if the driver supports it on
> the first ioctl = ENOTTY means no support.
> 
> > Do we later add a different, optional ioctl for non-PRE_COPY and
> > then require userspace to support two different methods of getting
> > remaining data estimates for a device in STOP_COPY?  
> 
> I wouldn't add a new ioctl unless we discover a new requirement when
> an implementation is made.

So let's define that this ioctl is only valid in PRE_COPY* states and
return -errno in STOP_COPY so that we have consistency between all
devices in STOP_COPY and let's also define if there's actually anything
userspace can infer about remaining STOP_COPY data size while in
PRE_COPY* via this ioctl.  For example, is dirty-bytes zero or the
remaining data structure size?

...
> > I'm sure that raises questions about how we correlate a
> > PRE_COPY* session to a STOP_COPY session though, but this PRE_COPY*
> > specific but ongoing usage in STOP_COPY ioctl seems ad-hoc.  
> 
> I do not think it is "pre_copy specific" - the ioctl returns the
> estimated length of the data_fd, this is always a valid concept.

Yes, some sort of remaining-bytes is always a valid concept.  Splitting
that into initial-bytes and dirty-bytes doesn't make any sense at
STOP_COPY though.  If there's no use case for this ioctl in STOP_COPY
and the partitioning of data exposed in PRE_COPY* mode is fundamentally
specific to pre-copy support, why not disable the ioctl with the
intention to replace it with a common one specific to STOP_COPY once
there's a userspace use case?  Thanks,

Alex


  reply	other threads:[~2022-03-01 19:30 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-28  9:01 [PATCH v6 00/10] vfio/hisilicon: add ACC live migration driver Shameer Kolothum
2022-02-28  9:01 ` [PATCH v6 01/10] crypto: hisilicon/qm: Move the QM header to include/linux Shameer Kolothum
2022-02-28  9:01 ` [PATCH v6 02/10] crypto: hisilicon/qm: Move few definitions to common header Shameer Kolothum
2022-02-28  9:01 ` [PATCH v6 03/10] hisi_acc_qm: Move PCI device IDs " Shameer Kolothum
2022-02-28 17:33   ` Alex Williamson
2022-02-28 20:12     ` Bjorn Helgaas
2022-02-28 20:23       ` Alex Williamson
2022-02-28 20:55         ` Bjorn Helgaas
2022-02-28  9:01 ` [PATCH v6 04/10] hisi_acc_vfio_pci: add new vfio_pci driver for HiSilicon ACC devices Shameer Kolothum
2022-02-28  9:01 ` [PATCH v6 05/10] hisi_acc_vfio_pci: Restrict access to VF dev BAR2 migration region Shameer Kolothum
2022-02-28  9:01 ` [PATCH v6 06/10] hisi_acc_vfio_pci: Add helper to retrieve the struct pci_driver Shameer Kolothum
2022-02-28  9:01 ` [PATCH v6 07/10] vfio: Extend the device migration protocol with PRE_COPY Shameer Kolothum
2022-02-28  9:01 ` [PATCH v6 08/10] crypto: hisilicon/qm: Set the VF QM state register Shameer Kolothum
2022-02-28  9:01 ` [PATCH v6 09/10] hisi_acc_vfio_pci: Add support for VFIO live migration Shameer Kolothum
2022-02-28 14:57   ` Jason Gunthorpe
2022-02-28 18:01     ` Shameerali Kolothum Thodi
2022-02-28 18:05       ` Jason Gunthorpe
2022-02-28 20:16         ` Alex Williamson
2022-02-28 20:29           ` Jason Gunthorpe
2022-02-28 21:20             ` Alex Williamson
2022-02-28 23:47               ` Jason Gunthorpe
2022-03-01  4:41                 ` Alex Williamson
2022-03-01 13:15                   ` Jason Gunthorpe
2022-03-01 19:30                     ` Alex Williamson [this message]
2022-03-01 20:39                       ` Jason Gunthorpe
2022-03-01 22:44                         ` Alex Williamson
2022-03-02  0:03                           ` Jason Gunthorpe
2022-03-02  9:07                             ` Shameerali Kolothum Thodi
2022-02-28  9:01 ` [PATCH v6 10/10] hisi_acc_vfio_pci: Use its own PCI reset_done error handler Shameer Kolothum

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220301123047.1171c730.alex.williamson@redhat.com \
    --to=alex.williamson@redhat.com \
    --cc=cohuck@redhat.com \
    --cc=jgg@nvidia.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxarm@huawei.com \
    --cc=liulongfang@huawei.com \
    --cc=mgurtovoy@nvidia.com \
    --cc=prime.zeng@hisilicon.com \
    --cc=shameerali.kolothum.thodi@huawei.com \
    --cc=wangzhou1@hisilicon.com \
    --cc=yishaih@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.