From: Jason Gunthorpe <jgg@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>,
	Cornelia Huck <cohuck@redhat.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	bhelgaas@google.com, saeedm@nvidia.com,
	linux-pci@vger.kernel.org, kvm@vger.kernel.org,
	netdev@vger.kernel.org, kuba@kernel.org, leonro@nvidia.com,
	kwankhede@nvidia.com, mgurtovoy@nvidia.com, maorg@nvidia.com,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>
Subject: Re: [PATCH V2 mlx5-next 12/14] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
Date: Tue, 16 Nov 2021 15:25:05 -0400
Message-ID: <20211116192505.GB2105516@nvidia.com>
In-Reply-To: <20211116105736.0388a183.alex.williamson@redhat.com>

On Tue, Nov 16, 2021 at 10:57:36AM -0700, Alex Williamson wrote:

> > I think userspace should decide if it wants to use mlx5 built in or
> > the system IOMMU to do dirty tracking.
> 
> What information does userspace use to inform such a decision?

The kernel can't know which approach performs better. Operators should
benchmark and make a choice for their deployment HW. Maybe device
tracking severely impacts device performance, or vice versa.

The kernel also doesn't easily know what userspace has done; maybe one
device supports migration driver dirty tracking and another does not.

Is userspace going to use a system IOMMU for both devices?

Is it going to put the simple device into NDMA early and continue
dirty tracking while it shuts down the other devices?

> Ultimately userspace just wants the finest granularity of tracking,
> shouldn't that guide our decisions which to provide?

At least for mlx5 there is going to be some trade-off curve of device
performance, dirty tracking page size, and working set.

Even "lower is better" is not necessarily true. After overheads on a
400Gb RDMA NIC there is not such a big difference between doing a 4k
and a 16k scatter transfer. The CPU work to process all the extra
bitmap data may not be a net win compared to the block transfer times.

Conversely, someone doing 1G TCP transfers probably cares a lot about
minimizing the block size.

Overall, I think there is far too much up in the air and unmeasured to
firmly commit the kernel to a fixed policy.

So, I would like to see userspace control most of the policy aspects,
including the dirty track provider.

> I believe the intended progression of dirty tracking is that by default
> all mapped ranges are dirty.  If the device supports page pinning, then
> we reduce the set of dirty pages to those pages which are pinned.  A
> device that doesn't otherwise need page pinning, such as a fully IOMMU

How does userspace know if dirty tracking works or not? All I see is
that VFIO_IOMMU_DIRTY_PAGES_FLAG_START unconditionally allocates some
bitmaps.

I'm surprised it doesn't check that only NO_IOMMU devices are attached
to the container and refuse to dirty track otherwise - since it
doesn't work..

> backed device, would use gratuitous page pinning triggered by the
> _SAVING state activation on the device.  It sounds like mlx5 could use
> this existing support today.

How does mlx5 know if it should turn on its dirty page tracking on
SAVING, or if the system IOMMU covers it? Or what if, for some reason,
userspace doesn't want dirty tracking but is doing pre-copy?

When we mix dirty track with pre-copy, the progression seems to be:

  DIRTY TRACKING | RUNNING
     Copy every page to the remote
  DT | SAVING | RUNNING
     Copy pre-copy migration data to the remote
  SAVING | NDMA | RUNNING
     Read and clear dirty track device bitmap
  DT | SAVING | RUNNING
     Copy new dirtied data
     (maybe loop back to NDMA a few times?)
  SAVING | NDMA | RUNNING
     P2P grace state
  0
    Read the dirty track and copy data
    Read and send the migration state

Can we do something so complex using only SAVING?

.. and along the lines of the above, how do we mix NDMA into the iommu
container, and how does it work if only some devices support NDMA?

> We had also discussed variants to page pinning that might be more
> useful as device dirty page support improves.  For example calls to
> mark pages dirty once rather than the perpetual dirtying of pinned
> pages, calls to pin pages for read vs write, etc.  We didn't dive much
> into system IOMMU dirtying, but presumably we'd have a fault handler
> triggered if a page is written by the device and go from there.

It would be interesting to know for sure what current IOMMU HW has
done. I'm supposing the easiest implementation is to write a dirty bit
to the IO PTE, the same as the CPU writes a dirty bit to the normal
PTE.

> > In light of all this I'm wondering if device dirty tracking should
> > exist as new ioctls on the device FD and reserve the type1 code to
> > only work the IOMMU dirty tracking.
> 
> Our existing model is working towards the IOMMU, ie. container,
> interface aggregating dirty page context.  

This creates inefficiencies in the kernel: we copy from the mlx5
formed data structure to new memory in the iommu through a very
inefficient API, and then we do an ioctl to copy it once more and
throw all the extra work away. That does not seem good for something
where we want performance.

> For example when page pinning is used, it's only when all devices
> within the container are using page pinning that we can report the
> pinned subset as dirty.  Otherwise userspace needs to poll each
> device, which I suppose enables your idea that userspace decides
> which source to use, but why?

Efficiency, and user selectable policy.

Userspace can just allocate an all-zeros bitmap and feed it to each of
the providers in the kernel using an 'or in your dirty' semantic.

No redundant kernel data marshaling, userspace gets to decide which
tracking provider to use, and it is simple to implement in the kernel.

Userspace has to do this anyhow if it has configurations with multiple
containers, for instance because it was forced to split the containers
due to one device not supporting NDMA.

> Does the IOMMU dirty page tracking exclude devices if the user
> queries the device separately?  

What makes sense to me is multiple tracking providers. Each can be
turned on and off.

If the container tracking provider says it supports tracking, then it
means it can track DMA from every device it is connected to (unlike
today?), e.g. by using IOMMU HW that naturally does this, or by having
only NO_IOMMU devices.

If the migration driver says it supports tracking, then it only tracks
DMA from that device.

> How would it know?  What's the advantage?  It seems like this
> creates too many support paths that all need to converge on the same
> answer.  Consolidating DMA dirty page tracking to the DMA mapping
> interface for all devices within a DMA context makes more sense to
> me.

What I see is a lot of questions and limitations with this
approach. If we stick to funneling everything through the iommu then
answering the questions seems to create a large amount of kernel
work. Enough to ask if it is worthwhile..

.. and then we have to ask how does this all work in IOMMUFD where it
is not so reasonable to tightly couple the migration driver and the
IOAS and I get more questions :)

Jason

