From: Jason Gunthorpe <jgg@nvidia.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>,
	Cornelia Huck <cohuck@redhat.com>,
	Yishai Hadas <yishaih@nvidia.com>,
	bhelgaas@google.com, saeedm@nvidia.com,
	linux-pci@vger.kernel.org, kvm@vger.kernel.org,
	netdev@vger.kernel.org, kuba@kernel.org, leonro@nvidia.com,
	kwankhede@nvidia.com, mgurtovoy@nvidia.com, maorg@nvidia.com,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>
Subject: Re: [PATCH V2 mlx5-next 12/14] vfio/mlx5: Implement vfio_pci driver for mlx5 devices
Date: Tue, 16 Nov 2021 21:48:31 -0400
Message-ID: <20211117014831.GH2105516@nvidia.com>
In-Reply-To: <20211116141031.443e8936.alex.williamson@redhat.com>

On Tue, Nov 16, 2021 at 02:10:31PM -0700, Alex Williamson wrote:
> On Tue, 16 Nov 2021 15:25:05 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Tue, Nov 16, 2021 at 10:57:36AM -0700, Alex Williamson wrote:
> > 
> > > > I think userspace should decide if it wants to use mlx5 built in or
> > > > the system IOMMU to do dirty tracking.  
> > > 
> > > What information does userspace use to inform such a decision?  
> > 
> > Kernel can't know which approach performs better. Operators should
> > benchmark and make a choice for their deployment HW. Maybe device
> > tracking severely impacts device performance or vice versa.
> 
> I'm all for keeping policy decisions out of the kernel, but it's pretty
> absurd to expect a userspace operator to benchmark various combination
> and wire various knobs through the user interface for this. 

Huh? This is standard operating procedure in netdev land, mm and
others. Max performance requires tunables, and here we really do care
a lot about migration time. I've seen enough of the mm complexity
supporting normal vCPU migration to know this is a high priority for
many people.

> It seems to me that the kernel, ie. the vfio variant driver, *can*
> know the best default.  

If the kernel can have an algorithm to find the best default then qemu
can implement the same algorithm too. I'm also skeptical of the claim
that the best default is even knowable.

> > Kernel doesn't easily know what userspace has done, maybe one device
> > supports migration driver dirty tracking and one device does not.
> 
> And that's exactly why the current type1 implementation exposes the
> least common denominator to userspace, ie. pinned pages only if all
> devices in the container have enabled this degree of granularity.

The current implementation forces no dirty tracking. That is a policy
choice that denies userspace the option to run one device with dirty
tracking and simply suspend the other.

> > So, I would like to see userspace control most of the policy aspects,
> > including the dirty track provider.
> 
> This sounds like device specific migration parameter tuning via a
> devlink interface to me, tbh.  How would you propose a generic
> vfio/iommufd interface to tune this sort of thing?

As I said, if each dirty page tracking provider has its own interface
it is straightforward to make policy in userspace. The only tunable
beyond that is the dirty page tracking granularity. That is already
provided by userspace, but not as a parameter during START.
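
Purely as an illustration, passing the granularity at START could look
something like the sketch below. The struct is invented; today the page
size is only supplied along with GET_BITMAP:

  /*
   * Hypothetical, for illustration only: a START variant that carries
   * the tracking granularity up front.  No such struct exists today.
   */
  struct vfio_iommu_dirty_start {
          __u32 argsz;
          __u32 flags;       /* VFIO_IOMMU_DIRTY_PAGES_FLAG_START */
          __u64 page_size;   /* requested dirty tracking granularity */
  };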

I don't see why we'd need something as complicated as devlink just
yet.

> > How does userspace know if dirty tracking works or not? All I see is
> > that VFIO_IOMMU_DIRTY_PAGES_FLAG_START unconditionally allocates some
> > bitmaps.
> 
> IIRC, it's always supported by type1.  In the worst case we always
> report all mapped pages as dirty.

Again this denies userspace a policy choice. Why do dirty tracking
gyrations if they don't work? Just directly suspend the devices and
then copy.

> > I'm surprised it doesn't check that only NO_IOMMU's devices are
> > attached to the container and refuse to dirty track otherwise - since
> > it doesn't work..
> 
> No-IOMMU doesn't use type1, the ioctl returns errno.

Sorry, I mistyped that; I meant the emulated IOMMU, as HCH has called it:

vfio_register_emulated_iommu_dev()

> > When we mix dirty track with pre-copy, the progression seems to be:
> > 
> >   DIRTY TRACKING | RUNNING
> >      Copy every page to the remote
> >   DT | SAVING | RUNNING
> >      Copy pre-copy migration data to the remote
> >   SAVING | NDMA | RUNNING
> >      Read and clear dirty track device bitmap
> >   DT | SAVING | RUNNING
> >      Copy new dirtied data
> >      (maybe loop back to NDMA a few times?)
> >   SAVING | NDMA | RUNNING
> >      P2P grace state
> >   0
> >     Read the dirty track and copy data
> >     Read and send the migration state
> > 
> > Can we do something so complex using only SAVING?
> 
> I'm not demanding that triggering device dirty tracking on saving is
> how this must be done, I'm only stating that's an idea that was
> discussed.  If we need more complicated triggers between the IOMMU and
> device, let's define those, but I don't see that doing so negates the
> benefits of aggregated dirty bitmaps in the IOMMU context.

Okay. As far as your request to document things as we see them coming,
I believe we should have some idea how dirty tracking control fits in.
I agree that is not related to how the bitmap is reported. We will
continue to think about dirty tracking as not connected to SAVING.
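
To make the progression quoted above concrete, it could look roughly
like the sketch below from userspace. VFIO_DEVICE_STATE_RUNNING/SAVING
are the existing bits; VFIO_DEVICE_STATE_NDMA, set_device_state() and
the tracker helpers are stand-ins invented for the sketch:

  /* set_device_state() stands in for writing device_state in the
   * migration region; NDMA is the proposed, not yet existing, bit. */
  start_dirty_tracking(trackers);          /* DIRTY TRACKING | RUNNING */
  copy_all_pages();

  set_device_state(dev, VFIO_DEVICE_STATE_SAVING |
                        VFIO_DEVICE_STATE_RUNNING);
  copy_precopy_data(dev);                  /* DT | SAVING | RUNNING */

  do {
          set_device_state(dev, VFIO_DEVICE_STATE_SAVING |
                                VFIO_DEVICE_STATE_NDMA |
                                VFIO_DEVICE_STATE_RUNNING);
          read_and_clear_dirty(trackers);  /* device quiesced */
          set_device_state(dev, VFIO_DEVICE_STATE_SAVING |
                                VFIO_DEVICE_STATE_RUNNING);
          copy_dirty_pages();              /* DT | SAVING | RUNNING */
  } while (!converged());

  set_device_state(dev, VFIO_DEVICE_STATE_SAVING |
                        VFIO_DEVICE_STATE_NDMA |
                        VFIO_DEVICE_STATE_RUNNING);  /* P2P grace state */
  set_device_state(dev, 0);                /* stopped */
  copy_dirty_pages();
  read_migration_state(dev);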

> > This creates inefficiencies in the kernel, we copy from the mlx5
> > formed data structure to new memory in the iommu through a very
> > inefficient API and then again we do an ioctl to copy it once more and
> > throw all the extra work away. It does not seem good for something
> > where we want performance.
> 
> So maybe the dirty bitmaps for the IOMMU context need to be exposed to
> and directly modifiable by the drivers using atomic bitmap ops.  Maybe
> those same bitmaps can be mmap'd to userspace.  These bitmaps are not
> insignificant, do we want every driver managing their own copies?

If we look at mlx5, for example, there is no choice. The dirty log is
in a device-specific format and is not shareable. We must allocate
memory to work with it.

What I don't need is the bitmap memory in the iommu container; that is
all useless for mlx5.

So, down this path we need some API for the iommu context to skip
allocating its own memory entirely and instead refer to storage from
the tracking provider, for the cases where that makes sense.
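
As a rough sketch of the sort of thing I mean (every name below is
invented, nothing like this exists today):

  /*
   * Hypothetical: a tracking provider hands the container a reference
   * to bitmap storage it already owns, instead of type1 allocating a
   * bitmap per vfio_dma and copying into it.
   */
  struct dirty_bitmap_ref {
          unsigned long *bits;   /* provider-owned storage */
          u64 iova;              /* first IOVA the bitmap covers */
          u64 page_size;         /* tracking granularity */
          u64 npages;            /* number of bits */
  };

  /* The container reports dirty pages from @ref instead of its own copy */
  int vfio_iommu_adopt_dirty_bitmap(struct vfio_iommu *iommu,
                                    struct dirty_bitmap_ref *ref);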

> > Userspace has to do this anyhow if it has configurations with multiple
> > containers. For instance because it was forced to split the containers
> > due to one device not supporting NDMA.
> 
> Huh?  When did that become a requirement?  

It is not a requirement, it is something userspace can do if it
wants. And we talked about this: if NDMA isn't supported then P2P can't
work, and one way to enforce that is to not include the P2P ranges in
the IOMMU mapping. Except if NDMA support is asymmetric you may want an
asymmetric IOMMU mapping too, where the NDMA-capable devices can do P2P
and the others can't. That is two containers and two dirty bitmaps.

Who knows? It isn't the job of the kernel to make these choices; the
kernel just provides the tools.

> I feel like there are a lot of excuses listed here, but nothing that
> really demands a per device interface, 

"excuses" and "NIH" is a bit rude Alex.

From my side, you are asking for a lot work from us (who else?) to
define and implement a wack of missing kernel functionality.

I think it is very reasonable to ask what the return to the community
is for this work. "makes more sense to me" is not something that
is really compelling.

So, I'd rather you tell me concretely why doing this work, in this
way, is a good idea.

> > If the migration driver says it supports tracking, then it only tracks
> > DMA from that device.
> 
> I don't see what this buys us.  Userspace is only going to do a
> migration if all devices support the per device migration region.  At
> that point we need the best representation of the dirty bitmap we can
> provide per IOMMU context.  It makes sense to me to aggregate per
> device dirtying into that one context.

Again, there are policy choices userspace can make, like just
suspending the device that doesn't do dirty tracking and continuing to
dirty track the ones that do.

This might be very logical if the non-dirty-tracking device is
something like IDXD, where suspending it just causes higher request
latency, while the dirty-tracking device is mlx5, where suspending it
would cause externally visible artifacts.

My point is *we don't know* what people will want.

I also think you are too focused on a one-size-fits-all solution aimed
at a qemu sort of enterprise product. While I can agree with your
position relative to an enterprise-style customer, NVIDIA has different
customers, many of which have their own customized VMMs tuned for their
HW choices. For these customers I do like the kernel to allow options,
and I don't think it is appropriate to be so dismissive of this
perspective.

> > What I see is a lot of questions and limitations with this
> > approach. If we stick to funneling everything through the iommu then
> > answering the questions seem to create a large amount of kernel
> > work. Enough to ask if it is worthwhile..
> 
> If we need a common userspace IOMMU subsystem like IOMMUfd that can
> handle driver page pinning, IOMMU faults, and dirty tracking, why does
> it suddenly become an unbearable burden to allow other means besides
> page pinning for a driver to relay DMA page writes?

I can see some concrete reasons for iommufd, like it allows this code
to be shared between multiple subsystems that need it. Its design
supports more functionality than the vfio container can.

Here, I don't quite see it. Whether userspace does

  for (i = 0; i != num_trackers; i++)
     ioctl(merge dirty bitmap, i, &bitmap)

or

  ioctl(read dirty bitmap, &bitmap)

hardly seems decisive. What bothers me is the overhead and kernel
complexity.

If we split them it looks quite simple (a rough sketch follows the
list):
 - A new ioctl on the vfio device to read & clear the dirty bitmap,
   plus an extension flag to show this is supported
 - A DIRTY_TRACK migration state bit to start/stop tracking, with
   simple logic that reading is only possible in NDMA/!RUNNING
 - A new ioctl on the vfio device to report dirty tracking capability
   flags:
    - internal dirty tracking supported
    - real dirty tracking through the attached container supported
      (only mdev can set this today)
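
Purely illustrative; every name below is invented and is not a proposal
for the actual uAPI spelling:

  /* Hypothetical capability flags reported per device */
  #define VFIO_DEVICE_DIRTY_CAP_INTERNAL   (1 << 0)  /* device tracker */
  #define VFIO_DEVICE_DIRTY_CAP_CONTAINER  (1 << 1)  /* via pinning    */

  /* Hypothetical per-device read & clear, only valid in NDMA/!RUNNING */
  struct vfio_device_dirty_bitmap {
          __u32 argsz;
          __u32 flags;
          __u64 iova;         /* start of the range to report    */
          __u64 size;         /* length of the range in bytes    */
          __u64 page_size;    /* tracking granularity            */
          __u64 bitmap;       /* userspace pointer, one bit/page */
  };

  /* e.g. an mlx5 plus an mdev: userspace ORs the per-device results */
  for (i = 0; i != num_trackers; i++) {
          bmp.bitmap = (__u64)(uintptr_t)per_device_bitmap;
          ioctl(tracker_fd[i], VFIO_DEVICE_DIRTY_BITMAP_READ_CLEAR, &bmp);
          or_into(merged_bitmap, per_device_bitmap, nbits); /* user helper */
  }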

Scenarios:

a. If userspace has an mlx5 and an mdev then it does a for loop, as
   above, to start and read the dirty bitmaps.

b. If userspace has only mlx5 it reads one bitmap.

c. If userspace has mlx5 and some other PCI device it can activate the
   mlx5 tracker and leave the container alone, suspending the PCI
   device early. Or it can give up on dirty tracking entirely.

For funneling through the container ioctls... Hmm:
 - Still track, per device/container connection, whether
   internal/external dirty tracking is supported. Add a new ioctl so
   userspace can have this info for (c)
 - For external dirty tracking have some ops callbacks for start/stop
   and read & clear of the bitmap (sketched after this list). (So, no
   migration state flag?)
 - On start make the policy choice of whether external or internal
   tracking will be used, then negotiate some uniform tracking
   granularity and iterate over all externals to call start
 - On read, to avoid overhead, iterate over the internal bitmap, invoke
   the read ops on all external bitmaps and OR them together, then copy
   to user. Just ignore NDMA and rely on userspace to do it right?
 - Clear iterates and zeroes the bitmaps
 - Change the logic to not allocate tracking bitmaps if there are no
   mdevs
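
A minimal sketch of what those external-tracker callbacks might look
like, assuming the container drives them; all of the names are
invented:

  /*
   * Hypothetical ops a migration driver could register with its
   * container so the container ioctls can drive an external tracker.
   * Nothing below exists in the kernel today.
   */
  struct vfio_dirty_tracker_ops {
          /* Begin tracking at the negotiated granularity */
          int (*start)(struct vfio_device *vdev, u64 page_size);
          void (*stop)(struct vfio_device *vdev);
          /*
           * OR the device's dirty IOVAs in [iova, iova + size) into
           * the container-owned bitmap and clear the device log.
           */
          int (*read_and_clear)(struct vfio_device *vdev, u64 iova,
                                u64 size, unsigned long *bitmap);
  };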

a. Works the same; the kernel turns on both trackers.
b. Works the same; the kernel turns on only mlx5.
c. Hmm. We have to disable the 'no tracker, report everything as
   dirty' behavior somehow so we can access only the mlx5 tracker
   without forcing every page to be seen as dirty. Needs details.

And figure out how this relates to the ongoing iommufd project (which,
BTW, we have now invested a lot in too).

I'm not as convinced as you are that the second option is obviously
better, and on principle I don't like avoidable policy choices baked
into the kernel.

And I don't see that it really makes userspace much simpler anyhow. On
the other hand it looks like a lot more kernel work.

> OTOH, aggregating these features in the IOMMU reduces both overhead
> of per device bitmaps and user operations to create their own
> consolidated view.

I don't understand this; funneling will not reduce overhead. At best,
with some work, we can almost break even by not allocating the SW
bitmaps.

Jason
