Re: [net-next v4 00/15] Add mlx5 subfunction support

From: Jason Gunthorpe <jgg@nvidia.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Saeed Mahameed <saeed@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Leon Romanovsky <leonro@nvidia.com>,
	Netdev <netdev@vger.kernel.org>, <linux-rdma@vger.kernel.org>,
	David Ahern <dsahern@kernel.org>,
	Jacob Keller <jacob.e.keller@intel.com>,
	Sridhar Samudrala <sridhar.samudrala@intel.com>,
	"Ertman, David M" <david.m.ertman@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Kiran Patil <kiran.patil@intel.com>,
	Greg KH <gregkh@linuxfoundation.org>
Subject: Re: [net-next v4 00/15] Add mlx5 subfunction support
Date: Wed, 16 Dec 2020 20:38:29 -0400
Message-ID: <20201217003829.GN552508@nvidia.com>
In-Reply-To: <CAKgT0UfuSA9PdtR6ftcq0_JO48Yp4N2ggEMiX9zrXkK6tN4Pmw@mail.gmail.com>

On Wed, Dec 16, 2020 at 02:53:07PM -0800, Alexander Duyck wrote:

> It isn't about the association, it is about who is handling the
> traffic. Going back to the macvlan model what we did is we had a group
> of rings on the device that would automatically forward unicast
> packets to the macvlan interface and would be reserved for
> transmitting packets from the macvlan interface. We took care of
> multicast and broadcast replication in software.

Okay, maybe I'm starting to see where you are coming from.

First, some clarity. As I see it, the devlink infrastructure is all
about creating the auxdevice for a switchdev port.

What goes into that auxdevice is *completely* up to the driver. mlx5
is doing a SF which == VF, but that is not a requirement of the design
at all.

If an Intel driver wants to put a queue block into the aux device and
that is != VF, it is just fine.

The Intel netdev that binds to the auxdevice can transform the queue
block and specific switchdev config into a netdev identical to
accelerated macvlan. Nothing about this breaks the switchdev model.

Essentially think of it as generalizing the acceleration plugin for a
netdev. Instead of making something specific to limited macvlan, the
driver gets to provide exactly the structure that matches its HW to
provide the netdev as the user side of the switchdev port. I see no
limitation here so long as the switchdev model for controlling traffic
is followed.
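
As a rough illustration of that split, here is a toy user-space model
in Python. It is not kernel code, and every class, function and match
name in it (AuxDevice, bind, "vendor_a.sf", "vendor_b.vmdq", ...) is
made up for the example. The only point is that the payload behind
the auxdevice differs per driver while the user-facing netdev
interface stays the same.

```python
# Toy user-space model; every class and name here is hypothetical,
# none of it is kernel or driver API.
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol


class NetDev(Protocol):
    """Stand-in for the vendor-agnostic 'struct net_device' contract."""
    name: str

    def xmit(self, frame: bytes) -> str: ...


@dataclass
class AuxDevice:
    """Device published for a switchdev port; payload is driver-defined."""
    match_name: str
    payload: dict


@dataclass
class FunctionNetDev:
    """Hypothetical vendor A: the payload is a whole function (SF == VF)."""
    name: str
    function_id: int

    def xmit(self, frame: bytes) -> str:
        return f"fn{self.function_id}:{len(frame)}"


@dataclass
class QueueBlockNetDev:
    """Hypothetical vendor B: the payload is only a block of queues."""
    name: str
    queues: List[int]

    def xmit(self, frame: bytes) -> str:
        # Pick a queue from the block; the policy is arbitrary here.
        q = self.queues[len(frame) % len(self.queues)]
        return f"q{q}:{len(frame)}"


# Each driver registers a probe keyed by the name it matches on.
DRIVERS: Dict[str, Callable[[AuxDevice], NetDev]] = {
    "vendor_a.sf": lambda d: FunctionNetDev("sf0", d.payload["function_id"]),
    "vendor_b.vmdq": lambda d: QueueBlockNetDev("vmdq0", d.payload["queues"]),
}


def bind(adev: AuxDevice) -> NetDev:
    """Select the driver by match name and let it build the netdev."""
    return DRIVERS[adev.match_name](adev)


def container_send(dev: NetDev, frame: bytes) -> str:
    """The container side only ever touches the generic interface."""
    return dev.xmit(frame)
```

container_send() never looks at the payload, so either driver can sit
under it; in this model that is the whole of the vendor agnostic
contract.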

Let me segue into a short story from RDMA. We've had a netdev called
IPoIB for a long time. It is actually kind of similar to this general
thing you are talking about, in that there is a programming layer
under the IPoIB netdev called RDMA verbs that generalizes the actual
HW. Over the years this became more complicated because every new
netdev offload needed mirroring into the RDMA verbs general
API. TSO, GSO, checksum offload, endlessly onwards. It became quite
dumb in the end. We gave up and said the HW driver should directly
implement netdev. Implementing a middle API layer makes zero sense
when netdev is already perfectly suited to implement on top of
HW. Removing SW layers caused performance to go up something like
2x.

The hard-earned lesson I take from that is don't put software layers
between a struct net_device and the actual HW. The closest coupling is
really the best thing. Provide library code in the kernel to help
drivers implement common patterns when making their netdevs, do not
provide wrapper netdevs around drivers.

IMHO the approach of macvlan acceleration made some sense in 2013, but
today I would say it is mashing unrelated layers together and
polluting what should be a pure SW implementation with HW hooks.

I see from the mailing list comments this was done because creating a
device specific netdev via 'ip link add' was rightly rejected. However
here we *can* create a device specific vmdq *auxdevice*.  This is OK
because the netdev is controlling and containing the aux device via
switchdev.

So, Intel can get the "VMDQ link type" that was originally desired more
or less directly, so long as the associated switchdev port controls
the MAC filter process, not "ip link add".

And if you want to make the vmdq auxdevice into an ADI by user DMA to
queues, then sure, that model is completely sane too (vs hacking up
macvlan to expose user queues) - so long as the kernel controls the
selection of traffic into those queues and follows the switchdev
model. I would recommend creating a simple RDMA raw ethernet queue
driver over the aux device for something like this :)

> That might be a bad example, I was thinking of the issues we have had
> with VFs and direct assignment to Qemu based guests in the past.

As described, this is solved by VDPA.

> Essentially what I am getting at is that the setup in the container
> should be vendor agnostic. The interface exposed shouldn't be specific
> to any one vendor. So if I want to fire up a container on Mellanox,
> Broadcom, or some other vendor it shouldn't matter or be visible to
> the user. They should just see a vendor agnostic subfunction
> netdevice.

Agree. The agnostic container user interface here is 'struct
net_device'.

> > I have the feeling this stuff you are asking for is already done..
> 
> The case you are describing has essentially solved it for Qemu
> virtualization and direct assignment. It still doesn't necessarily
> solve it for the container case though.

The container case doesn't need solving.

Any scheme I've heard for container live migration, like CRIU,
essentially hot plugs the entire kernel in/out of a user process. We
rely on the kernel providing low leakage of the implementation details
of the struct net_device as part of its uAPI contract. When CRIU
swaps the kernel the new kernel can have any implementation of the
container netdev it wants.

I've never heard of a use case to hot swap the implementation *under* a
netdev from a container. macvlan can't do this today. If you have a
use case here, it really has nothing to do with this series.

Jason