All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Saeed Mahameed <saeed@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Leon Romanovsky <leonro@nvidia.com>,
	Netdev <netdev@vger.kernel.org>, <linux-rdma@vger.kernel.org>,
	David Ahern <dsahern@kernel.org>,
	Jacob Keller <jacob.e.keller@intel.com>,
	Sridhar Samudrala <sridhar.samudrala@intel.com>,
	"Ertman, David M" <david.m.ertman@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Kiran Patil <kiran.patil@intel.com>,
	Greg KH <gregkh@linuxfoundation.org>
Subject: Re: [net-next v4 00/15] Add mlx5 subfunction support
Date: Wed, 16 Dec 2020 20:38:29 -0400	[thread overview]
Message-ID: <20201217003829.GN552508@nvidia.com> (raw)
In-Reply-To: <CAKgT0UfuSA9PdtR6ftcq0_JO48Yp4N2ggEMiX9zrXkK6tN4Pmw@mail.gmail.com>

On Wed, Dec 16, 2020 at 02:53:07PM -0800, Alexander Duyck wrote:
 
> It isn't about the association, it is about who is handling the
> traffic. Going back to the macvlan model what we did is we had a group
> of rings on the device that would automatically forward unicast
> packets to the macvlan interface and would be reserved for
> transmitting packets from the macvlan interface. We took care of
> multicast and broadcast replication in software.

Okay, maybe I'm starting to see where you are coming from.

First, I think some clarity here, as I see it the devlink
infrastructure is all about creating the auxdevice for a switchdev
port.

What goes into that auxdevice is *completely* up to the driver. mlx5
is doing a SF which == VF, but that is not a requirement of the design
at all.

If an Intel driver wants to put a queue block into the aux device and
that is != VF, it is just fine.

The Intel netdev that binds to the auxdevice can transform the queue
block and specific switchdev config into a netdev identical to
accelerated macvlan. Nothing about the breaks the switchdev model.

Essentially think of it as generalizing the acceleration plugin for a
netdev. Instead of making something specific to limited macvlan, the
driver gets to provide exactly the structure that matches its HW to
provide the netdev as the user side of the switchdev port. I see no
limitation here so long as the switchdev model for controlling traffic
is followed.

Let me segue into a short story from RDMA.. We've had a netdev called
IPoIB for a long time. It is actually kind of similar to this general
thing you are talking about, in that there is a programming layer
under the IPOIB netdev called RDMA verbs that generalizes the actual
HW. Over the years this became more complicated because every new
netdev offloaded needed mirroring into the RDMA verbs general
API. TSO, GSO, checksum offload, endlessly onwards. It became quite
dumb in the end. We gave up and said the HW driver should directly
implement netdev. Implementing a middle API layer makes zero sense
when netdev is already perfectly suited to implement ontop of
HW. Removing SW layers caused performance to go up something like
2x.

The hard earned lesson I take from that is don't put software layers
between a struct net_device and the actual HW. The closest coupling is
really the best thing. Provide libary code in the kernel to help
drivers implement common patterns when making their netdevs, do not
provide wrapper netdevs around drivers.

IMHO the approach of macvlan accleration made some sense in 2013, but
today I would say it is mashing unrelated layers together and
polluting what should be a pure SW implementation with HW hooks.

I see from the mailing list comments this was done because creating a
device specific netdev via 'ip link add' was rightly rejected. However
here we *can* create a device specific vmdq *auxdevice*.  This is OK
because the netdev is controlling and containing the aux device via
switchdev.

So, Intel can get the "VMDQ link type" that was originally desired more
or less directly, so long as the associated switchdev port controls
the MAC filter process, not "ip link add".

And if you want to make the vmdq auxdevice into an ADI by user DMA to
queues, then sure, that model is completely sane too (vs hacking up
macvlan to expose user queues) - so long as the kernel controls the
selection of traffic into those queues and follows the switchdev
model. I would recommend creating a simple RDMA raw ethernet queue
driver over the aux device for something like this :)

> That might be a bad example, I was thinking of the issues we have had
> with VFs and direct assignment to Qemu based guests in the past.

As described, this is solved by VDPA.

> Essentially what I am getting at is that the setup in the container
> should be vendor agnostic. The interface exposed shouldn't be specific
> to any one vendor. So if I want to fire up a container or Mellanox,
> Broadcom, or some other vendor it shouldn't matter or be visible to
> the user. They should just see a vendor agnostic subfunction
> netdevice.

Agree. The agnostic container user interface here is 'struct
net_device'.

> > I have the feeling this stuff you are asking for is already done..
> 
> The case you are describing has essentially solved it for Qemu
> virtualization and direct assignment. It still doesn't necessarily
> solve it for the container case though.

The container case doesn't need solving.

Any scheme I've heard for container live migration, like CRIU,
essentially hot plugs the entire kernel in/out of a user process. We
rely on the kernel providing low leakage of the implementation details
of the struct net_device as part of it's uAPI contract. When CRIU
swaps the kernel the new kernel can have any implementation of the
container netdev it wants.

I've never heard of a use case to hot swap the implemention *under* a
netdev from a container. macvlan can't do this today. If you have a
use case here, it really has nothing to do with with this series.

Jason

  reply	other threads:[~2020-12-17  0:39 UTC|newest]

Thread overview: 65+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 01/15] net/mlx5: Fix compilation warning for 32-bit platform Saeed Mahameed
2020-12-14 22:31   ` Alexander Duyck
2020-12-14 22:45     ` Saeed Mahameed
2020-12-15  4:59     ` Leon Romanovsky
2020-12-14 21:43 ` [net-next v4 02/15] devlink: Prepare code to fill multiple port function attributes Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 03/15] devlink: Introduce PCI SF port flavour and port attribute Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 04/15] devlink: Support add and delete devlink port Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 05/15] devlink: Support get and set state of port function Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 06/15] net/mlx5: Introduce vhca state event notifier Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 07/15] net/mlx5: SF, Add auxiliary device support Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 08/15] net/mlx5: SF, Add auxiliary device driver Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 09/15] net/mlx5: E-switch, Prepare eswitch to handle SF vport Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 10/15] net/mlx5: E-switch, Add eswitch helpers for " Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 11/15] net/mlx5: SF, Add port add delete functionality Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 12/15] net/mlx5: SF, Port function state change support Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 13/15] devlink: Add devlink port documentation Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 14/15] devlink: Extend devlink port documentation for subfunctions Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 15/15] net/mlx5: Add devlink subfunction port documentation Saeed Mahameed
2020-12-15  1:53 ` [net-next v4 00/15] Add mlx5 subfunction support Alexander Duyck
2020-12-15  2:44   ` David Ahern
2020-12-15 16:16     ` Alexander Duyck
2020-12-15 16:59       ` Parav Pandit
2020-12-15  5:48   ` Parav Pandit
2020-12-15 18:47     ` Alexander Duyck
2020-12-15 20:05       ` Saeed Mahameed
2020-12-15 21:03       ` Jason Gunthorpe
2020-12-16  1:12       ` Edwin Peer
2020-12-16  2:39         ` Jason Gunthorpe
2020-12-16  3:12         ` Alexander Duyck
2020-12-15 20:59     ` David Ahern
2020-12-15  6:15   ` Saeed Mahameed
2020-12-15 19:12     ` Alexander Duyck
2020-12-15 20:35       ` Saeed Mahameed
2020-12-15 21:28         ` Jakub Kicinski
2020-12-16  6:50           ` Leon Romanovsky
2020-12-16 17:59             ` Saeed Mahameed
2020-12-15 21:41         ` Alexander Duyck
2020-12-16  0:19           ` Jason Gunthorpe
2020-12-16  2:19             ` Alexander Duyck
2020-12-16  3:03               ` Jason Gunthorpe
2020-12-16  4:13                 ` Alexander Duyck
2020-12-16  4:45                   ` Parav Pandit
2020-12-16 13:33                   ` Jason Gunthorpe
2020-12-16 16:31                     ` Alexander Duyck
2020-12-16 17:51                       ` Jason Gunthorpe
2020-12-16 19:27                         ` Alexander Duyck
2020-12-16 20:35                           ` Jason Gunthorpe
2020-12-16 22:53                             ` Alexander Duyck
2020-12-17  0:38                               ` Jason Gunthorpe [this message]
2020-12-17 18:48                                 ` Alexander Duyck
2020-12-17 19:40                                   ` Jason Gunthorpe
2020-12-17 21:05                                     ` Alexander Duyck
2020-12-18  0:08                                       ` Jason Gunthorpe
2020-12-18  1:30                               ` David Ahern
2020-12-18  3:11                                 ` Alexander Duyck
2020-12-18  3:55                                   ` David Ahern
2020-12-18 15:54                                     ` Alexander Duyck
2020-12-18  5:20                                   ` Parav Pandit
2020-12-18  5:36                                     ` Parav Pandit
2020-12-18 16:01                                     ` Alexander Duyck
2020-12-18 18:01                                       ` Parav Pandit
2020-12-18 19:22                                         ` Alexander Duyck
2020-12-18 20:18                                           ` Jason Gunthorpe
2020-12-19  0:03                                             ` Alexander Duyck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201217003829.GN552508@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=alexander.duyck@gmail.com \
    --cc=dan.j.williams@intel.com \
    --cc=davem@davemloft.net \
    --cc=david.m.ertman@intel.com \
    --cc=dsahern@kernel.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=jacob.e.keller@intel.com \
    --cc=kiran.patil@intel.com \
    --cc=kuba@kernel.org \
    --cc=leonro@nvidia.com \
    --cc=linux-rdma@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=saeed@kernel.org \
    --cc=sridhar.samudrala@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.