All of lore.kernel.org
 help / color / mirror / Atom feed
From: Alexander Duyck <alexander.duyck@gmail.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Saeed Mahameed <saeed@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Leon Romanovsky <leonro@nvidia.com>,
	Netdev <netdev@vger.kernel.org>,
	linux-rdma@vger.kernel.org, David Ahern <dsahern@kernel.org>,
	Jacob Keller <jacob.e.keller@intel.com>,
	Sridhar Samudrala <sridhar.samudrala@intel.com>,
	"Ertman, David M" <david.m.ertman@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Kiran Patil <kiran.patil@intel.com>,
	Greg KH <gregkh@linuxfoundation.org>
Subject: Re: [net-next v4 00/15] Add mlx5 subfunction support
Date: Thu, 17 Dec 2020 13:05:03 -0800	[thread overview]
Message-ID: <CAKgT0Ue9+cd-Mp4qgusorDX1mnjfzMXrQvB2FqLaH+ouzVTMRQ@mail.gmail.com> (raw)
In-Reply-To: <20201217194035.GT552508@nvidia.com>

On Thu, Dec 17, 2020 at 11:40 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Dec 17, 2020 at 10:48:48AM -0800, Alexander Duyck wrote:
>
> > Just to clarify I am not with Intel, nor do I plan to work on any
> > Intel drivers related to this.
>
> Sure
>
> > I disagree here. In my mind a design where two interfaces, which both
> > exist in the kernel, have to go to hardware in order to communicate is
> > very limiting. The main thing I am wanting to see is the option of
> > being able to pass traffic directly between the switchdev and the SF
> > without the need to touch the hardware.
>
> I view the SW bypass path you are talking about similarly to
> GSO/etc. It should be accessed by the HW driver as an optional service
> provided by the core netdev, not implemented as some wrapper netdev
> around a HW implementation.

I view it as being something that would be a part of the switchdev API
itself. Basically the switchev and endpoint would need to be able to
control something like this because if XDP were enabled on one end or
the other you would need to be able to switch it off so that all of
the packets followed the same flow and could be scanned by the XDP
program.

> If you feel strongly it is needed then there is nothing standing in
> the way to implement it in the switchdev auxdevice model.
>
> It is simple enough, the HW driver's tx path would somehow detect
> east/west and queue it differently, and the rx path would somehow be
> able to mux in skbs from a SW queue. Not seeing any blockers here.

In my mind the simple proof of concept for this would be to check for
the multicast bit being set in the destination MAC address for packets
coming from the subfunction. If it is then shunt to this bypass route,
and if not then you transmit to the hardware queues. In the case of
packets coming from the switchdev port it would probably depend. The
part I am not sure about is if any packets need to be actually
transmitted to the hardware in the standard case for packets going
from the switchdev port to the subfunction. If there is no XDP or
anything like that present in the subfunction it probably wouldn't
matter and you could just shunt it straight across and bypass the
hardware, however if XDP is present you would need to get the packets
into the ring which would force the bypass to be turned off.

> > > model. I would recommend creating a simple RDMA raw ethernet queue
> > > driver over the aux device for something like this :)
> >
> > You lost me here, I'm not seeing how RDMA and macvlan are connected.
>
> RDMA is the standard uAPI to get a userspace HW DMA queue for ethernet
> packets.

Ah, I think you are talking about device assignment. In my mind I was
just talking about the interface assigned to the container which as
you have stated is basically just a netdev.

> > > > Essentially what I am getting at is that the setup in the container
> > > > should be vendor agnostic. The interface exposed shouldn't be specific
> > > > to any one vendor. So if I want to fire up a container or Mellanox,
> > > > Broadcom, or some other vendor it shouldn't matter or be visible to
> > > > the user. They should just see a vendor agnostic subfunction
> > > > netdevice.
> > >
> > > Agree. The agnostic container user interface here is 'struct
> > > net_device'.
> >
> > I disagree here. The fact is a mellanox netdev, versus a broadcom
> > netdev, versus an intel netdev all have a very different look at feel
> > as the netdev is essentially just the base device you are building
> > around.
>
> Then fix the lack of standardization of netdev implementations!

We're trying to work on that, but trying to fix it after the fact is
like herding cats.

> Adding more abstraction layers isn't going to fix that fundamental
> problem.
>
> Frankly it seems a bit absurd to complain that the very basic element
> of the common kernel uAPI - struct net_device - is so horribly
> fragmented and vendor polluted that we can't rely on it as a stable
> interface for containers.

The problem isn't necessarily the net_device it is more the
net_device_ops and the fact that there are so many different ways to
get things done. Arguably the flexibility of the netd_device is great
for allowing vendors to expose their features. However at the same
time it allows for features to be left out so what you end up with a
wide variety of things that are net_devices.

> Even if that is true, I don't belive for a second that adding a
> different HW abstraction layer is going to somehow undo the mistakes
> of the last 20 years.

It depends on how it is done. The general idea is to address the
biggest limitation that has occured, which is the fact that in many
cases we don't have software offloads to take care of things when the
hardware offloads provided by a certain piece of hardware are not
present. It would basically allow us to reset the feature set. If
something cannot be offloaded in software in a reasonable way, it is
not allowed to be present in the interface provided to a container.
That way instead of having to do all the custom configuration in the
container recipe it can be centralized to one container handling all
of the switching and hardware configuration.

> > Again, the hot-swap isn't necessarily what I am talking about. I am
> > talking about setting up a config for a set of containers in a
> > datacenter. What I don't want to do is have to have one set of configs
> > for an mlx5 SF, another for a broadcom SF, and yet another set for any
> > other vendors out there. I would much rather have all of that dealt
> > with within the namespace that is handling the switchdev setup.
>
> If there is real problems here then I very much encourage you to start
> an effort to push all the vendors to implement a consistent user
> experience for the HW netdevs.

To some extent that has been going on for some time. It is one of the
reasons why there is supposed to be software offloads for any datapath
features that get added to the hardware. Such as GSO to offset TSO.
However there always ends up the occasional thing that ends up getting
past and that is where the frustration comes in.

> I don't know what your issues are, but it sounds like it would be a
> very interesting conference presentation.
>
> But it has nothing to do with this series.
>
> Jason

There I disagree. Now I can agree that most of the series is about
presenting the aux device and that part I am fine with. However when
the aux device is a netdev and that netdev is being loaded into the
same kernel as the switchdev port is where the red flags start flying,
especially when we start talking about how it is the same as a VF.

In my mind we are talking about how the switchdev will behave and it
makes sense to see about defining if a east-west bypass makes sense
and how it could be implemented, rather than saying we won't bother
for now and potentially locking in the subfunction to virtual function
equality. In my mind we need more than just the increased count to
justify going to subfunctions, and I think being able to solve the
east-west problem at least in terms of containers would be such a
thing.

  reply	other threads:[~2020-12-17 21:06 UTC|newest]

Thread overview: 65+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-12-14 21:43 [net-next v4 00/15] Add mlx5 subfunction support Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 01/15] net/mlx5: Fix compilation warning for 32-bit platform Saeed Mahameed
2020-12-14 22:31   ` Alexander Duyck
2020-12-14 22:45     ` Saeed Mahameed
2020-12-15  4:59     ` Leon Romanovsky
2020-12-14 21:43 ` [net-next v4 02/15] devlink: Prepare code to fill multiple port function attributes Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 03/15] devlink: Introduce PCI SF port flavour and port attribute Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 04/15] devlink: Support add and delete devlink port Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 05/15] devlink: Support get and set state of port function Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 06/15] net/mlx5: Introduce vhca state event notifier Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 07/15] net/mlx5: SF, Add auxiliary device support Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 08/15] net/mlx5: SF, Add auxiliary device driver Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 09/15] net/mlx5: E-switch, Prepare eswitch to handle SF vport Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 10/15] net/mlx5: E-switch, Add eswitch helpers for " Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 11/15] net/mlx5: SF, Add port add delete functionality Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 12/15] net/mlx5: SF, Port function state change support Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 13/15] devlink: Add devlink port documentation Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 14/15] devlink: Extend devlink port documentation for subfunctions Saeed Mahameed
2020-12-14 21:43 ` [net-next v4 15/15] net/mlx5: Add devlink subfunction port documentation Saeed Mahameed
2020-12-15  1:53 ` [net-next v4 00/15] Add mlx5 subfunction support Alexander Duyck
2020-12-15  2:44   ` David Ahern
2020-12-15 16:16     ` Alexander Duyck
2020-12-15 16:59       ` Parav Pandit
2020-12-15  5:48   ` Parav Pandit
2020-12-15 18:47     ` Alexander Duyck
2020-12-15 20:05       ` Saeed Mahameed
2020-12-15 21:03       ` Jason Gunthorpe
2020-12-16  1:12       ` Edwin Peer
2020-12-16  2:39         ` Jason Gunthorpe
2020-12-16  3:12         ` Alexander Duyck
2020-12-15 20:59     ` David Ahern
2020-12-15  6:15   ` Saeed Mahameed
2020-12-15 19:12     ` Alexander Duyck
2020-12-15 20:35       ` Saeed Mahameed
2020-12-15 21:28         ` Jakub Kicinski
2020-12-16  6:50           ` Leon Romanovsky
2020-12-16 17:59             ` Saeed Mahameed
2020-12-15 21:41         ` Alexander Duyck
2020-12-16  0:19           ` Jason Gunthorpe
2020-12-16  2:19             ` Alexander Duyck
2020-12-16  3:03               ` Jason Gunthorpe
2020-12-16  4:13                 ` Alexander Duyck
2020-12-16  4:45                   ` Parav Pandit
2020-12-16 13:33                   ` Jason Gunthorpe
2020-12-16 16:31                     ` Alexander Duyck
2020-12-16 17:51                       ` Jason Gunthorpe
2020-12-16 19:27                         ` Alexander Duyck
2020-12-16 20:35                           ` Jason Gunthorpe
2020-12-16 22:53                             ` Alexander Duyck
2020-12-17  0:38                               ` Jason Gunthorpe
2020-12-17 18:48                                 ` Alexander Duyck
2020-12-17 19:40                                   ` Jason Gunthorpe
2020-12-17 21:05                                     ` Alexander Duyck [this message]
2020-12-18  0:08                                       ` Jason Gunthorpe
2020-12-18  1:30                               ` David Ahern
2020-12-18  3:11                                 ` Alexander Duyck
2020-12-18  3:55                                   ` David Ahern
2020-12-18 15:54                                     ` Alexander Duyck
2020-12-18  5:20                                   ` Parav Pandit
2020-12-18  5:36                                     ` Parav Pandit
2020-12-18 16:01                                     ` Alexander Duyck
2020-12-18 18:01                                       ` Parav Pandit
2020-12-18 19:22                                         ` Alexander Duyck
2020-12-18 20:18                                           ` Jason Gunthorpe
2020-12-19  0:03                                             ` Alexander Duyck

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAKgT0Ue9+cd-Mp4qgusorDX1mnjfzMXrQvB2FqLaH+ouzVTMRQ@mail.gmail.com \
    --to=alexander.duyck@gmail.com \
    --cc=dan.j.williams@intel.com \
    --cc=davem@davemloft.net \
    --cc=david.m.ertman@intel.com \
    --cc=dsahern@kernel.org \
    --cc=gregkh@linuxfoundation.org \
    --cc=jacob.e.keller@intel.com \
    --cc=jgg@nvidia.com \
    --cc=kiran.patil@intel.com \
    --cc=kuba@kernel.org \
    --cc=leonro@nvidia.com \
    --cc=linux-rdma@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=saeed@kernel.org \
    --cc=sridhar.samudrala@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.