From: Alexander Duyck <alexander.duyck@gmail.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Saeed Mahameed <saeed@kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Leon Romanovsky <leonro@nvidia.com>,
	Netdev <netdev@vger.kernel.org>,
	linux-rdma@vger.kernel.org, David Ahern <dsahern@kernel.org>,
	Jacob Keller <jacob.e.keller@intel.com>,
	Sridhar Samudrala <sridhar.samudrala@intel.com>,
	"Ertman, David M" <david.m.ertman@intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Kiran Patil <kiran.patil@intel.com>,
	Greg KH <gregkh@linuxfoundation.org>
Subject: Re: [net-next v4 00/15] Add mlx5 subfunction support
Date: Wed, 16 Dec 2020 08:31:44 -0800
Message-ID: <CAKgT0UcRfB8a61rSWW-NPdbGh3VcX_=LCZ5J+-YjqYNtm+RhVg@mail.gmail.com>
In-Reply-To: <20201216133309.GI552508@nvidia.com>

On Wed, Dec 16, 2020 at 5:33 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Dec 15, 2020 at 08:13:21PM -0800, Alexander Duyck wrote:
>
> > > > Ugh, don't get me started on switchdev. The biggest issue as I see it
> > > > with switchdev is that you have to have a true switch in order to
> > > > really be able to use it.
> > >
> > > That cuts both ways, suggesting HW with a true switch model itself
> > > with VMDq is equally problematic.
> >
> > Yes and no. For example the macvlan offload I had setup could be
> > configured both ways and it made use of VMDq. I'm not necessarily
> > arguing that we need to do VMDq here, however at the same time saying
> > that this is only meant to replace SR-IOV becomes problematic since we
> > already have SR-IOV so why replace it with something that has many of
> > the same limitations?
>
> Why? Because SR-IOV is the *only* option for many use cases. Still. I
> said this already, something more generic does not magically eliminate
> SR-IOV.
>
> The SIOV ADI model is a small refinement to the existing VF scheme, it
> is completely parallel to making more generic things.
>
> It is not "repeating mistakes" it is accepting the limitations of
> SR-IOV because benefits exist and applications need those benefits.

If we have two interfaces, both with pretty much the same limitations
then many would view it as "repeating mistakes". The fact is we
already have SR-IOV. Why introduce yet another interface that has the
same functionality?

You say this will scale better but I am not even sure about that. The
fact is SR-IOV could scale to 256 VFs, but for networking I kind of
doubt the limitation would have been the bus number. More likely it
would be issues with packet replication and PCIe throughput,
especially when you start dealing with east-west traffic within the
same system.

> > That said I understand your argument, however I view the elimination
> > of SR-IOV to be something we do after we get this interface right and
> > can justify doing so.
>
> Elimination of SR-IOV isn't even a goal here!

Sorry, you used the word "replace", and my assumption here was that the
goal is to get something in place that can take the place of SR-IOV so
that you wouldn't be maintaining the two systems at the same time.
That is my concern as I don't want us having SR-IOV, and then several
flavors of SIOV. We need to decide on one thing that will be the way
forward.

> > Also it might be useful to call out the flavours and planned flavours
> > in the cover page. Admittedly the description is somewhat lacking in
> > that regard.
>
> This is more of a general switchdev remark though. In the switchdev
> model you have the switch and a switch port. Each port has a
> switchdev representor on the switch side and a "user port" of some
> kind.
>
> It can be a physical thing:
>  - SFP
>  - QSFP
>  - WiFi Antennae
>
> It could be a semi-physical thing outside the view of the kernel:
>  - SmartNIC VF/SF attached to another CPU
>
> It can be a semi-physical thing in view of this kernel:
>  - SRIOV VF (struct pci device)
>  - SF (struct aux device)
>
> It could be a SW construct in this kernel:
>  - netdev (struct net device)
>
> *all* of these different port types are needed. Probably more down the
> road!
>
> Notice I don't have VDPA, VF/SF netdev, or virtio-mdev as a "user
> port" type here. Instead creating the user port pci or aux device
> allows the user to use the Linux driver model to control what happens
> to the pci/aux device next.

I get that. That is why I said switchdev isn't a standard for the
endpoint. One of the biggest issues with SR-IOV that I have seen is
the fact that the last piece isn't really defined. We never did a good
job of defining how the ADI should look to the guest and as a result
it kind of stalled in adoption.

> > I would argue that is one of the reasons why this keeps being
> > compared to either VMDq or VMQ as it is something that SR-IOV has
> > yet to fully replace and has many features that would be useful in
> > an interface that is a subpartition of an existing interface.
>
> In what sense does switchdev and a VF not fully replace macvlan VMDq?

One of the biggest is east-west traffic. You quickly run up against
the PCIe bandwidth bottleneck and then the performance tanks. I have
seen a number of cases where peer-to-peer on the same host swamps the
network interface.

> > The Intel drivers still have the macvlan as the assignable ADI and
> > make use of VMDq to enable it.
>
> Is this in-tree or only in the proprietary driver? AFAIK there is no
> in-tree way to extract the DMA queue from the macvlan netdev into
> userspace..
>
> Remember all this VF/SF/VDPA stuff results in a HW dataplane, not a SW
> one. It doesn't really make sense to compare a SW dataplane to a HW
> one. HW dataplanes come with limitations and require special driver
> code.

I get that. At the same time we can mask some of those limitations by
allowing for the backend to be somewhat abstract so you have the
possibility of augmenting the hardware dataplane with a software one
if needed.

> > The limitation as I see it is that the macvlan interface doesn't allow
> > for much in the way of custom offloads and the Intel hardware doesn't
> > support switchdev. As such it is good for a basic interface, but
> > doesn't really do well in terms of supporting advanced vendor-specific
> > features.
>
> I don't know what it is that prevents Intel from modeling their
> selector HW in switchdev, but I think it is on them to work with the
> switchdev folks to figure something out.

They tried for the ixgbe and i40e. The problem is the hardware
couldn't conform to what was asked for if I recall. It has been a few
years since I worked in the Ethernet group at Intel so I don't recall
the exact details.

> I'm a bit surprised HW that can do macvlan can't be modeled with
> switchdev? What is missing?

If I recall it was the fact that the hardware defaults to transmitting
everything that doesn't match an existing rule to the external port
unless it comes from the external port.
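
As a rough illustration of that default, here is a toy Python model of
the forwarding decision. It is only a sketch of the behaviour as I
recall it: the names (EXTERNAL_PORT, forward, the rules table keyed by
destination MAC) are made up for illustration and do not come from any
driver, and dropping unmatched frames that arrive from the external
port is an assumption, since only the transmit default is described
above.

```python
from typing import Dict, Optional

# Hypothetical label for the uplink (wire-facing) port.
EXTERNAL_PORT = "external"


def forward(src_port: str, dst_mac: str, rules: Dict[str, str]) -> Optional[str]:
    """Return the egress port for a frame, or None if it is dropped.

    rules maps a destination MAC to the port that owns it.
    """
    # A frame that matches an existing rule goes where the rule says.
    if dst_mac in rules:
        return rules[dst_mac]
    # No match: anything that did not come from the wire is sent out
    # the external port by default.
    if src_port != EXTERNAL_PORT:
        return EXTERNAL_PORT
    # No match and it came from the wire: assumed dropped (assumption,
    # not stated in the discussion).
    return None
```

The point of the sketch is that the miss path is fixed in hardware
(out the external port) rather than being something software gets to
decide.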

> > > That is goal here. This is not about creating just a netdev, this is
> > > about the whole kit: rdma, netdev, vdpa virtio-net, virtio-mdev.
> >
> > One issue is right now we are only seeing the rdma and netdev. It is
> > kind of backwards as it is using the ADIs on the host when this was
> > really meant to be used for things like mdev.
>
> This is the second 15-patch series on this path already. It is not
> possible to pack every single thing into this series. This is the
> micro step of introducing the SF idea and using SF==VF to show how the
> driver stack works. The minimal changes to the existing drivers
> imply this can support an ADI as well.
>
> Further, this does already show an ADI! vdpa_mlx5 will run on the
> VF/SF and eventually causes qemu to build a virtio-net ADI that
> directly passes HW DMA rings into the guest.
>
> Isn't this exactly the kind of generic SRIOV replacement option you
> have been asking for? Doesn't this completely supersede stuff built on
> macvlan?

Something like the vdpa model is more like what I had in mind. However,
vdpa only works for the userspace networking case.

Basically the idea is to have an assignable device interface that
isn't directly tied to the hardware. Instead it is making use of a
slice of it and referencing the PF as the parent leaving the PF as the
owner of the slice. Then at some point in the future we could make
changes to allow for software to step in and do some switching if
needed. The key bit is the abstraction of the assignable interface so
that it is vendor agnostic and could be switched over to pure software
backing if needed.
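
To make the shape of that abstraction a bit more concrete, here is a
minimal Python sketch. Everything in it is hypothetical: the class
names (SliceBackend, HwQueueBackend, SwBackend, AssignableInterface)
and the idea of returning a string describing the path taken are
purely illustrative and do not correspond to any existing kernel API.
It only shows the one property I care about, which is that the
consumer keeps the same interface while the backing is swapped.

```python
from abc import ABC, abstractmethod


class SliceBackend(ABC):
    """Hypothetical backend that carries traffic for one slice of a PF."""

    @abstractmethod
    def xmit(self, frame: bytes) -> str:
        """Send a frame and return a label for the path it took."""


class HwQueueBackend(SliceBackend):
    """Backed by a hardware queue that is still owned by the parent PF."""

    def __init__(self, parent_pf: str, queue_id: int) -> None:
        self.parent_pf = parent_pf
        self.queue_id = queue_id

    def xmit(self, frame: bytes) -> str:
        return f"hw:{self.parent_pf}:q{self.queue_id}"


class SwBackend(SliceBackend):
    """Pure software backing, e.g. when software has to do the switching."""

    def xmit(self, frame: bytes) -> str:
        return "sw"


class AssignableInterface:
    """Vendor-agnostic interface handed to the guest or container."""

    def __init__(self, name: str, backend: SliceBackend) -> None:
        self.name = name
        self.backend = backend

    def switch_backend(self, backend: SliceBackend) -> None:
        # The consumer keeps the same interface; only the backing changes.
        self.backend = backend

    def xmit(self, frame: bytes) -> str:
        return self.backend.xmit(frame)
```

For example, an interface created with HwQueueBackend("pf0", 3) could
later be handed SwBackend() via switch_backend() without the consumer
having to give the interface up.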

> > expected to work. The switchdev API puts some restrictions in place
> > but there still ends up being parts without any definition.
>
> I'm curious what you see as needing definition here?
>
> In the SRIOV model the HW register programming API is device
> specific.
>
> The switchdev model is: no matter what HW register programming is done
> on the VF/SF all the packets tx/rx'd will flow through the switchdev.
>
> The purpose of switchdev/SRIOV/SIOV has never been to define a single
> "one register set to rule them all".
>
> That is the area that VDPA virtio-net and others are covering.

That is fine and that covers it for direct assigned devices. However
that doesn't cover the container case. My thought is if we are going
to partition a PF into multiple netdevices we should have some generic
interface that can be provided to represent the netdevs so that if
they are pushed into containers you don't have to rip them out if for
some reason you need to change the network configuration. For the
Intel NICs we did that with macvlan in the VMDq case. I see no reason
why you couldn't do something like that here with the subfunction
case.
