* [RFC v2] current devlink extension plan for NICs
From: Jiri Pirko @ 2020-05-01  9:14 UTC
  To: netdev
  Cc: davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, sridhar.samudrala

Hi all.

First, I would like to apologize for the very long email. But I think it
would be beneficial to see the whole picture of where we are going.

Currently we are working internally on several features that need
extensions to the current devlink infrastructure. I took a stab
at putting it all together in a single txt file, inlined below.

Most of the stuff is based on a new port sub-object called "func"
(called "slice" previously and "subdev" originally in Yuval's patchsets
sent a while ago).

The text describes how things should behave and provides a draft
of user facing console input/outputs. I think it is important to clear
that up before we go in and implement the devlink core and
driver pieces.

I would like to ask you to read this and comment. Especially, I would
like to ask vendors if what is described fits the needs of your
NIC/e-switch.

Please note that some of this is already implemented, but most of it
isn't (see "what needs to be implemented" section).

v1->v2
- mainly move from separate slice object into port/func subobject
- couple of small fixes here and there




==================================================================
||                                                              ||
||            Overall illustration of example setup             ||
||                                                              ||
==================================================================

Note that there are 2 hosts in the picture. Host A may be the smartnic host,
Host B may be one of the hosts which gets a PF. Also, you might omit
Host B and just see Host A as an ordinary NIC in a host.

Note that the PF is merged with the physical port representor.
This allows a simpler, seamless transition from legacy mode and back:
the devlink_ports and netdevs for physical ports stay in place during
the transition.

                        +-----------+
                        |phys port 2+-----------------------------------+
                        +-----------+                                   |
                        +-----------+                                   |
                        |phys port 1+---------------------------------+ |
                        +-----------+                                 | |
                                                                      | |
+------------------------------------------------------------------+  | |
|  devlink instance for the whole ASIC                   HOST A    |  | |
|                                                                  |  | |
|  pci/0000:06:00.0  (devlink dev)                                 |  | |
|  +->health reporters, params, info, sb, dpipe,                   |  | |
|  |  resource, traps                                              |  | |
|  |                                                               |  | |
|  +-+port_pci/0000:06:00.0/0+-------------------------------------|--+ |
|  | |  flavour physical pfnum 0  (phys port and pf)               |    |
|  | |  netdev enp6s0f0np1                                         |    |
|  | +->health reporters, params                                   |    |
|  |                                                               |    |
|  +-+port_pci/0000:06:00.0/1+-------------------------------------|----+
|  | |  flavour physical pfnum 1  (phys port and pf)               |
|  | |  netdev enp6s0f0np2                                         |
|  | +->health reporters, params                                   |
|  |                                                               |
|  +-+port_pci/0000:06:00.0/2+---------------------------------+   |
|  | |  flavour pcipf pfnum 2                                  |   |
|  | |  netdev enp6s0f0pf2                                     |   |
|  | +->params                                                 |   |
|  | +->func                                                   |   |
|  |                                                           |   |
|  +-+port_pci/0000:06:00.0/3+------------------------------+  |   |
|  | |  flavour pcivf pfnum 2 vfnum 0                       |  |   |
|  | |  netdev enp6s0pf2vf0                                 |  |   |
|  | +->params                                              |  |   |
|  | +-+func                                                |  |   |
|  |   +->state, rate (qos), mpgroup, hw_addr               |  |   |
|  |                                                        |  |   |
|  +-+port_pci/0000:06:00.0/4+---------------------------+  |  |   |
|  | |  flavour pcivf pfnum 0 vfnum 0                    |  |  |   |
|  | |  netdev enp6s0pf0vf0                              |  |  |   |
|  | +->params                                           |  |  |   |
|  | +-+func                                             |  |  |   |
|  |   +->state, rate (qos), mpgroup, hw_addr            |  |  |   |
|  |                                                     |  |  |   |
|  +-+port_pci/0000:06:00.0/5+------------------------+  |  |  |   |
|  | |  flavour pcisf pfnum 0 sfnum 1                 |  |  |  |   |
|  | |  netdev enp6s0pf0sf1                           |  |  |  |   |
|  | +->params                                        |  |  |  |   |
|  | +-+func                                          |  |  |  |   |
|  |   +->state, rate (qos), mpgroup, hw_addr         |  |  |  |   |
|  |                                                  |  |  |  |   |
|  +-+port_pci/0000:06:00.0/6+---------------------+  |  |  |  |   |
|    |  flavour pcivf pfnum 0 vfnum 1              |  |  |  |  |   |
|    |        (non-ethernet (IB, NVME))            |  |  |  |  |   |
|    +-+func                                       |  |  |  |  |   |
|      +->state                                    |  |  |  |  |   |
|                                                  |  |  |  |  |   |
+------------------------------------------------------------------+
                                                   |  |  |  |  |
                                                   |  |  |  |  |
                                                   |  |  |  |  |
+----------------------------------------------+   |  |  |  |  |
|  devlink instance PF (other host)    HOST B  |   |  |  |  |  |
|                                              |   |  |  |  |  |
|  pci/0000:10:00.0  (devlink dev)             |   |  |  |  |  |
|  +->health reporters, info                   |   |  |  |  |  |
|  |                                           |   |  |  |  |  |
|  +-+port_pci/0000:10:00.0/1+---------------------------------+
|    |  flavour virtual                        |   |  |  |  |
|    |  netdev enp16s0f0                       |   |  |  |  |
|    +->health reporters                       |   |  |  |  |
|                                              |   |  |  |  |
+----------------------------------------------+   |  |  |  |
                                                   |  |  |  |
+----------------------------------------------+   |  |  |  |
|  devlink instance VF (other host)    HOST B  |   |  |  |  |
|                                              |   |  |  |  |
|  pci/0000:10:00.1  (devlink dev)             |   |  |  |  |
|  +->health reporters, info                   |   |  |  |  |
|  |                                           |   |  |  |  |
|  +-+port_pci/0000:10:00.1/1+------------------------------+
|    |  flavour virtual                        |   |  |  |
|    |  netdev enp16s0f0v0                     |   |  |  |
|    +->health reporters                       |   |  |  |
|                                              |   |  |  |
+----------------------------------------------+   |  |  |
                                                   |  |  |
+----------------------------------------------+   |  |  |
|  devlink instance VF                 HOST A  |   |  |  |
|                                              |   |  |  |
|  pci/0000:06:00.1  (devlink dev)             |   |  |  |
|  +->health reporters, info                   |   |  |  |
|  |                                           |   |  |  |
|  +-+port_pci/0000:06:00.1/1+---------------------------+
|    |  flavour virtual                        |   |  |
|    |  netdev enp6s0f0v0                      |   |  |
|    +->health reporters                       |   |  |
|                                              |   |  |
+----------------------------------------------+   |  |
                                                   |  |
+----------------------------------------------+   |  |
|  devlink instance SF                 HOST A  |   |  |
|                                              |   |  |
|  pci/0000:06:00.0%sf1    (devlink dev)       |   |  |
|  +->health reporters, info                   |   |  |
|  |                                           |   |  |
|  +-+port_pci/0000:06:00.0%sf1/1+--------------------+
|    |  flavour virtual                        |   |
|    |  netdev enp6s0f0s1                      |   |
|    +->health reporters                       |   |
|                                              |   |
+----------------------------------------------+   |
                                                   |
+----------------------------------------------+   |
|  devlink instance VF                 HOST A  |   |
|                                              |   |
|  pci/0000:06:00.2  (devlink dev)+----------------+
|  +->health reporters, info                   |
|                                              |
+----------------------------------------------+




==================================================================
||                                                              ||
||                 what needs to be implemented                 ||
||                                                              ||
==================================================================

1) physical port "pfnum". When PF and physical port representor
   are merged, the instance of devlink_port representing the physical port
   and PF needs to have "pfnum" attribute to be in sync
   with other PF port representors.

2) per-port health reporters are not implemented yet.

3) devlink_port instance in PF/VF/SF flavour "virtual". In a PF/VF/SF devlink
   instance (in a VM for example), it would make sense to have a devlink_port
   instance, at least to carry a link to the netdevice name (otherwise we have
   no easy way to find out which devlink instance and netdevice belong together).
   I was thinking about the flavour name; we have to distinguish it from the
   eswitch devlink port flavours "pcipf, pcivf, pcisf".

   This was recently implemented by Parav:
commit 0a303214f8cb8e2043a03f7b727dba620e07e68d
Merge: c04d102ba56e 162add8cbae4
Author: David S. Miller <davem@davemloft.net>
Date:   Tue Mar 3 15:40:40 2020 -0800

    Merge branch 'devlink-virtual-port'

   What is missing is the "virtual" flavour for nested PF.

4) port func is not implemented yet. This is the original "vdev/subdev" concept.
   See below section "Port func user cmdline API draft".

5) SFs are not implemented.
   See below section "SF (subfunction) user cmdline API draft".

6) rate for port func is not implemented yet.
   See below section "Port func rate user cmdline API draft".

7) mpgroup for port func is not implemented yet.
   See below section "Port func mpgroup user cmdline API draft".

8) VF manual creation using devlink is not implemented yet.
   See below section "VF manual creation and activation user cmdline API draft".
 
9) PF aliasing. One devlink instance and multiple PFs sharing it as they have one
   merged e-switch.

10) Exposing maximum number of SF ports as devlink resource.            

11) Configuring more port/func capabilities (netdevice, rdma device,
    nested eswitch) and resources (irq, queues, pages). 



==================================================================
||                                                              ||
||                  Issues/open questions                       ||
||                                                              ||
==================================================================

1) "pfnum" has to be per-asic(/devlink instance), not per-host.
   That means that in smartNIC scenario, we cannot have "pfnum 0"
   for smartNIC and "pfnum 0" for host as well.
   
2) Q: for TX, RX queue reporters, should they be bound to devlink_port?
   For which flavours might this make sense?
   Most probably for flavours "physical"/"virtual".
   How about the reporters in VF/SF?

3) How is the management of a nested switch handled? The PFs created dynamically
   or the ones in hosts in the smartnic scenario may each themselves be a manager
   of a nested e-switch. How to toggle this capability?
   During creation by a cmdline option?
   During lifetime in case the PF does not have any children (VFs/SFs)?

   It seems to make sense to have it configurable as a port/func attribute
   for PF port/func objects. The user would set this before activating
   the func.



==================================================================
||                                                              ||
||             Port func user cmdline API draft                 ||
||                                                              ||
==================================================================

Note that some of the "devlink port" attributes may be forgotten or misordered.

Funcs are created as sub-objects of ports where it makes sense to have them.
The driver takes care of that. The "func" is a handle to configure "the other
side of the wire". The original port object has port-level properties;
the new "func" sub-object, on the other hand, has device-level properties.

This is an example for Host A from the illustration above:

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcipf pfnum 2 type eth netdev enp6s0pf2
                    func: hw_addr 10:22:33:44:55:66 state active
pci/0000:06:00.0/3: flavour pcivf pfnum 2 vfnum 0 type eth netdev enp6s0pf2vf0
                    func: hw_addr 10:22:33:44:55:77 state active
pci/0000:06:00.0/4: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
                    func: hw_addr 10:22:33:44:55:88 state active
pci/0000:06:00.0/5: flavour pcisf pfnum 0 sfnum 1 type eth netdev enp6s0pf0sf1
                    func: hw_addr 10:22:33:44:55:99 state active
pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2 type nvme
                    func: state active

There is a fixed "state" attribute with value "active". This is the
default, as the VFs are always created active. In the future, it is planned
to implement manual VF creation and activation, similar to what is
described below for SFs.
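
For tooling that wants to consume the listing above, here is a minimal
Python sketch that parses the draft "devlink port show" text into
dictionaries. It assumes the draft layout shown in this mail (one
"handle: key value ..." line per port, optionally followed by an indented
"func: key value ..." line); the function name parse_port_show is
hypothetical and the real output format may end up different.

```python
def parse_port_show(text):
    """Parse the draft 'devlink port show' text layout into a dict.

    Assumption: every attribute is a 'key value' pair and an indented
    'func:' line belongs to the port line printed just before it.
    """
    ports = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("func:"):
            if current is None:
                raise ValueError("func line without a preceding port line")
            tokens = line[len("func:"):].split()
            ports[current]["func"] = dict(zip(tokens[0::2], tokens[1::2]))
        else:
            # The handle ends at the first ": " (colon followed by space);
            # colons inside the PCI address are not followed by a space.
            handle, _, rest = line.partition(": ")
            tokens = rest.split()
            ports[handle] = dict(zip(tokens[0::2], tokens[1::2]))
            current = handle
    return ports
```

All values are kept as strings, so "vfnum 0" becomes {"vfnum": "0"}.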

Now set a different MAC address for VF0 on PF2 (port index 3):

$ devlink port func set pci/0000:06:00.0/3 hw_addr aa:bb:cc:dd:ee:ff

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcipf pfnum 2 type eth netdev enp6s0pf2
                    func: hw_addr 10:22:33:44:55:66 state active
pci/0000:06:00.0/3: flavour pcivf pfnum 2 vfnum 0 type eth netdev enp6s0pf2vf0
                    func: hw_addr aa:bb:cc:dd:ee:ff state active
pci/0000:06:00.0/4: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
                    func: hw_addr 10:22:33:44:55:88 state active
pci/0000:06:00.0/5: flavour pcisf pfnum 0 sfnum 1 type eth netdev enp6s0pf0sf1
                    func: hw_addr 10:22:33:44:55:99 state active
pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2 type nvme
                    func: state active



==================================================================
||                                                              ||
||          SF (subfunction) user cmdline API draft             ||
||                                                              ||
==================================================================

Note that some of the "devlink port" attributes may be forgotten,
misordered or omitted on purpose.

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
                    func: hw_addr 10:22:33:44:55:66 state active

There is one VF on the NIC.

Now create subfunction SF10 on PF1; the index of the port is going to be 100:

$ devlink port add pci/0000:06:00.0/100 flavour pcisf pfnum 1 sfnum 10

The devlink kernel code calls down to device driver (devlink op) and asks
it to create a SF port with particular attributes. Driver then instantiates
the SF port in the same way it is done for VF.


Note that it may be possible to avoid passing the port index and let the
kernel assign the index for you:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 1 sfnum 10

This would work in a similar way as devlink region id assignment that
is being pushed now.


Set the func hw_addr to aa:bb:cc:aa:bb:cc:

$ devlink port func set pci/0000:06:00.0/100 hw_addr aa:bb:cc:aa:bb:cc

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
    func: hw_addr 10:22:33:44:55:66 state active
pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 type eth netdev enp6s0pf1sf10
    func: hw_addr aa:bb:cc:aa:bb:cc state inactive


Note that the SF port is created but not active. That means the
entities are created on the devlink side and the e-switch port representor
is created, but the SF device itself is not yet out there (on the same host
or a different one, depending on where the parent PF is; in this case the
same host). The user might use the e-switch port representor enp6s0pf1sf10
to do settings, put it into a bridge, add TC rules, etc.
It's like the cable is unplugged on the other side.


Now we activate (deploy) the SF port/func:
$ devlink port func set pci/0000:06:00.0/100 state active

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
    func: hw_addr 10:22:33:44:55:66 state active
pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 type eth netdev enp6s0pf1sf10
    func: hw_addr aa:bb:cc:aa:bb:cc state active

Upon activation, the device driver asks the device to instantiate
the actual SF device on the particular PF. It does not matter whether that
is on the same host or not.

On the other side, the PF driver instance gets an event from the device
that the particular SF was activated. That is the cue to put the device on
the bus, probe it, and instantiate netdev and devlink instances for it.

For every SF a device is created on virtbus with an ID assigned by the
virtbus code. For example:
/sys/bus/virtbus/devices/mlx5_sf.1

$ cat /sys/bus/virtbus/devices/mlx5_sf.1/sfnum
10

/sys/bus/virtbus/devices/mlx5_sf.1 is a symlink to:
../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/mlx5_sf.1

New devlink instance is named using alias:
$ devlink dev show
pci/0000:06:00.0%sf10

$ devlink port show
pci/0000:06:00.0%sf10/0: flavour virtual type eth netdev enp6s0f0s10

You can see that udev used the sysfs files and the symlink to assemble the
netdev name.

Note that this kind of aliasing is not implemented. It needs to be done in
the devlink code in the kernel. During SF devlink instance creation, the
parent PF device pointer and the sfnum should be passed in, and the alias
dev_name is assembled from them. This ensures persistent naming that is
consistent in both the smartnic and host use cases.
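
The alias assembly described above can be illustrated with a small Python
sketch. It only mirrors the "<parent handle>%sf<sfnum>" string format used
in the examples in this mail; the helper names sf_devlink_alias and
split_sf_alias are hypothetical, and the actual kernel-side implementation
does not exist yet.

```python
def sf_devlink_alias(parent_handle, sfnum):
    """Build the SF devlink handle from the parent PF handle and sfnum.

    Assumption: the alias is '<parent handle>%sf<sfnum>', as in the
    'pci/0000:06:00.0%sf10' example.
    """
    if "%" in parent_handle:
        raise ValueError("parent handle must not already be an alias")
    if sfnum < 0:
        raise ValueError("sfnum must be non-negative")
    return f"{parent_handle}%sf{sfnum}"


def split_sf_alias(handle):
    """Split an SF alias back into (parent handle, sfnum)."""
    parent, sep, suffix = handle.partition("%sf")
    if not sep or not suffix.isdigit():
        raise ValueError(f"not an SF alias: {handle!r}")
    return parent, int(suffix)
```

Because the alias depends only on the parent PF name and the sfnum, the
same SF gets the same handle regardless of the virtbus ID it was assigned.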

If the user on the smartnic or host does not want the virtbus device to get
probed automatically (for any reason), autoprobe can be disabled by:

$ echo "0" > /sys/bus/virtbus/drivers_autoprobe

Autoprobe is enabled by default.


User can deactivate the SF port/func by:

$ devlink port func set pci/0000:06:00.0/100 state inactive

This eventually leads to event delivered to PF driver, which is a
cue to remove the SF device from virtbus and remove all related devlink
and netdev instances.

The port/func may be activated again.

Now for the teardown process: the user might remove the SF port
right away, without deactivation, or remove an already deactivated SF.
To remove the SF, the user should do:

$ devlink port del pci/0000:06:00.0/100

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
    func: hw_addr 10:22:33:44:55:66 state active



==================================================================
||                                                              ||
||   VF manual creation and activation user cmdline API draft   ||
||                                                              ||
==================================================================

To enter manual mode, the user has to turn off VF dummies creation:
$ devlink dev set pci/0000:06:00.0 vf_dummies disabled
$ devlink dev show
pci/0000:06:00.0: vf_dummies disabled

It is "enabled" by default in order not to break existing users.

By setting the "vf_dummies" attribute to "disabled", the driver
removes all dummy VFs. Only physical ports are present:

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2

Then the user is able to create them in a similar way as SFs:

$ devlink port add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8

The devlink kernel code calls down to device driver (devlink op) and asks
it to create a VF port with particular attributes. Driver then instantiates
the VF port with func.

Set the func hw_address to aa:bb:cc:aa:bb:cc:
$ devlink port func set pci/0000:06:00.0/99 hw_addr aa:bb:cc:aa:bb:cc

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 type eth netdev enp6s0pf1vf8
    func: hw_addr aa:bb:cc:aa:bb:cc state inactive

Now we activate (deploy) the VF:
$ devlink port func set pci/0000:06:00.0/99 state active

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 type eth netdev enp6s0pf1vf8
    func: hw_addr aa:bb:cc:aa:bb:cc state active



==================================================================
||                                                              ||
||                             PFs                              ||
||                                                              ||
==================================================================

There are 2 flavours of PFs:
1) Parent PF. That is coupled with uplink port. The flavour is:
    a) "physical" - in case the uplink port is actual port in the NIC.
    b) "virtual" - in case this Parent PF is actually a leg to
       upstream embedded switch.

   $ devlink port show
   pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1

   If there is another parent PF, say "0000:06:00.1", that shares the
   same embedded switch, aliasing is established for the devlink handles.

   The user can use devlink handles:
   pci/0000:06:00.0
   pci/0000:06:00.1
   as equivalents, pointing to the same devlink instance.

   Parent PFs are the ones that may be in control of managing
   embedded switch, on any hierarchy level.

2) Child PF. This is a leg of a PF put to the parent PF. It is
   represented by a port with a netdevice and func:

   $ devlink port show
   pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
   pci/0000:06:00.0/1: flavour pcipf pfnum 2 type eth netdev enp6s0f0pf2
       func: hw_addr aa:bb:cc:aa:bb:87 state active

   This is a typical smartnic scenario. You would see this list on
   the smartnic CPU. The port pci/0000:06:00.0/1 is a leg to
   one of the hosts. If you send packets to enp6s0f0pf2, they will
   go to the child PF.

   Note that inside the host, the PF is represented again as "Parent PF"
   and may be used to configure nested embedded switch.



==================================================================
||                                                              ||
||           Port func operational state extension              ||
||                                                              ||
==================================================================

In addition to the "state" attribute, which serves the purpose
of setting the "admin state", an "opstate" attribute is added to
reflect the operational state of the port/func:


    opstate        description
    -------        -----------
    1. attached    State when the port/func device is bound to the host
                   driver. When the func device is unbound from the
                   host driver, the func device exits this state and
                   enters the detaching state.

    2. detaching   State when the host is notified to deactivate the
                   func device and the func device may be undergoing
                   detachment from the host driver. When the func device
                   is fully detached from the host driver, the func exits
                   this state and enters the detached state.

    3. detached    State when the func device is fully unbound from the
                   host driver.

func state machine:
-------------------
                               func state set inactive
                              ----<------------------<---
                              | or port delete          |
                              |                         |
  __________              ____|_______              ____|_______
 |          |  port add  |            | func state |            |
 |          |-------->---|            |------>-----|            |
 | invalid  |            |  inactive  | set active |   active   |
 |          |  port del  |            |            |            |
 |__________|--<---------|____________|            |____________|
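
The admin state diagram above can be written down as a transition table.
The following Python sketch is only an illustration of the diagram; the
event names and the run() helper are made up for this example. One
interpretation is baked in: the diagram routes "port delete" from "active"
along the same arrow as "set inactive", so the sketch treats deleting an
active port as an implicit deactivation followed by the delete.

```python
from enum import Enum


class FuncState(Enum):
    INVALID = "invalid"
    INACTIVE = "inactive"
    ACTIVE = "active"


# (current state, event) -> list of states visited, in order.
TRANSITIONS = {
    (FuncState.INVALID, "port add"): [FuncState.INACTIVE],
    (FuncState.INACTIVE, "set active"): [FuncState.ACTIVE],
    (FuncState.ACTIVE, "set inactive"): [FuncState.INACTIVE],
    (FuncState.INACTIVE, "port del"): [FuncState.INVALID],
    # Assumption: deleting an active port passes through "inactive".
    (FuncState.ACTIVE, "port del"): [FuncState.INACTIVE, FuncState.INVALID],
}


def run(events, state=FuncState.INVALID):
    """Replay events and return every state visited, starting state first."""
    trace = [state]
    for event in events:
        key = (trace[-1], event)
        if key not in TRANSITIONS:
            raise ValueError(
                f"event {event!r} not allowed in state {trace[-1].value}"
            )
        trace.extend(TRANSITIONS[key])
    return trace
```

For example, run(["port add", "set active", "set inactive", "port del"])
walks the SF lifecycle from the SF section and ends back in "invalid".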



func operational state machine:
-------------------------------
  __________                ____________              ___________
 |          | func state   |            | driver bus |           |
 | invalid  |-------->-----|  detached  |------>-----| attached  |
 |          | set active   |            | probe()    |           |
 |__________|              |____________|            |___________|
                                 |                        |
                                 ^                    func set
                                 |                    inactive
                            successful detach             |
                              or pf reset                 |
                             ____|_______                 |
                            |            | driver bus     |
                 -----------| detaching  |---<-------------
                 |          |            | remove()
                 ^          |____________|
                 |   timeout      |
                 --<---------------
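
The operational state diagram can be sketched the same way. This is a
reading of the diagram, not driver code: the event strings are taken from
the arrow labels, the function names are hypothetical, and "timeout" is
modelled as a self-loop that keeps the func in "detaching".

```python
# (current opstate, event) -> next opstate, following the arrows above.
OP_TRANSITIONS = {
    ("invalid", "func state set active"): "detached",
    ("detached", "driver bus probe()"): "attached",
    # The driver bus remove() callback runs on this edge.
    ("attached", "func set inactive"): "detaching",
    ("detaching", "successful detach"): "detached",
    ("detaching", "pf reset"): "detached",
    # Assumption: a timeout leaves the func in "detaching".
    ("detaching", "timeout"): "detaching",
}


def next_opstate(opstate, event):
    """Return the opstate reached from 'opstate' on 'event'."""
    try:
        return OP_TRANSITIONS[(opstate, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} not allowed in opstate {opstate!r}"
        ) from None


def replay(events, opstate="invalid"):
    """Apply a sequence of events and return the final opstate."""
    for event in events:
        opstate = next_opstate(opstate, event)
    return opstate
```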



==================================================================
||                                                              ||
||             Port func rate user cmdline API draft            ||
||                                                              ||
==================================================================

Note that some of the "devlink port func" attributes in show commands
are omitted on purpose.

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour pcivf pfnum 0 vfnum 1 type eth netdev enp6s0pf0vf1
    func: hw_addr 10:22:33:44:55:66 state active
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 2 type eth netdev enp6s0pf0vf2
    func: hw_addr 10:22:33:44:55:77 state active
pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 3 type eth netdev enp6s0pf0vf3
    func: hw_addr 10:22:33:44:55:88 state active
pci/0000:06:00.0/4: flavour pcisf pfnum 0 sfnum 1 type eth netdev enp6s0pf0sf1
    func: hw_addr 10:22:33:44:55:99 state active

The port/func object is extended with a new rate object.

$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf

This shows the leaf objects created by default alongside the port/func objects.
No min or max tx rates were set, so their values are omitted.


Now create new node rate object:

$ devlink port func rate add pci/0000:06:00.0/somerategroup

$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node

New node rate object was created - the last line.


Now create another node object, this time with some attributes:

$ devlink port func rate add pci/0000:06:00.0/secondrategroup min_tx_rate 20 max_tx_rate 1000

$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000

Another new node object was created - the last line. The object has min and max
tx rates set, so they are displayed after the object type.


Now set node named somerategroup min/max rate using rate object:

$ devlink port func rate set pci/0000:06:00.0/somerategroup min_tx_rate 50 max_tx_rate 5000

$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now set leaf port/func rate using rate object:

$ devlink port func rate set pci/0000:06:00.0/2 min_tx_rate 10 max_tx_rate 10000

$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now set leaf func of port with index 2 parent node using rate object:

$ devlink port func rate set pci/0000:06:00.0/2 parent somerategroup

$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now set the parent node of the leaf func of port index 1 using the rate object:

$ devlink port func rate set pci/0000:06:00.0/1 parent somerategroup

$ devlink port func rate
pci/0000:06:00.0/1: type leaf parent somerategroup
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now unset the parent node of the leaf func of port index 1 using the rate object:

$ devlink port func rate set pci/0000:06:00.0/1 noparent

$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now delete the node object:

$ devlink port func rate del pci/0000:06:00.0/somerategroup

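$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000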

The rate node object was removed and its only child, pci/0000:06:00.0/2, was
automatically detached.
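
The rate walkthrough above can be summarized with a small Python model. This
is only a sketch of the described semantics, not devlink code: the RateTree
class and its method names are hypothetical, and the rule that leaves cannot
be deleted is an assumption borrowed from the ib section. In this model,
deleting a node drops only the parent link of its children.

```python
class RateTree:
    """Toy model of the rate objects under one devlink handle (hypothetical)."""

    def __init__(self, handle, leaves):
        self.handle = handle
        # Every port/func starts as a leaf with no rate attributes and no parent.
        self.objects = {name: {"type": "leaf"} for name in leaves}

    def add(self, name, **rates):
        """Model of 'rate add': create a node, optionally with rates."""
        if name in self.objects:
            raise ValueError(f"{name} already exists")
        self.objects[name] = {"type": "node", **rates}

    def set(self, name, parent=None, noparent=False, **rates):
        """Model of 'rate set': update rates and/or the parent link."""
        obj = self.objects[name]
        obj.update(rates)
        if parent is not None:
            if self.objects[parent]["type"] != "node":
                raise ValueError("parent must be a node")
            obj["parent"] = parent
        if noparent:
            obj.pop("parent", None)

    def delete(self, name):
        """Model of 'rate del': remove a node and detach its children."""
        if self.objects[name]["type"] != "node":
            # Assumption: leaves cannot be deleted, as in the ib section.
            raise ValueError("Operation not supported")
        del self.objects[name]
        for obj in self.objects.values():
            if obj.get("parent") == name:
                del obj["parent"]

    def show(self):
        """Return lines formatted like the 'rate' listing."""
        lines = []
        for name, obj in self.objects.items():
            attrs = "".join(f" {k} {v}" for k, v in obj.items() if k != "type")
            lines.append(f"{self.handle}/{name}: type {obj['type']}{attrs}")
        return lines


tree = RateTree("pci/0000:06:00.0", ["1", "2", "3", "4"])
tree.add("somerategroup")
tree.set("2", min_tx_rate=10, max_tx_rate=10000)
tree.set("2", parent="somerategroup")
tree.delete("somerategroup")
print("\n".join(tree.show()))
```

Keeping the parent as a plain attribute of the child makes detaching on node
deletion a single pass over the objects.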



==================================================================
||                                                              ||
||         Port func ib grouping user cmdline API draft         ||
||                                                              ||
==================================================================

Note that some of the "devlink port func" attributes in show commands
are omitted on purpose.

The reason for this IB grouping is that the VFs inside a virtual machine
get information (via the device) about which two or more VF devices should
be combined to form one multi-port IB device. In the virtual
machine, it is the driver's responsibility to set up the combined
multi-port IB devices.

Consider the following setup:

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
    func: hw_addr 10:22:33:44:55:66 state active
pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 type eth netdev enp6s0pf0vf1
    func: hw_addr 10:22:33:44:55:77 state active
pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 type eth netdev enp6s0pf1vf0
    func: hw_addr 10:22:33:44:55:88 state active
pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 type eth netdev enp6s0pf1vf1
    func: hw_addr 10:22:33:44:55:99 state active


Each VF/SF port/func has an IB leaf object related to it:

$ devlink port func ib
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf

You can see that by default, each port/func is marked as a leaf.
There is no grouping set.


The user may add an ib group node by issuing the following command:

$ devlink port func ib add pci/0000:06:00.0/someibgroup1

$ devlink port func ib
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf
pci/0000:06:00.0/someibgroup1: type node

A new ib node object was created - the last line.


Now set the parent node of the leaf func of port index 2 using the ib object:

$ devlink port func ib set pci/0000:06:00.0/2 parent someibgroup1

$ devlink port func ib
pci/0000:06:00.0/2: type leaf parent someibgroup1
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf
pci/0000:06:00.0/someibgroup1: type node


Now set the parent node of the leaf func of port index 5 using the ib object:

$ devlink port func ib set pci/0000:06:00.0/5 parent someibgroup1

$ devlink port func ib
pci/0000:06:00.0/2: type leaf parent someibgroup1
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf parent someibgroup1
pci/0000:06:00.0/someibgroup1: type node

Now you can see there are 2 leaf funcs configured to have one parent.
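
A consumer of this listing would typically want to know which leaves share a
parent. The following Python sketch parses the draft output format shown above
and groups leaves by their parent node. The helper names parse_show and
leaves_by_parent are hypothetical, and the parser assumes every attribute is a
"key value" pair, as in the examples.

```python
from collections import defaultdict


def parse_show(text):
    """Parse 'handle: key value key value ...' lines into {handle: attrs}."""
    objects = {}
    for line in text.strip().splitlines():
        # The handle itself contains ':' (PCI address), so split on ': '.
        handle, _, rest = line.partition(": ")
        tokens = rest.split()
        objects[handle] = dict(zip(tokens[0::2], tokens[1::2]))
    return objects


def leaves_by_parent(objects):
    """Group leaf handles by the name of their parent node."""
    groups = defaultdict(list)
    for handle, attrs in objects.items():
        if attrs.get("type") == "leaf" and "parent" in attrs:
            groups[attrs["parent"]].append(handle)
    return dict(groups)


SAMPLE = """
pci/0000:06:00.0/2: type leaf parent someibgroup1
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf parent someibgroup1
pci/0000:06:00.0/someibgroup1: type node
"""

print(leaves_by_parent(parse_show(SAMPLE)))
```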


To remove the parent link, the user should issue the following command:

$ devlink port func ib set pci/0000:06:00.0/5 noparent

$ devlink port func ib
pci/0000:06:00.0/2: type leaf parent someibgroup1
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf
pci/0000:06:00.0/someibgroup1: type node


Now delete the node object:

$ devlink port func ib del pci/0000:06:00.0/someibgroup1

$ devlink port func ib
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf

The node object was removed and its only child, pci/0000:06:00.0/2, was
automatically detached.


It is not possible to delete leaves:

$ devlink port func ib del pci/0000:06:00.0/2
devlink answers: Operation not supported



==================================================================
||                                                              ||
||            Dynamic PFs user cmdline API draft                ||
||                                                              ||
==================================================================

The user might want to create another PF, similarly to a VF.
TODO


* Re: [RFC v2] current devlink extension plan for NICs
  2020-05-01  9:14 [RFC v2] current devlink extension plan for NICs Jiri Pirko
@ 2020-05-04  2:12 ` Samudrala, Sridhar
  2020-05-04 11:42   ` Jiri Pirko
  2020-05-10 14:45 ` Jiri Pirko
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Samudrala, Sridhar @ 2020-05-04  2:12 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta



On 5/1/2020 2:14 AM, Jiri Pirko wrote:
> Hi all.
> 
> First, I would like to apologize for very long email. But I think it
> would be beneficial to the see the whole picture, where we are going.
> 
> Currently we are working internally on several features with
> need of extension of the current devlink infrastructure. I took a stab
> at putting it all together in a single txt file, inlined below.
> 
> Most of the stuff is based on a new port sub-object called "func"
> (called "slice" previously" and "subdev" originally in Yuval's patchsets
> sent some while ago).
> 
> The text describes how things should behave and provides a draft
> of user facing console input/outputs. I think it is important to clear
> that up before we go in and implement the devlink core and
> driver pieces.
> 
> I would like to ask you to read this and comment. Especially, I would
> like to ask vendors if what is described fits the needs of your
> NIC/e-switch.
> 
> Please note that something is already implemented, but most of this
> isn't (see "what needs to be implemented" section).
> 
> v1->v2
> - mainly move from separate slice object into port/func subobject
> - couple of small fixes here and there
> 

<snip>

> 
> 
> 
> ==================================================================
> ||                                                              ||
> ||             Port func user cmdline API draft                 ||
> ||                                                              ||
> ==================================================================
> 
> Note that some of the "devlink port" attributes may be forgotten or misordered.
> 
> Funcs are created as sub-objects of ports where it makes sense to have them
> The driver takes care of that. The "func" is a handle to configure "the other
> side of the wire". The original port object has port leve properties,
> the new "func" sub-object on the other hand has device level properties".
> 
> This is example for the HOST A from the example above:
> 
> $ devlink port show
> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> pci/0000:06:00.0/2: flavour pcipf pfnum 2 type eth netdev enp6s0pf2
>                      func: hw_addr 10:22:33:44:55:66 state active
> pci/0000:06:00.0/3: flavour pcivf pfnum 2 vfnum 0 type eth netdev enp6s0pf2vf0
>                      func: hw_addr 10:22:33:44:55:77 state active
> pci/0000:06:00.0/4: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
>                      func: hw_addr 10:22:33:44:55:88 state active
> pci/0000:06:00.0/5: flavour pcisf pfnum 0 sfnum 1 type eth netdev enp6s0pf0sf1
>                      func: hw_addr 10:22:33:44:55:99 state active
> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2 type nvme
>                      func: state active


I am trying to understand how the current implementation of 'devlink 
port' is being refactored to support this new model.

Today 'devlink port show' on a system with a 2-port mlx5 NIC with 1 VF
created on each PF shows

pci/0000:af:00.0/1: type eth netdev enp175s0f0np0 flavour physical port 0
pci/0000:af:00.1/1: type eth netdev enp175s0f1np1 flavour physical port 1
pci/0000:af:00.2/1: type eth netdev enp175s0f2np0 flavour virtual port 0
pci/0000:af:08.2/1: type eth netdev enp175s8f2np0 flavour virtual port 0


Can you tell me how this will be represented in the new model?

It looks like you are assigning a pfnum to the physical port as well as the
PCI PF. However, I am a little confused, as both pfnum 0 and pfnum 1, which
seem to be 2 physical ports, have the same bus/dev/func 06:00.0, as do the
VF ports.

Thanks
Sridhar


* Re: [RFC v2] current devlink extension plan for NICs
  2020-05-04  2:12 ` Samudrala, Sridhar
@ 2020-05-04 11:42   ` Jiri Pirko
  0 siblings, 0 replies; 19+ messages in thread
From: Jiri Pirko @ 2020-05-04 11:42 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: netdev, davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta

Mon, May 04, 2020 at 04:12:18AM CEST, sridhar.samudrala@intel.com wrote:
>
>
>On 5/1/2020 2:14 AM, Jiri Pirko wrote:
>> Hi all.
>> 
>> First, I would like to apologize for very long email. But I think it
>> would be beneficial to the see the whole picture, where we are going.
>> 
>> Currently we are working internally on several features with
>> need of extension of the current devlink infrastructure. I took a stab
>> at putting it all together in a single txt file, inlined below.
>> 
>> Most of the stuff is based on a new port sub-object called "func"
>> (called "slice" previously" and "subdev" originally in Yuval's patchsets
>> sent some while ago).
>> 
>> The text describes how things should behave and provides a draft
>> of user facing console input/outputs. I think it is important to clear
>> that up before we go in and implement the devlink core and
>> driver pieces.
>> 
>> I would like to ask you to read this and comment. Especially, I would
>> like to ask vendors if what is described fits the needs of your
>> NIC/e-switch.
>> 
>> Please note that something is already implemented, but most of this
>> isn't (see "what needs to be implemented" section).
>> 
>> v1->v2
>> - mainly move from separate slice object into port/func subobject
>> - couple of small fixes here and there
>> 
>
><snip>
>
>> 
>> 
>> 
>> ==================================================================
>> ||                                                              ||
>> ||             Port func user cmdline API draft                 ||
>> ||                                                              ||
>> ==================================================================
>> 
>> Note that some of the "devlink port" attributes may be forgotten or misordered.
>> 
>> Funcs are created as sub-objects of ports where it makes sense to have them
>> The driver takes care of that. The "func" is a handle to configure "the other
>> side of the wire". The original port object has port leve properties,
>> the new "func" sub-object on the other hand has device level properties".
>> 
>> This is example for the HOST A from the example above:
>> 
>> $ devlink port show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>> pci/0000:06:00.0/2: flavour pcipf pfnum 2 type eth netdev enp6s0pf2
>>                      func: hw_addr 10:22:33:44:55:66 state active
>> pci/0000:06:00.0/3: flavour pcivf pfnum 2 vfnum 0 type eth netdev enp6s0pf2vf0
>>                      func: hw_addr 10:22:33:44:55:77 state active
>> pci/0000:06:00.0/4: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
>>                      func: hw_addr 10:22:33:44:55:88 state active
>> pci/0000:06:00.0/5: flavour pcisf pfnum 0 sfnum 1 type eth netdev enp6s0pf0sf1
>>                      func: hw_addr 10:22:33:44:55:99 state active
>> pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2 type nvme
>>                      func: state active
>
>
>I am trying to understand how the current implementation of 'devlink port' is
>being refactored to support this new model.
>
>Today 'devlink port show' on a system with 2 port mlx5 NIC with 1 VFs created
>on each PF shows
>
>pci/0000:af:00.0/1: type eth netdev enp175s0f0np0 flavour physical port 0
>pci/0000:af:00.1/1: type eth netdev enp175s0f1np1 flavour physical port 1
>pci/0000:af:00.2/1: type eth netdev enp175s0f2np0 flavour virtual port 0
>pci/0000:af:08.2/1: type eth netdev enp175s8f2np0 flavour virtual port 0

The representor instances are not present here. They should be added.

The output would look like this:

pci/0000:af:00.0/1: type eth netdev enp175s0f0np0 flavour physical port 0 pfnum 0
pci/0000:af:00.0/2: type eth netdev enp175s0f1np1 flavour physical port 1 pfnum 1
pci/0000:af:00.0/3: type eth netdev enp175s0f0pf0vf0 flavour pcivf pfnum 0 vfnum 0
                      func: hw_addr 10:22:33:44:55:66 state active
pci/0000:af:00.0/4: type eth netdev enp175s0f1pf1vf0 flavour pcivf pfnum 1 vfnum 0
                      func: hw_addr 10:22:33:44:55:77 state active

pci/0000:af:00.2/1: type eth netdev enp175s0f2np0 flavour virtual port 0
pci/0000:af:08.2/1: type eth netdev enp175s8f2np0 flavour virtual port 0


Handle pci/0000:af:00.1 is an alias for pci/0000:af:00.0. It's the same
instance of the devlink object.




>
>
>Can you tell me how this will be represented in the new model?
>
>It looks like you are assigning a pfnum to physical port as well as PCI PF.
>However, i am little confused as both pfnum 0 and pfnum 1 which seem to be 2
>physical ports have the same bus/dev/func 06:00.0 and also the VF ports.

Two ports of the same NIC can be under 1 PF, or each can be a separate
PF. In both cases, there should be 1 devlink instance shared by both.
In my example, I mixed this up a bit, sorry about that. It should be one
of the following:
1) Same PF for both ports:
  pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
  pci/0000:06:00.0/1: flavour physical pfnum 0 type eth netdev enp6s0f0np2
2) Separate per-port PF:
  pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
  pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f1np2

  and pci/0000:06:00.1 is an alias for pci/0000:06:00.0
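
The alias relation above can be illustrated with a short Python sketch. It is
only an illustration under assumptions: the ALIASES table and the
canonical_port helper are hypothetical and do not correspond to any devlink
interface. The point is that a port named through the alias handle resolves
to the same port of the single shared devlink instance.

```python
# Hypothetical alias table: alias device handle -> canonical device handle.
ALIASES = {"pci/0000:06:00.1": "pci/0000:06:00.0"}


def canonical_port(port_handle):
    """Rewrite 'dev/index' so that an alias device maps to its canonical one."""
    dev, _, index = port_handle.rpartition("/")
    return f"{ALIASES.get(dev, dev)}/{index}"


print(canonical_port("pci/0000:06:00.1/1"))
print(canonical_port("pci/0000:06:00.0/1"))
```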


Will fix. Thanks!

>
>Thanks
>Sridhar


* Re: [RFC v2] current devlink extension plan for NICs
  2020-05-01  9:14 [RFC v2] current devlink extension plan for NICs Jiri Pirko
  2020-05-04  2:12 ` Samudrala, Sridhar
@ 2020-05-10 14:45 ` Jiri Pirko
  2020-05-10 16:30   ` Dave Taht
  2020-05-11 22:37   ` Jacob Keller
  2020-05-13 13:00 ` [oss-drivers] " Simon Horman
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 19+ messages in thread
From: Jiri Pirko @ 2020-05-10 14:45 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, sridhar.samudrala

Hello guys.

Anyone has any opinion on the proposal? Or should I take it as a silent
agreement? :)

We would like to go ahead and start sending patchsets.

Thanks!


* Re: [RFC v2] current devlink extension plan for NICs
  2020-05-10 14:45 ` Jiri Pirko
@ 2020-05-10 16:30   ` Dave Taht
  2020-05-11  5:32     ` Jiri Pirko
  2020-05-11 22:37   ` Jacob Keller
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Taht @ 2020-05-10 16:30 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Linux Kernel Network Developers, David S. Miller, Jakub Kicinski,
	parav, yuvalav, jgg, Saeed Mahameed, leon, andrew.gospodarek,
	michael.chan, moshe, ayal, Eran Ben Elisha, vladbu, kliteyn,
	dchickles, sburla, fmanlunas, Tariq Toukan, oss-drivers,
	Shannon Nelson, drivers, aelior, GR-everest-linux-l2,
	grygorii.strashko, mlxsw, Ido Schimmel, markz, jacob.e.keller,
	valex, linyunsheng, lihong.yang, vikas.gupta, sridhar.samudrala

On Sun, May 10, 2020 at 7:46 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Hello guys.
>
> Anyone has any opinion on the proposal? Or should I take it as a silent
> agreement? :)
>
> We would like to go ahead and start sending patchsets.

I gotta say that the whole thing makes my head really hurt, and while
this conversation is about how to go about configuring things,
I've been unable to get a grip on how flows will actually behave with
these offloads present.

My overall starting point for thinking about this stuff was described
in this preso to broadcom a few years back:
http://flent-fremont.bufferbloat.net/~d/broadcom_aug9.pdf

More recently I did what I think is my funniest talk ever on these
subjects: https://blog.apnic.net/2020/01/22/bufferbloat-may-be-solved-but-its-not-over-yet/

Make some popcorn, take a look. :) I should probably have covered
ecn's (mis)behaviors at the end, but I didn't.

Stephen Hemminger's LCA talk on these subjects was also a riot...

so somehow going from my understanding of how stuff gets configured,
to the actual result, is needed, for me to have any opinion at all.
You have this stuff basically running already? Can you run various
flent.org tests through it?

>
> Thanks!



-- 
"For a successful technology, reality must take precedence over public
relations, for Mother Nature cannot be fooled" - Richard Feynman

dave@taht.net <Dave Täht> CTO, TekLibre, LLC Tel: 1-831-435-0729


* Re: [RFC v2] current devlink extension plan for NICs
  2020-05-10 16:30   ` Dave Taht
@ 2020-05-11  5:32     ` Jiri Pirko
  0 siblings, 0 replies; 19+ messages in thread
From: Jiri Pirko @ 2020-05-11  5:32 UTC (permalink / raw)
  To: Dave Taht
  Cc: Linux Kernel Network Developers, David S. Miller, Jakub Kicinski,
	parav, yuvalav, jgg, Saeed Mahameed, leon, andrew.gospodarek,
	michael.chan, moshe, ayal, Eran Ben Elisha, vladbu, kliteyn,
	dchickles, sburla, fmanlunas, Tariq Toukan, oss-drivers,
	Shannon Nelson, drivers, aelior, GR-everest-linux-l2,
	grygorii.strashko, mlxsw, Ido Schimmel, markz, jacob.e.keller,
	valex, linyunsheng, lihong.yang, vikas.gupta, sridhar.samudrala

Sun, May 10, 2020 at 06:30:59PM CEST, dave.taht@gmail.com wrote:
>On Sun, May 10, 2020 at 7:46 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Hello guys.
>>
>> Anyone has any opinion on the proposal? Or should I take it as a silent
>> agreement? :)
>>
>> We would like to go ahead and start sending patchsets.
>
>I gotta say that the whole thing makes my head really hurt, and while
>this conversation is about how to go about configuring things,
>I've been unable to get a grip on how flows will actually behave with
>these offloads present.

As you said, this is about configuration. Not the actual packet
processing.


>
>My overall starting point for thinking about this stuff was described
>in this preso to broadcom a few years back:
>http://flent-fremont.bufferbloat.net/~d/broadcom_aug9.pdf
>
>More recently I did what I think is my funniest talk ever on these
>subjects: https://blog.apnic.net/2020/01/22/bufferbloat-may-be-solved-but-its-not-over-yet/
>
>Make some popcorn, take a look. :) I should probably have covered
>ecn's (mis)behaviors at the end, but I didn't.
>
>Steven hemminger's lca talk on these subjects was also a riot...
>
>so somehow going from my understanding of how stuff gets configured,
>to the actual result, is needed, for me to have any opinion at all.
>You
>have this stuff basically running already? Can you run various
>flent.org tests through it?
>
>>
>> Thanks!
>
>
>
>-- 
>"For a successful technology, reality must take precedence over public
>relations, for Mother Nature cannot be fooled" - Richard Feynman
>
>dave@taht.net <Dave Täht> CTO, TekLibre, LLC Tel: 1-831-435-0729


* Re: [RFC v2] current devlink extension plan for NICs
  2020-05-10 14:45 ` Jiri Pirko
  2020-05-10 16:30   ` Dave Taht
@ 2020-05-11 22:37   ` Jacob Keller
  1 sibling, 0 replies; 19+ messages in thread
From: Jacob Keller @ 2020-05-11 22:37 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, valex, linyunsheng, lihong.yang,
	vikas.gupta, sridhar.samudrala



On 5/10/2020 7:45 AM, Jiri Pirko wrote:
> Hello guys.
> 
> Anyone has any opinion on the proposal? Or should I take it as a silent
> agreement? :)

I am still reading through this whole thing. I haven't given feedback
yet because it is a lot to process.

-Jake


* Re: [oss-drivers] [RFC v2] current devlink extension plan for NICs
  2020-05-01  9:14 [RFC v2] current devlink extension plan for NICs Jiri Pirko
  2020-05-04  2:12 ` Samudrala, Sridhar
  2020-05-10 14:45 ` Jiri Pirko
@ 2020-05-13 13:00 ` Simon Horman
  2020-05-14  6:07   ` Jiri Pirko
  2020-05-14 23:52 ` Jacob Keller
  2020-05-19  9:22 ` [RFC v3] " Jiri Pirko
  4 siblings, 1 reply; 19+ messages in thread
From: Simon Horman @ 2020-05-13 13:00 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, sridhar.samudrala

On Fri, May 01, 2020 at 11:14:49AM +0200, Jiri Pirko wrote:
> Hi all.
> 
> First, I would like to apologize for very long email. But I think it
> would be beneficial to the see the whole picture, where we are going.
> 
> Currently we are working internally on several features with
> need of extension of the current devlink infrastructure. I took a stab
> at putting it all together in a single txt file, inlined below.
> 
> Most of the stuff is based on a new port sub-object called "func"
> (called "slice" previously" and "subdev" originally in Yuval's patchsets
> sent some while ago).
> 
> The text describes how things should behave and provides a draft
> of user facing console input/outputs. I think it is important to clear
> that up before we go in and implement the devlink core and
> driver pieces.
> 
> I would like to ask you to read this and comment. Especially, I would
> like to ask vendors if what is described fits the needs of your
> NIC/e-switch.
> 
> Please note that something is already implemented, but most of this
> isn't (see "what needs to be implemented" section).
> 
> v1->v2
> - mainly move from separate slice object into port/func subobject
> - couple of small fixes here and there
> 
> 
> 
> 
> ==================================================================
> ||                                                              ||
> ||            Overall illustration of example setup             ||
> ||                                                              ||
> ==================================================================
> 
> Note that there are 2 hosts in the picture. Host A may be the smartnic host,
> Host B may be one of the hosts which gets PF. Also, you might omit
> the Host B and just see Host A like an ordinary nic in a host.
> 
> Note that the PF is merged with physical port representor.
> That is due to simpler and flawless transition from legacy mode and back.
> The devlink_ports and netdevs for physical ports are staying during
> the transition.

Hi Jiri,

I'm probably missing something obvious but this merge seems at odds with
the Netronome hardware.

We model a PF as, in a nutshell, a PCIe link to a host. A chip may have
one or more, and these may go to the same or different hosts. A chip may
also have one or more physical ports. And there is no strict relationship
between a PF and a physical port.

Of course in SR-IOV legacy mode, there is such a relationship, but it's not
inherent to the hardware nor the NFP driver implementation of SR-IOV
switchdev mode.

...


* Re: [oss-drivers] [RFC v2] current devlink extension plan for NICs
  2020-05-13 13:00 ` [oss-drivers] " Simon Horman
@ 2020-05-14  6:07   ` Jiri Pirko
  0 siblings, 0 replies; 19+ messages in thread
From: Jiri Pirko @ 2020-05-14  6:07 UTC (permalink / raw)
  To: Simon Horman
  Cc: netdev, davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, sridhar.samudrala

Wed, May 13, 2020 at 03:00:09PM CEST, simon.horman@netronome.com wrote:
>On Fri, May 01, 2020 at 11:14:49AM +0200, Jiri Pirko wrote:
>> Hi all.
>> 
>> First, I would like to apologize for very long email. But I think it
>> would be beneficial to the see the whole picture, where we are going.
>> 
>> Currently we are working internally on several features with
>> need of extension of the current devlink infrastructure. I took a stab
>> at putting it all together in a single txt file, inlined below.
>> 
>> Most of the stuff is based on a new port sub-object called "func"
>> (called "slice" previously" and "subdev" originally in Yuval's patchsets
>> sent some while ago).
>> 
>> The text describes how things should behave and provides a draft
>> of user facing console input/outputs. I think it is important to clear
>> that up before we go in and implement the devlink core and
>> driver pieces.
>> 
>> I would like to ask you to read this and comment. Especially, I would
>> like to ask vendors if what is described fits the needs of your
>> NIC/e-switch.
>> 
>> Please note that something is already implemented, but most of this
>> isn't (see "what needs to be implemented" section).
>> 
>> v1->v2
>> - mainly move from separate slice object into port/func subobject
>> - couple of small fixes here and there
>> 
>> 
>> 
>> 
>> ==================================================================
>> ||                                                              ||
>> ||            Overall illustration of example setup             ||
>> ||                                                              ||
>> ==================================================================
>> 
>> Note that there are 2 hosts in the picture. Host A may be the smartnic host,
>> Host B may be one of the hosts which gets PF. Also, you might omit
>> the Host B and just see Host A like an ordinary nic in a host.
>> 
>> Note that the PF is merged with physical port representor.
>> That is due to simpler and flawless transition from legacy mode and back.
>> The devlink_ports and netdevs for physical ports are staying during
>> the transition.
>
>Hi Jiri,
>
>I'm probably missing something obvious but this merge seems at odds with
>the Netronome hardware.
>
>We model a PF as, in a nutshell, a PCIE link to a host. A chip may have
>one or more, and these may go to the same or different hosts. A chip may
>also have one or more physical ports. And there is no strict relationship
>between a PF and a physical port.

Yeah, no problem. You can have multiple physical ports under the same
devlink instance. In that case, from devlink perspective it is not
important if the physical port is backed by PF or not. I will rephrase a
bit so this is clear.


>
>Of course in SR-IOV legacy mode, there is such a relationship, but its not
>inherent to the hardware nor the NFP driver implementation of SR-IOV
>switchdev mode.
>
>...


* Re: [RFC v2] current devlink extension plan for NICs
  2020-05-01  9:14 [RFC v2] current devlink extension plan for NICs Jiri Pirko
                   ` (2 preceding siblings ...)
  2020-05-13 13:00 ` [oss-drivers] " Simon Horman
@ 2020-05-14 23:52 ` Jacob Keller
  2020-05-15  9:30   ` Jiri Pirko
  2020-05-19  9:22 ` [RFC v3] " Jiri Pirko
  4 siblings, 1 reply; 19+ messages in thread
From: Jacob Keller @ 2020-05-14 23:52 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, valex, linyunsheng, lihong.yang,
	vikas.gupta, sridhar.samudrala



On 5/1/2020 2:14 AM, Jiri Pirko wrote:
> ==================================================================
> ||                                                              ||
> ||          SF (subfunction) user cmdline API draft             ||
> ||                                                              ||
> ==================================================================
> 
> Note that some of the "devlink port" attributes may be forgotten,
> misordered or omitted on purpose.
> 
> $ devlink port show
> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
>                     func: hw_addr 10:22:33:44:55:66 state active
> 
> There is one VF on the NIC.
> 
> Now create subfunction of SF0 on PF1, index of the port is going to be 100:
> 

Here, you say "SF0 on PF1", but you then specify sfnum as 10 below. Is
there some naming scheme or terminology here?

> $ devlink port add pci/0000.06.00.0/100 flavour pcisf pfnum 1 sfnum 10
> 

Can you clarify what sfnum means here, and why it is different from the
index? I get that the index is a unique number that identifies the port
regardless of type, so sfnum must be some sort of hardware internal
identifier?

When looking at this with colleagues, there was a lot of confusion about
the difference between the index and the sfnum.

> The devlink kernel code calls down to device driver (devlink op) and asks
> it to create a SF port with particular attributes. Driver then instantiates
> the SF port in the same way it is done for VF.
> 

What do you mean by attributes here? What sort of attributes can be
requested?

> 
> Note that it may be possible to avoid passing port index and let the
> kernel assign index for you:
> $ devlink port add pci/0000.06.00.0 flavour pcisf pfnum 1 sfnum 10
> 
> This would work in a similar way as devlink region id assignment that
> is being pushed now.
> 

Sure, this makes sense to me after seeing Jakub's recent patch for
regions. I like this approach. Letting the user not have to pick an ID
ahead of time is useful.

Is it possible to skip providing an sfnum, and let the kernel or driver
pick one? Or does that not make sense?

> ==================================================================
> ||                                                              ||
> ||   VF manual creation and activation user cmdline API draft   ||
> ||                                                              ||
> ==================================================================
> 
> To enter manual mode, the user has to turn off VF dummies creation:
> $ devlink dev set pci/0000:06:00.0 vf_dummies disabled
> $ devlink dev show
> pci/0000:06:00.0: vf_dummies disabled
> 
> It is "enabled" by default in order not to break existing users.
> 
> By setting the "vf_dummies" attribute to "disabled", the driver
> removes all dummy VFs. Only physical ports are present:
> 
> $ devlink port show
> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
> 
> Then the user is able to create them in a similar way as SFs:
> 
> $ devlink port add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8
> 

So in this case, you have to specify the VF index to create? So this
vfnum is very similar to the sfnum (and pfnum?) above?

What about the ability to just say "please give me a VF, but I don't
care which one"?

> The devlink kernel code calls down to device driver (devlink op) and asks
> it to create a VF port with particular attributes. Driver then instantiates
> the VF port with func.
> 

> 
> ==================================================================
> ||                                                              ||
> ||                             PFs                              ||
> ||                                                              ||
> ==================================================================
> 
> There are 2 flavours of PFs:
> 1) Parent PF. That is coupled with uplink port. The flavour is:
>     a) "physical" - in case the uplink port is actual port in the NIC.
>     b) "virtual" - in case this Parent PF is actually a leg to
>        upstream embedded switch.

So "physical" is for the physical NIC port. Ok. And "virtual" is one
side of an internal embedded switch. This makes sense.

> 
>    $ devlink port show
>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
> 
>    If there is another parent PF, say "0000:06:00.1", that share the
>    same embedded switch, the aliasing is established for devlink handles.
> 
>    The user can use devlink handles:
>    pci/0000:06:00.0
>    pci/0000:06:00.1
>    as equivalents, pointing to the same devlink instance.
> 
>    Parent PFs are the ones that may be in control of managing
>    embedded switch, on any hierarchy level.
> 2) Child PF. This is a leg of a PF put to the parent PF. It is
>    represented by a port with a netdevice and func:
> 
>    $ devlink port show
>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>    pci/0000:06:00.0/1: flavour pcipf pfnum 2 type eth netdev enp6s0f0pf2
>        func: hw_addr aa:bb:cc:aa:bb:87 state active
> 
>    This is a typical smartnic scenario. You would see this list on
>    the smartnic CPU. The port pci/0000:06:00.0/1 is a leg to
>    one of the hosts. If you send packets to enp6s0f0pf2, they will
>    go to the child PF.
> 
>    Note that inside the host, the PF is represented again as "Parent PF"
>    and may be used to configure nested embedded switch.
> 
> 

I'm not sure I understand this section. Child PF? Is this like a PF in
another host? Or representing the other side of the virtual link?
> 
> ==================================================================
> ||                                                              ||
> ||            Dynamic PFs user cmdline API draft                ||
> ||                                                              ||
> ==================================================================
> 
> User might want to create another PF, similar as VF.
> TODO
> 

Obviously this is a TODO, but how does this differ from the current
port_split and port_unsplit?


* Re: [RFC v2] current devlink extension plan for NICs
  2020-05-14 23:52 ` Jacob Keller
@ 2020-05-15  9:30   ` Jiri Pirko
  2020-05-15 21:36     ` Jacob Keller
  0 siblings, 1 reply; 19+ messages in thread
From: Jiri Pirko @ 2020-05-15  9:30 UTC (permalink / raw)
  To: Jacob Keller
  Cc: netdev, davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, valex, linyunsheng, lihong.yang,
	vikas.gupta, sridhar.samudrala

Fri, May 15, 2020 at 01:52:54AM CEST, jacob.e.keller@intel.com wrote:
>
>
>On 5/1/2020 2:14 AM, Jiri Pirko wrote:
>> ==================================================================
>> ||                                                              ||
>> ||          SF (subfunction) user cmdline API draft             ||
>> ||                                                              ||
>> ==================================================================
>> 
>> Note that some of the "devlink port" attributes may be forgotten,
>> misordered or omitted on purpose.
>> 
>> $ devlink port show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>> pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
>>                     func: hw_addr 10:22:33:44:55:66 state active
>> 
>> There is one VF on the NIC.
>> 
>> Now create subfunction of SF0 on PF1, index of the port is going to be 100:
>> 
>
>Here, you say "SF0 on PF1", but you then specify sfnum as 10 below.. Is
>there some naming scheme or terminology here?

Typo, will fix.


>
>> $ devlink port add pci/0000.06.00.0/100 flavour pcisf pfnum 1 sfnum 10
>> 
>
>Can you clarify what sfnum means here? and why is it different from the
>index? I get that the index is a unique number that identifies the port
>regardless of type, so sfnum must be some sort of hardware internal
>identifier?

Basically pfnum, sfnum and vfnum could overlap. Index is unique within
all groups together.
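
To illustrate the distinction, here is a minimal Python sketch (not
devlink code; the Port and PortTable names are hypothetical and exist
only for this example). It models one devlink instance in which the
port index must be unique across all ports, while vfnum and sfnum
values may coincide because they live in separate per-flavour spaces.
The ports mirror the "devlink port show" output quoted above, plus one
extra SF (index 101) that is assumed here purely to show an overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Port:
    index: int                 # port handle, unique within the devlink instance
    flavour: str               # "physical", "pcipf", "pcivf" or "pcisf"
    pfnum: int
    num: Optional[int] = None  # vfnum or sfnum; None for PF-level ports


class PortTable:
    """Hypothetical model of the ports of a single devlink instance."""

    def __init__(self):
        self._by_index = {}

    def add(self, port):
        # The index is one namespace shared by every flavour.
        if port.index in self._by_index:
            raise ValueError(f"port index {port.index} already in use")
        # The "nums" only need to be unique within their own flavour.
        key = (port.flavour, port.pfnum, port.num)
        for other in self._by_index.values():
            if (other.flavour, other.pfnum, other.num) == key:
                raise ValueError(f"{port.flavour} port {key[1:]} already exists")
        self._by_index[port.index] = port

    def __len__(self):
        return len(self._by_index)


table = PortTable()
table.add(Port(0, "physical", pfnum=0))
table.add(Port(1, "physical", pfnum=1))
table.add(Port(2, "pcivf", pfnum=0, num=0))     # vfnum 0 on PF0
table.add(Port(100, "pcisf", pfnum=1, num=10))  # sfnum 10 on PF1
table.add(Port(101, "pcisf", pfnum=0, num=0))   # sfnum 0 on PF0: overlaps vfnum 0, allowed

try:
    table.add(Port(100, "pcivf", pfnum=1, num=8))  # index 100 is already taken
    index_reused = True
except ValueError:
    index_reused = False
```

In this model the index plays the role of a pure handle, while the
(flavour, pfnum, num) triple describes what the port represents.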


>
>When looking at this with colleagues, there was a lot of confusion about
>the difference between the index and the sfnum.

No confusion about index and pfnum/vfnum? They behave the same.
Index is just a port handle.


>
>> The devlink kernel code calls down to device driver (devlink op) and asks
>> it to create a SF port with particular attributes. Driver then instantiates
>> the SF port in the same way it is done for VF.
>> 
>
>What do you mean by attributes here? what sort of attributes can be
>requested?

In the original slice proposal, it was possible to pass the mac address
too. However, with the new approach (port func subobject), that is not
possible. I'll remove this leftover.


>
>> 
>> Note that it may be possible to avoid passing port index and let the
>> kernel assign index for you:
>> $ devlink port add pci/0000.06.00.0 flavour pcisf pfnum 1 sfnum 10
>> 
>> This would work in a similar way as devlink region id assignment that
>> is being pushed now.
>> 
>
>Sure, this makes sense to me after seeing Jakub's recent patch for
>regions. I like this approach. Letting the user not have to pick an ID
>ahead of time is useful.
>
>Is it possible to skip providing an sfnum, and let the kernel or driver
>pick one? Or does that not make sense?

Does not. The sfnum is something that should be deterministic. The sfnum
is then visible on the other side on the virtbus device:
/sys/bus/virtbus/devices/mlx5_sf.1/sfnum
and its name is generated accordingly: enp6s0f0s10



>
>> ==================================================================
>> ||                                                              ||
>> ||   VF manual creation and activation user cmdline API draft   ||
>> ||                                                              ||
>> ==================================================================
>> 
>> To enter manual mode, the user has to turn off VF dummies creation:
>> $ devlink dev set pci/0000:06:00.0 vf_dummies disabled
>> $ devlink dev show
>> pci/0000:06:00.0: vf_dummies disabled
>> 
>> It is "enabled" by default in order not to break existing users.
>> 
>> By setting the "vf_dummies" attribute to "disabled", the driver
>> removes all dummy VFs. Only physical ports are present:
>> 
>> $ devlink port show
>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>> 
>> Then the user is able to create them in a similar way as SFs:
>> 
>> $ devlink port add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8
>> 
>
>So in this case, you have to specify the VF index to create? So this
>vfnum is very similar to the sfnum (and pfnum?) above?

Yes.


>
>What about the ability to just say "please give me a VF, but I don't
>care which one"?

Well, that could be eventually done too, with Jakub's extension.


>
>> The devlink kernel code calls down to device driver (devlink op) and asks
>> it to create a VF port with particular attributes. Driver then instantiates
>> the VF port with func.
>> 
>
>> 
>> ==================================================================
>> ||                                                              ||
>> ||                             PFs                              ||
>> ||                                                              ||
>> ==================================================================
>> 
>> There are 2 flavours of PFs:
>> 1) Parent PF. That is coupled with uplink port. The flavour is:
>>     a) "physical" - in case the uplink port is actual port in the NIC.
>>     b) "virtual" - in case this Parent PF is actually a leg to
>>        upstream embedded switch.
>
>So "physical" is for the physical NIC port. Ok. And "virtual" is one
>side of an internal embedded switch. This makes sense.

Yes.


>
>> 
>>    $ devlink port show
>>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>> 
>>    If there is another parent PF, say "0000:06:00.1", that share the
>>    same embedded switch, the aliasing is established for devlink handles.
>> 
>>    The user can use devlink handles:
>>    pci/0000:06:00.0
>>    pci/0000:06:00.1
>>    as equivalents, pointing to the same devlink instance.
>> 
>>    Parent PFs are the ones that may be in control of managing
>>    embedded switch, on any hierarchy level.
>> 2) Child PF. This is a leg of a PF put to the parent PF. It is
>>    represented by a port with a netdevice and func:
>> 
>>    $ devlink port show
>>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>    pci/0000:06:00.0/1: flavour pcipf pfnum 2 type eth netdev enp6s0f0pf2
>>        func: hw_addr aa:bb:cc:aa:bb:87 state active
>> 
>>    This is a typical smartnic scenario. You would see this list on
>>    the smartnic CPU. The port pci/0000:06:00.0/1 is a leg to
>>    one of the hosts. If you send packets to enp6s0f0pf2, they will
>>    go to the child PF.
>> 
>>    Note that inside the host, the PF is represented again as "Parent PF"
>>    and may be used to configure nested embedded switch.
>> 
>> 
>
>I'm not sure I understand this section. Child PF? Is this like a PF in
>another host? Or representing the other side of the virtual link?

It's both actually, at the same time.



>> 
>> ==================================================================
>> ||                                                              ||
>> ||            Dynamic PFs user cmdline API draft                ||
>> ||                                                              ||
>> ==================================================================
>> 
>> User might want to create another PF, similar as VF.
>> TODO
>> 
>
>Obviously this is a TODO, but how does this differ from the current
>port_split and port_unsplit?

Does not have anything to do with port splitting. This is about creating
a "child PF" from the section above.



* Re: [RFC v2] current devlink extension plan for NICs
  2020-05-15  9:30   ` Jiri Pirko
@ 2020-05-15 21:36     ` Jacob Keller
  2020-05-18  6:52       ` Jiri Pirko
  0 siblings, 1 reply; 19+ messages in thread
From: Jacob Keller @ 2020-05-15 21:36 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, valex, linyunsheng, lihong.yang,
	vikas.gupta, sridhar.samudrala



On 5/15/2020 2:30 AM, Jiri Pirko wrote:
> Fri, May 15, 2020 at 01:52:54AM CEST, jacob.e.keller@intel.com wrote:
>>> $ devlink port add pci/0000.06.00.0/100 flavour pcisf pfnum 1 sfnum 10
>>>
>>
>> Can you clarify what sfnum means here? and why is it different from the
>> index? I get that the index is a unique number that identifies the port
>> regardless of type, so sfnum must be some sort of hardware internal
>> identifier?
> 
> Basically pfnum, sfnum and vfnum could overlap. Index is unique within
> all groups together.
> 

Right. Index is just an identifier for which port this is.

> 
>>
>> When looking at this with colleagues, there was a lot of confusion about
>> the difference between the index and the sfnum.
> 
> No confusion about index and pfnum/vfnum? They behave the same.
> Index is just a port handle.
> 

I'm less confused about the difference between index and these "nums",
and more so questioning what pfnum/vfnum/sfnum represent? Are they
similar to the vf ID that we have in the legacy SRIOV functions? I.e. a
hardware index?

I don't think in general users necessarily care which "index" they get
upfront. They obviously very much care about the index once it's
selected. I do believe the interfaces should start with the capability
for the index to be selected automatically at creation (with the
optional capability to select a specific index if desired, as shown here).

I do not think most users want to care about what to pick for this
number. (Just as they would not want to pick a number for the port index
either).

> 
>>
>>> The devlink kernel code calls down to device driver (devlink op) and asks
>>> it to create a SF port with particular attributes. Driver then instantiates
>>> the SF port in the same way it is done for VF.
>>>
>>
>> What do you mean by attributes here? what sort of attributes can be
>> requested?
> 
> In the original slice proposal, it was possible to pass the mac address
> too. However with new approach (port func subobject) that is not
> possible. I'll remove this rudiment.
> 

Ok.

> 
>>
>>>
>>> Note that it may be possible to avoid passing port index and let the
>>> kernel assign index for you:
>>> $ devlink port add pci/0000.06.00.0 flavour pcisf pfnum 1 sfnum 10
>>>
>>> This would work in a similar way as devlink region id assignment that
>>> is being pushed now.
>>>
>>
>> Sure, this makes sense to me after seeing Jakub's recent patch for
>> regions. I like this approach. Letting the user not have to pick an ID
>> ahead of time is useful.
>>
>> Is it possible to skip providing an sfnum, and let the kernel or driver
>> pick one? Or does that not make sense?
> 
> Does not. The sfnum is something that should be deterministic. The sfnum
> is then visible on the other side on the virtbus device:
> /sys/bus/virtbus/devices/mlx5_sf.1/sfnum
> and its name is generated accordingly: enp6s0f0s10
> 

Why not have the option to say "create me an sfnum and then report it to
me" in the same way we do with region numbers now and plan to with port
indexes?

Basically: why do I as a user of the front end care what this number
actually is? What does it represent?

> 
> 
>>
>>> ==================================================================
>>> ||                                                              ||
>>> ||   VF manual creation and activation user cmdline API draft   ||
>>> ||                                                              ||
>>> ==================================================================
>>>
>>> To enter manual mode, the user has to turn off VF dummies creation:
>>> $ devlink dev set pci/0000:06:00.0 vf_dummies disabled
>>> $ devlink dev show
>>> pci/0000:06:00.0: vf_dummies disabled
>>>
>>> It is "enabled" by default in order not to break existing users.
>>>
>>> By setting the "vf_dummies" attribute to "disabled", the driver
>>> removes all dummy VFs. Only physical ports are present:
>>>
>>> $ devlink port show
>>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>>>
>>> Then the user is able to create them in a similar way as SFs:
>>>
>>> $ devlink port add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8
>>>
>>
>> So in this case, you have to specify the VF index to create? So this
>> vfnum is very similar to the sfnum (and pfnum?) above?
> 
> Yes.
> 
> 
>>
>> What about the ability to just say "please give me a VF, but I don't
>> care which one"?
> 
> Well, that could be eventually done too, with Jakub's extension.
> 

Sure. I think that's what I was asking above as well. Ok.

>>>
>>>    $ devlink port show
>>>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>
>>>    If there is another parent PF, say "0000:06:00.1", that share the
>>>    same embedded switch, the aliasing is established for devlink handles.
>>>
>>>    The user can use devlink handles:
>>>    pci/0000:06:00.0
>>>    pci/0000:06:00.1
>>>    as equivalents, pointing to the same devlink instance.
>>>
>>>    Parent PFs are the ones that may be in control of managing
>>>    embedded switch, on any hierarchy level.
>>> 2) Child PF. This is a leg of a PF put to the parent PF. It is
>>>    represented by a port with a netdevice and func:
>>>
>>>    $ devlink port show
>>>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>    pci/0000:06:00.0/1: flavour pcipf pfnum 2 type eth netdev enp6s0f0pf2
>>>        func: hw_addr aa:bb:cc:aa:bb:87 state active
>>>
>>>    This is a typical smartnic scenario. You would see this list on
>>>    the smartnic CPU. The port pci/0000:06:00.0/1 is a leg to
>>>    one of the hosts. If you send packets to enp6s0f0pf2, they will
>>>    go to the child PF.
>>>
>>>    Note that inside the host, the PF is represented again as "Parent PF"
>>>    and may be used to configure nested embedded switch.
>>>
>>>
>>
>> I'm not sure I understand this section. Child PF? Is this like a PF in
>> another host? Or representing the other side of the virtual link?
> 
> It's both actually, at the same time.
> 
> 

Ok. I still don't think I fully grasp this yet.


>> Obviously this is a TODO, but how does this differ from the current
>> port_split and port_unsplit?
> 
> Does not have anything to do with port splitting. This is about creating
> a "child PF" from the section above.
> 

Hmm. Ok so this is about internal connections in the switch, then?


* Re: [RFC v2] current devlink extension plan for NICs
  2020-05-15 21:36     ` Jacob Keller
@ 2020-05-18  6:52       ` Jiri Pirko
  2020-05-18 21:05         ` Jacob Keller
  0 siblings, 1 reply; 19+ messages in thread
From: Jiri Pirko @ 2020-05-18  6:52 UTC (permalink / raw)
  To: Jacob Keller
  Cc: netdev, davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, valex, linyunsheng, lihong.yang,
	vikas.gupta, sridhar.samudrala

Fri, May 15, 2020 at 11:36:19PM CEST, jacob.e.keller@intel.com wrote:
>
>
>On 5/15/2020 2:30 AM, Jiri Pirko wrote:
>> Fri, May 15, 2020 at 01:52:54AM CEST, jacob.e.keller@intel.com wrote:
>>>> $ devlink port add pci/0000.06.00.0/100 flavour pcisf pfnum 1 sfnum 10
>>>>
>>>
>>> Can you clarify what sfnum means here? and why is it different from the
>>> index? I get that the index is a unique number that identifies the port
>>> regardless of type, so sfnum must be some sort of hardware internal
>>> identifier?
>> 
>> Basically pfnum, sfnum and vfnum could overlap. Index is unique within
>> all groups together.
>> 
>
>Right. Index is just an identifier for which port this is.
>
>> 
>>>
>>> When looking at this with colleagues, there was a lot of confusion about
>>> the difference between the index and the sfnum.
>> 
>> No confusion about index and pfnum/vfnum? They behave the same.
>> Index is just a port handle.
>> 
>
>I'm less confused about the difference between index and these "nums",
>and more so questioning what pfnum/vfnum/sfnum represent? Are they
>similar to the vf ID that we have in the legacy SRIOV functions? I.e. a
>hardware index?
>
>I don't think in general users necessarily care which "index" they get
>upfront. They obviously very much care about the index once it's
>selected. I do believe the interfaces should start with the capability
>for the index to be selected automatically at creation (with the
>optional capability to select a specific index if desired, as shown here).
>
>I do not think most users want to care about what to pick for this
>number. (Just as they would not want to pick a number for the port index
>either).

I see your point. However, I don't think it is always the right
scenario. The "nums" are used for naming the netdevices, both the
eswitch port representor and the actual SF (in case of SF).

I think that in a lot of use cases it is more convenient for the user
to select the "num" on the cmdline.



>
>> 
>>>
>>>> The devlink kernel code calls down to device driver (devlink op) and asks
>>>> it to create a SF port with particular attributes. Driver then instantiates
>>>> the SF port in the same way it is done for VF.
>>>>
>>>
>>> What do you mean by attributes here? what sort of attributes can be
>>> requested?
>> 
>> In the original slice proposal, it was possible to pass the mac address
>> too. However with new approach (port func subobject) that is not
>> possible. I'll remove this rudiment.
>> 
>
>Ok.
>
>> 
>>>
>>>>
>>>> Note that it may be possible to avoid passing port index and let the
>>>> kernel assign index for you:
>>>> $ devlink port add pci/0000.06.00.0 flavour pcisf pfnum 1 sfnum 10
>>>>
>>>> This would work in a similar way as devlink region id assignment that
>>>> is being pushed now.
>>>>
>>>
>>> Sure, this makes sense to me after seeing Jakub's recent patch for
>>> regions. I like this approach. Letting the user not have to pick an ID
>>> ahead of time is useful.
>>>
>>> Is it possible to skip providing an sfnum, and let the kernel or driver
>>> pick one? Or does that not make sense?
>> 
>> Does not. The sfnum is something that should be deterministic. The sfnum
>> is then visible on the other side on the virtbus device:
>> /sys/bus/virtbus/devices/mlx5_sf.1/sfnum
>> and its name is generated accordingly: enp6s0f0s10
>> 
>
>Why not have the option to say "create me an sfnum and then report it to
>me" in the same way we do with region numbers now and plan to with port
>indexes?

Sure, why not.


>
>Basically: why do I as a user of the front end care what this number
>actually is? What does it represent?

See my answer above.


>
>> 
>> 
>>>
>>>> ==================================================================
>>>> ||                                                              ||
>>>> ||   VF manual creation and activation user cmdline API draft   ||
>>>> ||                                                              ||
>>>> ==================================================================
>>>>
>>>> To enter manual mode, the user has to turn off VF dummies creation:
>>>> $ devlink dev set pci/0000:06:00.0 vf_dummies disabled
>>>> $ devlink dev show
>>>> pci/0000:06:00.0: vf_dummies disabled
>>>>
>>>> It is "enabled" by default in order not to break existing users.
>>>>
>>>> By setting the "vf_dummies" attribute to "disabled", the driver
>>>> removes all dummy VFs. Only physical ports are present:
>>>>
>>>> $ devlink port show
>>>> pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>> pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
>>>>
>>>> Then the user is able to create them in a similar way as SFs:
>>>>
>>>> $ devlink port add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8
>>>>
>>>
>>> So in this case, you have to specify the VF index to create? So this
>>> vfnum is very similar to the sfnum (and pfnum?) above?
>> 
>> Yes.
>> 
>> 
>>>
>>> What about the ability to just say "please give me a VF, but I don't
>>> care which one"?
>> 
>> Well, that could be eventually done too, with Jakub's extension.
>> 
>
>Sure. I think that's what I was asking above as well. Ok.
>
>>>>
>>>>    $ devlink port show
>>>>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>>
>>>>    If there is another parent PF, say "0000:06:00.1", that share the
>>>>    same embedded switch, the aliasing is established for devlink handles.
>>>>
>>>>    The user can use devlink handles:
>>>>    pci/0000:06:00.0
>>>>    pci/0000:06:00.1
>>>>    as equivalents, pointing to the same devlink instance.
>>>>
>>>>    Parent PFs are the ones that may be in control of managing
>>>>    embedded switch, on any hierarchy level.
>>>> 2) Child PF. This is a leg of a PF put to the parent PF. It is
>>>>    represented by a port with a netdevice and func:
>>>>
>>>>    $ devlink port show
>>>>    pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
>>>>    pci/0000:06:00.0/1: flavour pcipf pfnum 2 type eth netdev enp6s0f0pf2
>>>>        func: hw_addr aa:bb:cc:aa:bb:87 state active
>>>>
>>>>    This is a typical smartnic scenario. You would see this list on
>>>>    the smartnic CPU. The port pci/0000:06:00.0/1 is a leg to
>>>>    one of the hosts. If you send packets to enp6s0f0pf2, they will
>>>>    go to the child PF.
>>>>
>>>>    Note that inside the host, the PF is represented again as "Parent PF"
>>>>    and may be used to configure nested embedded switch.
>>>>
>>>>
>>>
>>> I'm not sure I understand this section. Child PF? Is this like a PF in
>>> another host? Or representing the other side of the virtual link?
>> 
>> It's both actually, at the same time.
>> 
>> 
>
>Ok. I still don't think I fully grasp this yet.
>
>
>>> Obviously this is a TODO, but how does this differ from the current
>>> port_split and port_unsplit?
>> 
>> Does not have anything to do with port splitting. This is about creating
>> a "child PF" from the section above.
>> 
>
>Hmm. Ok so this is about internal connections in the switch, then?

Yes. Take the smartnic as an example. The eswitch management is done
on the smartnic CPU. There is a devlink instance with all eswitch ports
visible as devlink ports, including one PF-type devlink port per host.
Those are the "child PFs".

Now, from the perspective of the host, there are 2 scenarios:
1) The host has a "simple dumb" PF, which just exposes 1 netdev for the
   host to run traffic over. The smartnic CPU manages the VFs/SFs and
   sees the devlink ports for them. This is a 1-level switch (merged
   switch).

2) The PF manages a sub-switch/nested switch. The devlink instance and
   its devlink ports are created on the host, and the devlink ports for
   SFs/VFs are created there. This is a multi-level eswitch. Each
   "child PF" on a parent manages a nested switch, and could in theory
   have another child PF with another nested switch.
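
The two scenarios can be sketched as a small data model. This is only
an illustration in Python, not kernel or devlink code: the Devlink and
DevlinkPort classes, the "host0" label and the port indexes are all
assumptions made up for the example. A child PF port either has no
nested instance (scenario 1, merged switch) or points at the devlink
instance that the host PF manages (scenario 2, nested switch).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DevlinkPort:
    index: int
    flavour: str
    # Only meaningful for a "child PF" port: the devlink instance on the
    # host side if that PF manages its own nested switch, else None.
    nested: Optional["Devlink"] = None


@dataclass
class Devlink:
    name: str
    ports: List[DevlinkPort] = field(default_factory=list)


def eswitch_levels(dl: Devlink) -> int:
    """Count the switch levels rooted at this devlink instance."""
    depths = [eswitch_levels(p.nested) for p in dl.ports if p.nested is not None]
    return 1 + max(depths, default=0)


# Scenario 1: the smartnic CPU sees the child PF and also its VFs/SFs.
merged = Devlink("smartnic", [
    DevlinkPort(0, "physical"),
    DevlinkPort(1, "pcipf"),   # child PF, "simple dumb" PF on the host
    DevlinkPort(2, "pcivf"),
    DevlinkPort(3, "pcisf"),
])

# Scenario 2: the VF/SF ports live in a devlink instance on the host.
host = Devlink("host0", [
    DevlinkPort(0, "virtual"),  # parent PF as seen inside the host
    DevlinkPort(1, "pcivf"),
    DevlinkPort(2, "pcisf"),
])
nested = Devlink("smartnic", [
    DevlinkPort(0, "physical"),
    DevlinkPort(1, "pcipf", nested=host),
])

levels_merged = eswitch_levels(merged)
levels_nested = eswitch_levels(nested)
```

With these definitions the merged setup reports one level and the
nested setup reports two. A host PF that itself had a child PF with
another nested switch would add a further level.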


* Re: [RFC v2] current devlink extension plan for NICs
  2020-05-18  6:52       ` Jiri Pirko
@ 2020-05-18 21:05         ` Jacob Keller
  2020-05-19  5:17           ` Parav Pandit
  2020-05-19  5:19           ` Jiri Pirko
  0 siblings, 2 replies; 19+ messages in thread
From: Jacob Keller @ 2020-05-18 21:05 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, valex, linyunsheng, lihong.yang,
	vikas.gupta, sridhar.samudrala



On 5/17/2020 11:52 PM, Jiri Pirko wrote:
> Fri, May 15, 2020 at 11:36:19PM CEST, jacob.e.keller@intel.com wrote:
>>
>>
>> On 5/15/2020 2:30 AM, Jiri Pirko wrote:
>>> Fri, May 15, 2020 at 01:52:54AM CEST, jacob.e.keller@intel.com wrote:
>>>>> $ devlink port add pci/0000.06.00.0/100 flavour pcisf pfnum 1 sfnum 10
>>>>>
>>>>
>>>> Can you clarify what sfnum means here? and why is it different from the
>>>> index? I get that the index is a unique number that identifies the port
>>>> regardless of type, so sfnum must be some sort of hardware internal
>>>> identifier?
>>>
>>> Basically pfnum, sfnum and vfnum could overlap. Index is unique within
>>> all groups together.
>>>
>>
>> Right. Index is just an identifier for which port this is.
>>

Ok, so whether or not a driver uses this internally is an implementation
detail that doesn't matter to the interface.


>>>
>>>>
>>>> When looking at this with colleagues, there was a lot of confusion about
>>>> the difference between the index and the sfnum.
>>>
>>> No confusion about index and pfnum/vfnum? They behave the same.
>>> Index is just a port handle.
>>>
>>
>> I'm less confused about the difference between index and these "nums",
>> and more so questioning what pfnum/vfnum/sfnum represent? Are they
>> similar to the vf ID that we have in the legacy SRIOV functions? I.e. a
>> hardware index?
>>
>> I don't think in general users necessarily care which "index" they get
>> upfront. They obviously very much care about the index once it's
>> selected. I do believe the interfaces should start with the capability
>> for the index to be selected automatically at creation (with the
>> optional capability to select a specific index if desired, as shown here).
>>
>> I do not think most users want to care about what to pick for this
>> number. (Just as they would not want to pick a number for the port index
>> either).
> 
> I see your point. However, I don't think it is always the right
> scenario. The "nums" are used for naming the netdevices, both the
> eswitch port representor and the actual SF (in case of SF).
> 
> I think that in a lot of use cases it is more convenient for the user
> to select the "num" on the cmdline.
> 

Agreed, based on the below statements. Basically "let users specify or
get it automatically chosen", just like with the port identifier and
with the region numbers now.
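
A minimal sketch of that "specify or auto-pick" behaviour, in Python
and purely illustrative: NumAllocator, its limit parameter and the
lowest-free-number policy are assumptions for this example, not
something devlink or any driver is known to implement.

```python
class NumAllocator:
    """Hypothetical allocator: take a requested number or pick a free one."""

    def __init__(self, limit):
        self._limit = limit   # valid numbers are 0 .. limit - 1
        self._used = set()

    def alloc(self, requested=None):
        if requested is None:
            # Automatic mode: pick the lowest number that is still free.
            for candidate in range(self._limit):
                if candidate not in self._used:
                    requested = candidate
                    break
            else:
                raise RuntimeError("no free numbers left")
        elif not 0 <= requested < self._limit:
            raise ValueError(f"{requested} is out of range")
        elif requested in self._used:
            raise ValueError(f"{requested} is already in use")
        self._used.add(requested)
        return requested

    def free(self, num):
        self._used.discard(num)


sfnums = NumAllocator(limit=16)
explicit = sfnums.alloc(10)   # user asked for sfnum 10
auto_a = sfnums.alloc()       # allocator picks, then reports the number
auto_b = sfnums.alloc()
```

The caller always gets the chosen number back, so a user who does not
care about the value can still learn it after creation, while a user
who wants deterministic netdev naming can pass it explicitly.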


Thanks for the explanations!

>>
>>>> Obviously this is a TODO, but how does this differ from the current
>>>> port_split and port_unsplit?
>>>
>>> Does not have anything to do with port splitting. This is about creating
>>> a "child PF" from the section above.
>>>
>>
>> Hmm. Ok so this is about internal connections in the switch, then?
> 
> Yes. Take the smartnic as an example. On the smartnic cpu, the
> eswitch management is being done. There's devlink instance with all
> eswitch port visible as devlink ports. One PF-type devlink port per
> host. That are the "child PFs".
> 
> Now from perspective of the host, there are 2 scenarios:
> 1) have the "simple dumb" PF, which just exposes 1 netdev for host to
>    run traffic over. smartnic cpu manages the VFs/SFs and sees the
>    devlink ports for them. This is 1 level switch - merged switch
> 
> 2) PF manages a sub-switch/nested-switch. The devlink/devlink ports are
>    created on the host and the devlink ports for SFs/VFs are created
>    there. This is multi-level eswitch. Each "child PF" on a parent
>    manages a nested switch. And could in theory have other PF child with
>    another nested switch.
> 

Ok. So in the smart NIC CPU, we'd see the primary PF and some child PFs,
and in the host system we'd see a "primary PF" that is the other end of
the associated Child PF, and might be able to manage its own subswitch.

Ok this is making more sense now.

I think I had imagined that was what subfunctions were. But really
subfunctions are a bit different, they're more similar to expanded VFs?

Thanks,
Jake


* RE: [RFC v2] current devlink extension plan for NICs
  2020-05-18 21:05         ` Jacob Keller
@ 2020-05-19  5:17           ` Parav Pandit
  2020-05-19 19:45             ` Jacob Keller
  2020-05-19  5:19           ` Jiri Pirko
  1 sibling, 1 reply; 19+ messages in thread
From: Parav Pandit @ 2020-05-19  5:17 UTC (permalink / raw)
  To: Jacob Keller, Jiri Pirko
  Cc: netdev, davem, kuba, Yuval Avnery, jgg, Saeed Mahameed, leon,
	andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, Alex Vesker, linyunsheng, lihong.yang,
	vikas.gupta, sridhar.samudrala

Hi Jake,

> From: netdev-owner@vger.kernel.org <netdev-owner@vger.kernel.org> On
> Behalf Of Jacob Keller
> 
> 
> On 5/17/2020 11:52 PM, Jiri Pirko wrote:
> > Fri, May 15, 2020 at 11:36:19PM CEST, jacob.e.keller@intel.com wrote:
> >>
> >>
> >> On 5/15/2020 2:30 AM, Jiri Pirko wrote:
> >>> Fri, May 15, 2020 at 01:52:54AM CEST, jacob.e.keller@intel.com wrote:
> >>>>> $ devlink port add pci/0000.06.00.0/100 flavour pcisf pfnum 1
> >>>>> sfnum 10
> >>>>>
> >>>>
> >>>> Can you clarify what sfnum means here? and why is it different from
> >>>> the index? I get that the index is a unique number that identifies
> >>>> the port regardless of type, so sfnum must be some sort of hardware
> >>>> internal identifier?
> >>>
> >>> Basically pfnum, sfnum and vfnum could overlap. Index is unique
> >>> within all groups together.
> >>>
> >>
> >> Right. Index is just an identifier for which port this is.
> >>
> 
> Ok, so whether or not a driver uses this internally is an implementation
> detail that doesn't matter to the interface.
> 
> 
> >>>
> >>>>
> >>>> When looking at this with colleagues, there was a lot of confusion
> >>>> about the difference between the index and the sfnum.
> >>>
> >>> No confusion about index and pfnum/vfnum? They behave the same.
> >>> Index is just a port handle.
> >>>
> >>
> >> I'm less confused about the difference between index and these
> >> "nums", and more so questioning what pfnum/vfnum/sfnum represent?
> Are
> >> they similar to the vf ID that we have in the legacy SRIOV functions?
> >> I.e. a hardware index?
> >>
> >> I don't think in general users necessarily care which "index" they
> >> get upfront. They obviously very much care about the index once it's
> >> selected. I do believe the interfaces should start with the
> >> capability for the index to be selected automatically at creation
> >> (with the optional capability to select a specific index if desired, as shown
> here).
> >>
> >> I do not think most users want to care about what to pick for this
> >> number. (Just as they would not want to pick a number for the port
> >> index either).
> >
> > I see your point. However I don't think it is always the right
> > scenario. The "nums" are used for naming of the netdevices, both the
> > eswitch port representor and the actual SF (in case of SF).
> >
> > I think that in lot of usecases is more convenient for user to select
> > the "num" on the cmdline.
> >
> 
> Agreed, based on the below statements. Basically "let users specify or get it
> automatically chosen", just like with the port identifier and with the region
> numbers now.
> 
> 
> Thanks for the explanations!
> 
> >>
> >>>> Obviously this is a TODO, but how does this differ from the current
> >>>> port_split and port_unsplit?
> >>>
> >>> Does not have anything to do with port splitting. This is about
> >>> creating a "child PF" from the section above.
> >>>
> >>
> >> Hmm. Ok so this is about internal connections in the switch, then?
> >
> > Yes. Take the smartnic as an example. On the smartnic cpu, the eswitch
> > management is being done. There's devlink instance with all eswitch
> > port visible as devlink ports. One PF-type devlink port per host. That
> > are the "child PFs".
> >
> > Now from perspective of the host, there are 2 scenarios:
> > 1) have the "simple dumb" PF, which just exposes 1 netdev for host to
> >    run traffic over. smartnic cpu manages the VFs/SFs and sees the
> >    devlink ports for them. This is 1 level switch - merged switch
> >
> > 2) PF manages a sub-switch/nested-switch. The devlink/devlink ports are
> >    created on the host and the devlink ports for SFs/VFs are created
> >    there. This is multi-level eswitch. Each "child PF" on a parent
> >    manages a nested switch. And could in theory have other PF child with
> >    another nested switch.
> >
> 
> Ok. So in the smart NIC CPU, we'd see the primary PF and some child PFs,
> and in the host system we'd see a "primary PF" that is the other end of the
> associated Child PF, and might be able to manage its own subswitch.
> 
> Ok this is making more sense now.
> 
> I think I had imagined that was what subfuntions were. But really
> subfunctions are a bit different, they're more similar to expanded VFs?
>
 
Subfunctions are more lightweight than VFs because:
1. They share the same PCI device (BAR, IRQs) as the PF/VF on which they are deployed.
2. Unlike VFs, which are enabled/disabled in bulk, subfunctions are created and deployed one at a time.

Since this RFC content is overwhelming, I expanded on the SF plumbing details in [1], in the previous RFC version.
You can replace 'devlink slice' with 'devlink port func' in [1].

[1] https://marc.info/?l=linux-netdev&m=158555928517777&w=2


* Re: [RFC v2] current devlink extension plan for NICs
  2020-05-18 21:05         ` Jacob Keller
  2020-05-19  5:17           ` Parav Pandit
@ 2020-05-19  5:19           ` Jiri Pirko
  1 sibling, 0 replies; 19+ messages in thread
From: Jiri Pirko @ 2020-05-19  5:19 UTC (permalink / raw)
  To: Jacob Keller
  Cc: netdev, davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, valex, linyunsheng, lihong.yang,
	vikas.gupta, sridhar.samudrala

Mon, May 18, 2020 at 11:05:45PM CEST, jacob.e.keller@intel.com wrote:
>
>
>On 5/17/2020 11:52 PM, Jiri Pirko wrote:
>> Fri, May 15, 2020 at 11:36:19PM CEST, jacob.e.keller@intel.com wrote:
>>>
>>>
>>> On 5/15/2020 2:30 AM, Jiri Pirko wrote:
>>>> Fri, May 15, 2020 at 01:52:54AM CEST, jacob.e.keller@intel.com wrote:
>>>>>> $ devlink port add pci/0000.06.00.0/100 flavour pcisf pfnum 1 sfnum 10
>>>>>>
>>>>>
>>>>> Can you clarify what sfnum means here? and why is it different from the
>>>>> index? I get that the index is a unique number that identifies the port
>>>>> regardless of type, so sfnum must be some sort of hardware internal
>>>>> identifier?
>>>>
>>>> Basically pfnum, sfnum and vfnum could overlap. Index is unique within
>>>> all groups together.
>>>>
>>>
>>> Right. Index is just an identifier for which port this is.
>>>
>
>Ok, so whether or not a driver uses this internally is an implementation
>detail that doesn't matter to the interface.
>
>
>>>>
>>>>>
>>>>> When looking at this with colleagues, there was a lot of confusion about
>>>>> the difference between the index and the sfnum.
>>>>
>>>> No confusion about index and pfnum/vfnum? They behave the same.
>>>> Index is just a port handle.
>>>>
>>>
>>> I'm less confused about the difference between index and these "nums",
>>> and more so questioning what pfnum/vfnum/sfnum represent? Are they
>>> similar to the vf ID that we have in the legacy SRIOV functions? I.e. a
>>> hardware index?
>>>
>>> I don't think in general users necessarily care which "index" they get
>>> upfront. They obviously very much care about the index once it's
>>> selected. I do believe the interfaces should start with the capability
>>> for the index to be selected automatically at creation (with the
>>> optional capability to select a specific index if desired, as shown here).
>>>
>>> I do not think most users want to care about what to pick for this
>>> number. (Just as they would not want to pick a number for the port index
>>> either).
>> 
>> I see your point. However I don't think it is always the right
>> scenario. The "nums" are used for naming of the netdevices, both the
>> eswitch port representor and the actual SF (in case of SF).
>> 
>> I think that in lot of usecases is more convenient for user to select
>> the "num" on the cmdline.
>> 
>
>Agreed, based on the below statements. Basically "let users specify or
>get it automatically chosen", just like with the port identifier and
>with the region numbers now.
>
>
>Thanks for the explanations!
>
>>>
>>>>> Obviously this is a TODO, but how does this differ from the current
>>>>> port_split and port_unsplit?
>>>>
>>>> Does not have anything to do with port splitting. This is about creating
>>>> a "child PF" from the section above.
>>>>
>>>
>>> Hmm. Ok so this is about internal connections in the switch, then?
>> 
>> Yes. Take the smartnic as an example. On the smartnic cpu, the
>> eswitch management is being done. There's devlink instance with all
>> eswitch port visible as devlink ports. One PF-type devlink port per
>> host. That are the "child PFs".
>> 
>> Now from perspective of the host, there are 2 scenarios:
>> 1) have the "simple dumb" PF, which just exposes 1 netdev for host to
>>    run traffic over. smartnic cpu manages the VFs/SFs and sees the
>>    devlink ports for them. This is 1 level switch - merged switch
>> 
>> 2) PF manages a sub-switch/nested-switch. The devlink/devlink ports are
>>    created on the host and the devlink ports for SFs/VFs are created
>>    there. This is multi-level eswitch. Each "child PF" on a parent
>>    manages a nested switch. And could in theory have other PF child with
>>    another nested switch.
>> 
>
>Ok. So in the smart NIC CPU, we'd see the primary PF and some child PFs,
>and in the host system we'd see a "primary PF" that is the other end of
>the associated Child PF, and might be able to manage its own subswitch.
>
>Ok this is making more sense now.
>
>I think I had imagined that was what subfuntions were. But really
>subfunctions are a bit different, they're more similar to expanded VFs?

Yeah, they are basically VFs without a separate PCI BDF. They reside on the
BDF of the PF they are created on. Basically lightweight VFs.


>
>Thanks,
>Jake


* [RFC v3] current devlink extension plan for NICs
  2020-05-01  9:14 [RFC v2] current devlink extension plan for NICs Jiri Pirko
                   ` (3 preceding siblings ...)
  2020-05-14 23:52 ` Jacob Keller
@ 2020-05-19  9:22 ` Jiri Pirko
  4 siblings, 0 replies; 19+ messages in thread
From: Jiri Pirko @ 2020-05-19  9:22 UTC (permalink / raw)
  To: netdev
  Cc: davem, kuba, parav, yuvalav, jgg, saeedm, leon,
	andrew.gospodarek, michael.chan, moshe, ayal, eranbe, vladbu,
	kliteyn, dchickles, sburla, fmanlunas, tariqt, oss-drivers,
	snelson, drivers, aelior, GR-everest-linux-l2, grygorii.strashko,
	mlxsw, idosch, markz, jacob.e.keller, valex, linyunsheng,
	lihong.yang, vikas.gupta, sridhar.samudrala

Hi all.

First, I would like to apologize for the very long email. But I think it
would be beneficial to see the whole picture of where we are going.

Currently we are working internally on several features that need
extensions of the current devlink infrastructure. I took a stab
at putting it all together in a single txt file, inlined below.

Most of the stuff is based on a new port sub-object called "func"
(called "slice" previously and "subdev" originally in Yuval's patchsets
sent some while ago).

The text describes how things should behave and provides a draft
of user facing console input/outputs. I think it is important to clear
that up before we go in and implement the devlink core and
driver pieces.

I would like to ask you to read this and comment. Especially, I would
like to ask vendors if what is described fits the needs of your
NIC/e-switch.

Please note that some of this is already implemented, but most of it
isn't (see "what needs to be implemented" section).

v2->v3:
- fixed s/SF0/SF10/ typo
- added sfnum auto alloc option
- added brief SF description
- adjusted the PF section:
   - added smartnic usecase examples of dumb/nested-switch scenarios
   - added explanation of 2 models of handling PFs (Netronome/Mellanox)

v1->v2:
- mainly move from separate slice object into port/func subobject
- couple of small fixes here and there


==================================================================
||                                                              ||
||            Overall illustration of example setup             ||
||                                                              ||
==================================================================

Note that there are 2 hosts in the picture. Host A may be the smartnic host,
Host B may be one of the hosts which gets PF. Also, you might omit
the Host B and just see Host A like an ordinary nic in a host.

Note that the PF is merged with the physical port representor.
That allows a simpler and flawless transition from legacy mode and back.
The devlink_ports and netdevs for physical ports stay in place during
the transition.

                        +-----------+
                        |phys port 2+-----------------------------------+
                        +-----------+                                   |
                        +-----------+                                   |
                        |phys port 1+---------------------------------+ |
                        +-----------+                                 | |
                                                                      | |
+------------------------------------------------------------------+  | |
|  devlink instance for the whole ASIC                   HOST A    |  | |
|                                                                  |  | |
|  pci/0000:06:00.0  (devlink dev)                                 |  | |
|  +->health reporters, params, info, sb, dpipe,                   |  | |
|  |  resource, traps                                              |  | |
|  |                                                               |  | |
|  +-+port_pci/0000:06:00.0/0+-------------------------------------|--+ |
|  | |  flavour physical pfnum 0  (phys port and pf)               |    |
|  | |  netdev enp6s0f0np1                                         |    |
|  | +->health reporters, params                                   |    |
|  |                                                               |    |
|  +-+port_pci/0000:06:00.0/1+-------------------------------------|----+
|  | |  flavour physical pfnum 1  (phys port and pf)               |
|  | |  netdev enp6s0f0np2                                         |
|  | +->health reporters, params                                   |
|  |                                                               |
|  +-+port_pci/0000:06:00.0/2+---------------------------------+   |
|  | |  flavour pcipf pfnum 2                                  |   |
|  | |  netdev enp6s0f0pf2                                     |   |
|  | +->params                                                 |   |
|  | +->func                                                   |   |
|  |                                                           |   |
|  +-+port_pci/0000:06:00.0/3+------------------------------+  |   |
|  | |  flavour pcivf pfnum 2 vfnum 0                       |  |   |
|  | |  netdev enp6s0pf2vf0                                 |  |   |
|  | +->params                                              |  |   |
|  | +-+func                                                |  |   |
|  |   +->state, rate (qos), mpgroup, hw_addr               |  |   |
|  |                                                        |  |   |
|  +-+port_pci/0000:06:00.0/4+---------------------------+  |  |   |
|  | |  flavour pcivf pfnum 0 vfnum 0                    |  |  |   |
|  | |  netdev enp6s0pf0vf0                              |  |  |   |
|  | +->params                                           |  |  |   |
|  | +-+func                                             |  |  |   |
|  |   +->state, rate (qos), mpgroup, hw_addr            |  |  |   |
|  |                                                     |  |  |   |
|  +-+port_pci/0000:06:00.0/5+------------------------+  |  |  |   |
|  | |  flavour pcisf pfnum 0 sfnum 1                 |  |  |  |   |
|  | |  netdev enp6s0pf0sf1                           |  |  |  |   |
|  | +->params                                        |  |  |  |   |
|  | +-+func                                          |  |  |  |   |
|  |   +->state, rate (qos), mpgroup, hw_addr         |  |  |  |   |
|  |                                                  |  |  |  |   |
|  +-+port_pci/0000:06:00.0/6+---------------------+  |  |  |  |   |
|    |  flavour pcivf pfnum 0 vfnum 1              |  |  |  |  |   |
|    |        (non-ethernet (IB, NVME))            |  |  |  |  |   |
|    +-+func                                       |  |  |  |  |   |
|      +->state                                    |  |  |  |  |   |
|                                                  |  |  |  |  |   |
+------------------------------------------------------------------+
                                                   |  |  |  |  |
                                                   |  |  |  |  |
                                                   |  |  |  |  |
+----------------------------------------------+   |  |  |  |  |
|  devlink instance PF (other host)    HOST B  |   |  |  |  |  |
|                                              |   |  |  |  |  |
|  pci/0000:10:00.0  (devlink dev)             |   |  |  |  |  |
|  +->health reporters, info                   |   |  |  |  |  |
|  |                                           |   |  |  |  |  |
|  +-+port_pci/0000:10:00.0/1+---------------------------------+
|    |  flavour virtual                        |   |  |  |  |
|    |  netdev enp16s0f0                       |   |  |  |  |
|    +->health reporters                       |   |  |  |  |
|                                              |   |  |  |  |
+----------------------------------------------+   |  |  |  |
                                                   |  |  |  |
+----------------------------------------------+   |  |  |  |
|  devlink instance VF (other host)    HOST B  |   |  |  |  |
|                                              |   |  |  |  |
|  pci/0000:10:00.1  (devlink dev)             |   |  |  |  |
|  +->health reporters, info                   |   |  |  |  |
|  |                                           |   |  |  |  |
|  +-+port_pci/0000:10:00.1/1+------------------------------+
|    |  flavour virtual                        |   |  |  |
|    |  netdev enp16s0f0v0                     |   |  |  |
|    +->health reporters                       |   |  |  |
|                                              |   |  |  |
+----------------------------------------------+   |  |  |
                                                   |  |  |
+----------------------------------------------+   |  |  |
|  devlink instance VF                 HOST A  |   |  |  |
|                                              |   |  |  |
|  pci/0000:06:00.1  (devlink dev)             |   |  |  |
|  +->health reporters, info                   |   |  |  |
|  |                                           |   |  |  |
|  +-+port_pci/0000:06:00.1/1+---------------------------+
|    |  flavour virtual                        |   |  |
|    |  netdev enp6s0f0v0                      |   |  |
|    +->health reporters                       |   |  |
|                                              |   |  |
+----------------------------------------------+   |  |
                                                   |  |
+----------------------------------------------+   |  |
|  devlink instance SF                 HOST A  |   |  |
|                                              |   |  |
|  pci/0000:06:00.0%sf1    (devlink dev)       |   |  |
|  +->health reporters, info                   |   |  |
|  |                                           |   |  |
|  +-+port_pci/0000:06:00.0%sf1/1+--------------------+
|    |  flavour virtual                        |   |
|    |  netdev enp6s0f0s1                      |   |
|    +->health reporters                       |   |
|                                              |   |
+----------------------------------------------+   |
                                                   |
+----------------------------------------------+   |
|  devlink instance VF                 HOST A  |   |
|                                              |   |
|  pci/0000:06:00.2  (devlink dev)+----------------+
|  +->health reporters, info                   |
|                                              |
+----------------------------------------------+




==================================================================
||                                                              ||
||                 what needs to be implemented                 ||
||                                                              ||
==================================================================

1) physical port "pfnum". When PF and physical port representor
   are merged, the instance of devlink_port representing the physical port
   and PF needs to have "pfnum" attribute to be in sync
   with other PF port representors.

2) per-port health reporters are not implemented yet.

3) devlink_port instance in PF/VF/SF flavour "virtual". In a PF/VF/SF devlink
   instance (in a VM for example), it would make sense to have a devlink_port
   instance, at least to carry the link to the netdevice name (otherwise we
   have no easy way to find out that a devlink instance and a netdevice belong
   to each other). I was thinking about the flavour name; we have to
   distinguish it from the eswitch devlink port flavours "pcipf, pcivf, pcisf".

   This was recently implemented by Parav:
commit 0a303214f8cb8e2043a03f7b727dba620e07e68d
Merge: c04d102ba56e 162add8cbae4
Author: David S. Miller <davem@davemloft.net>
Date:   Tue Mar 3 15:40:40 2020 -0800

    Merge branch 'devlink-virtual-port'

   What is missing is the "virtual" flavour for nested PF.

4) port func is not implemented yet. This is the original "vdev/subdev" concept.
   See below section "Port func user cmdline API draft".

5) SFs are not implemented.
   See below section "SF (subfunction) user cmdline API draft".

6) rate for port func is not implemented yet.
   See below section "Port func rate user cmdline API draft".

7) mpgroup for port func is not implemented yet.
   See below section "Port func mpgroup user cmdline API draft".

8) VF manual creation using devlink is not implemented yet.
   See below section "VF manual creation and activation user cmdline API draft".
 
9) PF aliasing. One devlink instance and multiple PFs sharing it as they have one
   merged e-switch.

10) Exposing maximum number of SF ports as devlink resource.            

11) Configuring more port/func capabilities (netdevice, rdma device,
    nested eswitch) and resources (irq, queues, pages). 



==================================================================
||                                                              ||
||                  Issues/open questions                       ||
||                                                              ||
==================================================================

1) "pfnum" has to be per-asic(/devlink instance), not per-host.
   That means that in smartNIC scenario, we cannot have "pfnum 0"
   for smartNIC and "pfnum 0" for host as well.
   
2) Q: for TX, RX queues reporters, should it be bound to devlink_port?
   For which flavours this might make sense?
   Most probably for flavours "physical"/"virtual".
   How about the reporters in VF/SF?

3) How the management of nested switch is handled. The PFs created dynamically
   or the ones in hosts in smartnic scenario may themselves be each a manager
   of nested e-switch. How to toggle this capability?
   During creation by a cmdline option?
   During lifetime in case the PF does not have any children (VFs/SFs)?

   It seems to make sense to have it configurable as a port/func attribute
   for PF port/func objects. The user would set this before activating
   the func.



==================================================================
||                                                              ||
||             Port func user cmdline API draft                 ||
||                                                              ||
==================================================================

Note that some of the "devlink port" attributes may be forgotten or misordered.

Funcs are created as sub-objects of ports where it makes sense to have them.
The driver takes care of that. The "func" is a handle to configure "the other
side of the wire". The original port object has port-level properties;
the new "func" sub-object, on the other hand, has device-level properties.

This is an example for HOST A from the illustration above:

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcipf pfnum 2 type eth netdev enp6s0pf2
                    func: hw_addr 10:22:33:44:55:66 state active
pci/0000:06:00.0/3: flavour pcivf pfnum 2 vfnum 0 type eth netdev enp6s0pf2vf0
                    func: hw_addr 10:22:33:44:55:77 state active
pci/0000:06:00.0/4: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
                    func: hw_addr 10:22:33:44:55:88 state active
pci/0000:06:00.0/5: flavour pcisf pfnum 0 sfnum 1 type eth netdev enp6s0pf0sf1
                    func: hw_addr 10:22:33:44:55:99 state active
pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2 type nvme
                    func: state active

There is a fixed "state" attribute with value "active". This is by
default as the VFs are always created active. In future, it is planned
to implement manual VF creation and activation, similar to what is below
described for SFs.

Now set a different MAC address for VF0 on PF2:

$ devlink port func set pci/0000:06:00.0/3 hw_addr aa:bb:cc:dd:ee:ff

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcipf pfnum 2 type eth netdev enp6s0pf2
                    func: hw_addr 10:22:33:44:55:66 state active
pci/0000:06:00.0/3: flavour pcivf pfnum 2 vfnum 0 type eth netdev enp6s0pf2vf0
                    func: hw_addr aa:bb:cc:dd:ee:ff state active
pci/0000:06:00.0/4: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
                    func: hw_addr 10:22:33:44:55:88 state active
pci/0000:06:00.0/5: flavour pcisf pfnum 0 sfnum 1 type eth netdev enp6s0pf0sf1
                    func: hw_addr 10:22:33:44:55:99 state active
pci/0000:06:00.0/6: flavour pcivf pfnum 1 vfnum 2 type nvme
                    func: state active
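
To make the port vs. func split above a bit more concrete, here is a toy
in-memory model in Python. It is only a sketch under my own assumptions:
the names Port, Func and func_set are made up for illustration and are
not devlink kernel or iproute2 code. It just shows that port-level
attributes (flavour, pfnum, vfnum) live on the port, device-level
attributes (state, hw_addr) live on the func sub-object, and that a port
without a func cannot be configured this way.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical set of device-level attributes carried by a func.
FUNC_ATTRS = {"state", "hw_addr"}


@dataclass
class Func:
    """Device-level properties of "the other side of the wire"."""
    state: str = "active"
    hw_addr: Optional[str] = None


@dataclass
class Port:
    """Port-level properties, plus an optional func sub-object."""
    flavour: str
    attrs: Dict[str, int] = field(default_factory=dict)
    func: Optional[Func] = None


def func_set(ports: Dict[str, Port], handle: str, **changes: str) -> None:
    """Toy stand-in for "devlink port func set <handle> <attr> <value>"."""
    port = ports[handle]
    if port.func is None:
        raise ValueError(f"{handle} has no func sub-object")
    for key, value in changes.items():
        if key not in FUNC_ATTRS:
            raise KeyError(f"unknown func attribute: {key}")
        setattr(port.func, key, value)


# Two of the HOST A ports from the listing above.
ports = {
    "pci/0000:06:00.0/0": Port("physical", {"pfnum": 0}),
    "pci/0000:06:00.0/3": Port("pcivf", {"pfnum": 2, "vfnum": 0},
                               Func(hw_addr="10:22:33:44:55:77")),
}

func_set(ports, "pci/0000:06:00.0/3", hw_addr="aa:bb:cc:dd:ee:ff")
```

In this model the physical port has no func, so trying func_set on
pci/0000:06:00.0/0 raises an error instead of silently doing nothing.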



==================================================================
||                                                              ||
||          SF (subfunction) user cmdline API draft             ||
||                                                              ||
==================================================================

Subfunctions are more lightweight than VFs. They share the same PCI device
(BAR, IRQs) as the PF/VF on which they are deployed. Unlike VFs, which are
enabled/disabled in bulk, subfunctions are created and deployed one at a time.

Note that some of the "devlink port" attributes may be forgotten,
misordered or omitted on purpose.

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
                    func: hw_addr 10:22:33:44:55:66 state active

There is one VF on the NIC.

Now create subfunction SF10 on PF1; the index of the port is going to be 100:

$ devlink port add pci/0000:06:00.0/100 flavour pcisf pfnum 1 sfnum 10

The devlink kernel code calls down to device driver (devlink op) and asks
it to create a SF port. Driver then instantiates the SF port in the same
way it is done for VF.


Note that it may be possible to avoid passing port index and let the
kernel assign index for you:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 1 sfnum 10

This would work in a similar way as devlink region id assignment that
is being pushed now.

Moreover, it may be possible to avoid passing sfnum as well and let
the kernel assign it for you as well:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 1
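
The three variants of "port add" above (explicit index and sfnum,
automatic index, automatic index and sfnum) could be sketched roughly
like this in Python. This is only an illustration under my own
assumptions: the port index is unique across the whole devlink instance,
the sfnum is assumed to be unique only within one pfnum, and "pick the
lowest free number" is an arbitrary policy chosen for the sketch. The
names lowest_free and port_add are hypothetical.

```python
from itertools import count
from typing import Dict, Iterable, Optional, Tuple


def lowest_free(used: Iterable[int]) -> int:
    """Return the smallest non-negative integer not present in `used`."""
    taken = set(used)
    return next(n for n in count() if n not in taken)


def port_add(ports: Dict[int, Tuple[int, int]], pfnum: int,
             index: Optional[int] = None,
             sfnum: Optional[int] = None) -> Tuple[int, int]:
    """Add an SF port; `ports` maps port index -> (pfnum, sfnum).

    Either number may be left out, in which case it is chosen automatically.
    """
    if index is None:
        index = lowest_free(ports.keys())
    elif index in ports:
        raise ValueError(f"port index {index} already in use")

    # Assumption: sfnum only has to be unique among the SFs of the same PF.
    used_sfnums = {sf for (pf, sf) in ports.values() if pf == pfnum}
    if sfnum is None:
        sfnum = lowest_free(used_sfnums)
    elif sfnum in used_sfnums:
        raise ValueError(f"sfnum {sfnum} already in use on pfnum {pfnum}")

    ports[index] = (pfnum, sfnum)
    return index, sfnum


ports: Dict[int, Tuple[int, int]] = {}
explicit = port_add(ports, pfnum=1, index=100, sfnum=10)  # both given
auto_index = port_add(ports, pfnum=1, sfnum=11)           # index chosen
auto_both = port_add(ports, pfnum=1)                      # both chosen
other_pf = port_add(ports, pfnum=2, sfnum=10)             # sfnum may overlap
```

The last call shows the point made earlier in the thread: the "nums" can
overlap between groups, while the index stays unique for all ports.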


Set the func hw_address to aa:bb:cc:aa:bb:cc:

$ devlink port func set pci/0000:06:00.0/100 hw_addr aa:bb:cc:aa:bb:cc

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
    func: hw_addr 10:22:33:44:55:66 state active
pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 type eth netdev enp6s0pf1sf10
    func: hw_addr aa:bb:cc:aa:bb:cc state inactive


Note that the SF port is created but not active. That means the
entities are created on the devlink side and the e-switch port representor
is created, but the SF device itself is not yet out there (on the same host
or a different one, depending on where the parent PF is; in this case the
same host). The user might use the e-switch port representor enp6s0pf1sf10
to do settings: putting it into a bridge, adding TC rules, etc.
It's like the cable is unplugged on the other side.


Now we activate (deploy) the SF port/func:
$ devlink port func set pci/0000:06:00.0/100 state active

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
    func: hw_addr 10:22:33:44:55:66 state active
pci/0000:06:00.0/100: flavour pcisf pfnum 1 sfnum 10 type eth netdev enp6s0pf1sf10
    func: hw_addr aa:bb:cc:aa:bb:cc state active

Upon activation, the device driver asks the device to instantiate
the actual SF device on the particular PF. It does not matter whether
that is on the same host or not.

On the other side, the PF driver instance gets the event from the device
that the particular SF was activated. It's the cue to put the device on
the bus, probe it and instantiate netdev and devlink instances for it.

For every SF a device is created on virtbus with an ID assigned by the
virtbus code. For example:
/sys/bus/virtbus/devices/mlx5_sf.1

$ cat /sys/bus/virtbus/devices/mlx5_sf.1/sfnum
10

/sys/bus/virtbus/devices/mlx5_sf.1 is a symlink to:
../../../devices/pci0000:00/0000:00:03.0/0000:06:00.0/mlx5_sf.1

New devlink instance is named using alias:
$ devlink dev show
pci/0000:06:00.0%sf10

$ devlink port show
pci/0000:06:00.0%sf10/0: flavour virtual type eth netdev enp6s0f0s10

You see that the udev used the sysfs files and symlink to assemble netdev name.

Note that this kind of aliasing is not implemented yet. It needs to be
done in the devlink code in the kernel. During SF devlink instance
creation, the parent PF device pointer and sfnum should be passed in, and
the alias dev_name is assembled from them. This ensures persistent naming
that is consistent in both the smartnic and host usecases.
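
As a rough illustration of the aliasing described above, the following
Python sketch assembles the alias from the parent PF handle and the sfnum.
The helper name sf_devlink_alias is hypothetical and the "%sf<sfnum>"
suffix is taken only from the example output above. The real logic would
live in the devlink kernel code, which does not exist yet.

```python
def sf_devlink_alias(parent_bus: str, parent_dev: str, sfnum: int) -> str:
    """Build an SF devlink handle from the parent PF handle and sfnum.

    Hypothetical helper: it only mirrors the "pci/0000:06:00.0%sf10"
    example and is not an existing devlink API.
    """
    if sfnum < 0:
        raise ValueError("sfnum must be non-negative")
    return f"{parent_bus}/{parent_dev}%sf{sfnum}"


if __name__ == "__main__":
    # Parent PF pci/0000:06:00.0 with sfnum 10, as in the example above.
    print(sf_devlink_alias("pci", "0000:06:00.0", 10))
```

Because the alias depends only on the parent PF and the sfnum, the same
name comes out regardless of the ID that the virtbus code assigns.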

If the user on smartnic or host does not want the virtbus device to get
probed automatically (for any reason), they can disable autoprobe by:

$ echo "0" > /sys/bus/virtbus/drivers_autoprobe

Autoprobe is enabled by default.


User can deactivate the SF port/func by:

$ devlink port func set pci/0000:06:00.0/100 state inactive

This eventually leads to an event being delivered to the PF driver, which
is a cue to remove the SF device from virtbus and remove all related
devlink and netdev instances.

The port/func may be activated again.

On teardown, the user may remove the SF port right away, without
deactivating it first. It is also possible to remove an already
deactivated SF. To remove the SF, the user should do:

$ devlink port del pci/0000:06:00.0/100

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
    func: hw_addr 10:22:33:44:55:66 state active



==================================================================
||                                                              ||
||   VF manual creation and activation user cmdline API draft   ||
||                                                              ||
==================================================================

To enter manual mode, the user has to turn off VF dummies creation:
$ devlink dev set pci/0000:06:00.0 vf_dummies disabled
$ devlink dev show
pci/0000:06:00.0: vf_dummies disabled

It is "enabled" by default in order not to break existing users.

By setting the "vf_dummies" attribute to "disabled", the driver
removes all dummy VFs. Only physical ports are present:

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2

Then the user is able to create them in a similar way as SFs:

$ devlink port add pci/0000:06:00.0/99 flavour pcivf pfnum 1 vfnum 8

The devlink kernel code calls down to device driver (devlink op) and asks
it to create a VF port with particular attributes. Driver then instantiates
the VF port with func.

Set the func hw_addr to aa:bb:cc:aa:bb:cc:
$ devlink port func set pci/0000:06:00.0/99 hw_addr aa:bb:cc:aa:bb:cc

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 type eth netdev enp6s0pf1vf8
    func: hw_addr aa:bb:cc:aa:bb:cc state inactive

Now we activate (deploy) the VF:
$ devlink port func set pci/0000:06:00.0/99 state active

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/99: flavour pcivf pfnum 1 vfnum 8 type eth netdev enp6s0pf1vf8
    func: hw_addr aa:bb:cc:aa:bb:cc state active



==================================================================
||                                                              ||
||                             PFs                              ||
||                                                              ||
==================================================================

There are 2 flavours of PFs:
1) "Parent" PF. That is coupled with uplink port. The flavour is:
    a) "physical" - in case the uplink port is actual port in the NIC.
    b) "virtual" - in case this "Parent" PF is actually a leg to
       upstream embedded switch.

   "Parent" PFs are the ones that may be in control of managing
   embedded switch, on any hierarchy level.

   In case of the SmartNIC scenario, this PF resides on the SmartNIC CPU.

   $ devlink port show
   pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1

   There are 2 cases possible for 2 port NIC:
   a) Same PF for both ports (Netronome model):
      one PF - 0000:06:00.0

      $ devlink port show
      pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
      pci/0000:06:00.0/1: flavour physical pfnum 0 type eth netdev enp6s0f0np2

   b) Separate per-port PF (Mellanox model):
      two PFs - 0000:06:00.0, 0000:06:00.1
      They both share the same embedded switch, the aliasing is established
      for devlink handles:
      pci/0000:06:00.0
      pci/0000:06:00.1
      are equivalents, pointing to the same devlink instance.

      $ devlink port show
      pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
      pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f1np2

2) "Child" PF. This is a leg of a PF put to the "parent" PF. It is
   represented by a port with a netdevice and func:

   In case of the SmartNIC scenario, this PF resides on the host.

   $ devlink port show
   pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
   pci/0000:06:00.0/1: flavour pcipf pfnum 2 type eth netdev enp6s0f0pf2
       func: hw_addr aa:bb:cc:aa:bb:87 state active

   This is a typical smartnic scenario. You would see this list on
   the smartnic CPU. The port pci/0000:06:00.0/1 is a leg to
   one of the hosts. If you send packets to enp6s0f0pf2, they will
   go to the "child" PF.

   Note that inside the host, the PF is represented again as "Parent PF"
   and may be used to configure nested embedded switch.


Parent PF on a host in SmartNIC usecase:

There are 2 possible scenarios in this case:
1) Have the "simple dumb" PF, which just exposes 1 netdev for host to
   run traffic over. SmartNIC CPU manages the VFs/SFs and sees the
   devlink ports for them. This is 1 level switch - merged switch.

2) PF manages a sub-switch/nested-switch. The devlink/devlink ports are
   created on the host and the devlink ports for SFs/VFs are created
   there. This is multi-level eswitch. Each "child" PF on a parent
   manages a nested switch. And could in theory have other PF "child" with
   another nested switch.

   The PF parent sets this "nested-switch management" configuration
   on the "child PF" devlink port on SmartNIC CPU side.


==================================================================
||                                                              ||
||           Port func operational state extension              ||
||                                                              ||
==================================================================

In addition to the "state" attribute that serves for the purpose
of setting the "admin state", there is "opstate" attribute added to
reflect the operational state of the port/func:


    opstate                description
    -----------            ------------
    1. attached    State when port/func device is bound to the host
                   driver. When the func device is unbound from the
                   host driver, func device exits this state and
                   enters detaching state.

    2. detaching   State when host is notified to deactivate the
                   func device and func device may be undergoing
                   detachment from host driver. When func device is
                   fully detached from the host driver, func exits
                   this state and enters detached state.

    3. detached    State when the func device is fully unbound from
                   the host driver.

func state machine:
-------------------
                               func state set inactive
                              ----<------------------<---
                              | or port delete          |
                              |                         |
  __________              ____|_______              ____|_______
 |          |  port add  |            | func state |            |
 |          |-------->---|            |------>-----|            |
 | invalid  |            |  inactive  | set active |   active   |
 |          |  port del  |            |            |            |
 |__________|--<---------|____________|            |____________|
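
The diagram above can be sketched as a small transition table in Python.
This is only an illustration: the event names and the apply() helper are
made up, and the sketch assumes the reading of the diagram in which a
port delete on an active func first takes it to inactive and then to
invalid.

```python
from enum import Enum


class FuncState(Enum):
    INVALID = "invalid"    # the port/func does not exist
    INACTIVE = "inactive"
    ACTIVE = "active"


# (current state, event) -> next state. Event names are illustrative.
TRANSITIONS = {
    (FuncState.INVALID, "port_add"): FuncState.INACTIVE,
    (FuncState.INACTIVE, "set_active"): FuncState.ACTIVE,
    (FuncState.ACTIVE, "set_inactive"): FuncState.INACTIVE,
    (FuncState.ACTIVE, "port_del"): FuncState.INACTIVE,
    (FuncState.INACTIVE, "port_del"): FuncState.INVALID,
}


def apply(state: FuncState, event: str) -> list:
    """Return the list of states visited when event is applied to state."""
    first = TRANSITIONS.get((state, event))
    if first is None:
        raise ValueError(f"{event!r} is not valid in state {state.value}")
    path = [first]
    # Assumption: deleting an active port passes through "inactive" first.
    if event == "port_del" and first is FuncState.INACTIVE:
        path.append(TRANSITIONS[(first, event)])
    return path


if __name__ == "__main__":
    print([s.value for s in apply(FuncState.ACTIVE, "port_del")])
```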



func operational state machine:
-------------------------------
  __________                ____________              ___________
 |          | func state   |            | driver bus |           |
 | invalid  |-------->-----|  detached  |------>-----| attached  |
 |          | set active   |            | probe()    |           |
 |__________|              |____________|            |___________|
                                 |                        |
                                 ^                    func set
                                 |                    inactive
                            successful detach             |
                              or pf reset                 |
                             ____|_______                 |
                            |            | driver bus     |
                 -----------| detaching  |---<-------------
                 |          |            | remove()
                 ^          |____________|
                 |   timeout      |
                 --<---------------
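
The operational state machine can be replayed in the same spirit. The
Python sketch below is a toy model only: the event names (such as
"driver_probe" or "detach_done") are invented labels for the arrows in
the diagram, and the timeout is modelled as a self-loop on detaching.

```python
# (current opstate, event) -> next opstate. Event names are illustrative.
OPSTATE_TRANSITIONS = {
    ("invalid", "func_set_active"): "detached",
    ("detached", "driver_probe"): "attached",
    ("attached", "func_set_inactive"): "detaching",
    ("detaching", "timeout"): "detaching",
    ("detaching", "detach_done"): "detached",
    ("detaching", "pf_reset"): "detached",
}


def replay(events, state="invalid"):
    """Apply events in order and return the final opstate."""
    for event in events:
        key = (state, event)
        if key not in OPSTATE_TRANSITIONS:
            raise ValueError(f"{event!r} is not valid in opstate {state}")
        state = OPSTATE_TRANSITIONS[key]
    return state


if __name__ == "__main__":
    # Activate, probe, deactivate, hit one timeout, then finish the detach.
    print(replay(["func_set_active", "driver_probe",
                  "func_set_inactive", "timeout", "detach_done"]))
```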



==================================================================
||                                                              ||
||             Port func rate user cmdline API draft            ||
||                                                              ||
==================================================================

Note that some of the "devlink port func" attributes in show commands
are omitted on purpose.

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour pcivf pfnum 0 vfnum 1 type eth netdev enp6s0pf0vf1
    func: hw_addr 10:22:33:44:55:66 state active
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 2 type eth netdev enp6s0pf0vf2
    func: hw_addr 10:22:33:44:55:77 state active
pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 3 type eth netdev enp6s0pf0vf3
    func: hw_addr 10:22:33:44:55:88 state active
pci/0000:06:00.0/4: flavour pcisf pfnum 0 sfnum 1 type eth netdev enp6s0pf0sf1
    func: hw_addr 10:22:33:44:55:99 state active

port/func object is extended with new rate object.

$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf

This shows the leafs created by default alongside the port/func objects.
No min or max tx rates were set, so their values are omitted.


Now create new node rate object:

$ devlink port func rate add pci/0000:06:00.0/somerategroup

$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node

New node rate object was created - the last line.


Now create another new node object, this time with some attributes:

$ devlink port func rate add pci/0000:06:00.0/secondrategroup min_tx_rate 20 max_tx_rate 1000

$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000

Another new node object was created - the last line. The object has min and max
tx rates set, so they are displayed after the object type.


Now set node named somerategroup min/max rate using rate object:

$ devlink port func rate set pci/0000:06:00.0/somerategroup min_tx_rate 50 max_tx_rate 5000

$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now set leaf port/func rate using rate object:

$ devlink port func rate set pci/0000:06:00.0/2 min_tx_rate 10 max_tx_rate 10000

$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now set leaf func of port with index 2 parent node using rate object:

$ devlink port func rate set pci/0000:06:00.0/2 parent somerategroup

$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now set leaf func of port with index 1 parent node using rate object:

$ devlink port func rate set pci/0000:06:00.0/1 parent somerategroup

$ devlink port func rate
pci/0000:06:00.0/1: type leaf parent somerategroup
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now unset leaf func of port with index 1 parent node using rate object:

$ devlink port func rate set pci/0000:06:00.0/1 noparent

$ devlink port func rate
pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000 parent somerategroup
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/somerategroup: type node min_tx_rate 50 max_tx_rate 5000
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000


Now delete node object:

$ devlink port func rate del pci/0000:06:00.0/somerategroup

$ devlink port func rate
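pci/0000:06:00.0/1: type leaf
pci/0000:06:00.0/2: type leaf min_tx_rate 10 max_tx_rate 10000
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/secondrategroup: type node min_tx_rate 20 max_tx_rate 1000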

Rate node object was removed and its only child pci/0000:06:00.0/2 automatically
detached.
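
To make the leaf/node relationship concrete, here is a minimal Python
model of the rate objects walked through above. It is a sketch under
assumptions, not the devlink implementation: the RateTree class and its
method names are hypothetical, deleting a node is assumed to clear only
the parent link of its children, and refusing to delete leafs is assumed
by analogy with the ib objects.

```python
class RateTree:
    """Toy model of leaf and node rate objects under one devlink handle."""

    RATE_KEYS = ("min_tx_rate", "max_tx_rate")

    def __init__(self, handle, leaf_ports):
        self.handle = handle
        # One leaf per port/func exists by default.
        self.objs = {str(p): {"type": "leaf"} for p in leaf_ports}

    def add(self, name, **rates):
        if name in self.objs:
            raise ValueError(f"{name} already exists")
        self.objs[name] = {"type": "node", **rates}

    def set(self, name, parent=None, noparent=False, **rates):
        obj = self.objs[name]
        obj.update(rates)
        if noparent:
            obj.pop("parent", None)
        elif parent is not None:
            if self.objs[parent]["type"] != "node":
                raise ValueError("parent must be a node")
            obj["parent"] = parent

    def delete(self, name):
        if self.objs[name]["type"] != "node":
            raise ValueError("only nodes can be deleted")
        del self.objs[name]
        # Children of the removed node are detached automatically.
        for obj in self.objs.values():
            if obj.get("parent") == name:
                del obj["parent"]

    def show(self):
        lines = []
        for name, obj in self.objs.items():
            fields = [f"type {obj['type']}"]
            fields += [f"{k} {obj[k]}"
                       for k in (*self.RATE_KEYS, "parent") if k in obj]
            lines.append(f"{self.handle}/{name}: " + " ".join(fields))
        return lines


if __name__ == "__main__":
    tree = RateTree("pci/0000:06:00.0", [1, 2, 3, 4])
    tree.add("somerategroup", min_tx_rate=50, max_tx_rate=5000)
    tree.set("2", min_tx_rate=10, max_tx_rate=10000, parent="somerategroup")
    tree.delete("somerategroup")
    print("\n".join(tree.show()))
```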



==================================================================
||                                                              ||
||        Port func ib grouping user cmdline API draft          ||
||                                                              ||
==================================================================

Note that some of the "devlink port func" attributes in show commands
are omitted on purpose.

The reason for this IB grouping is that the VFs inside a virtual machine
get information (via device) about which two or more VF devices should
be combined together to form one multi-port IB device. In the virtual
machine it is the driver's responsibility to set up the combined
multi-port IB devices.

Consider following setup:

$ devlink port show
pci/0000:06:00.0/0: flavour physical pfnum 0 type eth netdev enp6s0f0np1
pci/0000:06:00.0/1: flavour physical pfnum 1 type eth netdev enp6s0f0np2
pci/0000:06:00.0/2: flavour pcivf pfnum 0 vfnum 0 type eth netdev enp6s0pf0vf0
    func: hw_addr 10:22:33:44:55:66 state active
pci/0000:06:00.0/3: flavour pcivf pfnum 0 vfnum 1 type eth netdev enp6s0pf0vf1
    func: hw_addr 10:22:33:44:55:77 state active
pci/0000:06:00.0/4: flavour pcivf pfnum 1 vfnum 0 type eth netdev enp6s0pf1vf0
    func: hw_addr 10:22:33:44:55:88 state active
pci/0000:06:00.0/5: flavour pcivf pfnum 1 vfnum 1 type eth netdev enp6s0pf1vf1
    func: hw_addr 10:22:33:44:55:99 state active


Each VF/SF port/func has IB leaf object related to it:

$ devlink port func ib
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf

You see that by default, each port/func is marked as a leaf.
There is no grouping set.


User may add an ib group node by issuing the following command:

$ devlink port func ib add pci/0000:06:00.0/someibgroup1

$ devlink port func ib
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf
pci/0000:06:00.0/someibgroup1: type node

New ib node object was created - the last line.


Now set leaf func of port with index 2 parent node using ib object:

$ devlink port func ib set pci/0000:06:00.0/2 parent someibgroup1

$ devlink port func ib
pci/0000:06:00.0/2: type leaf parent someibgroup1
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf
pci/0000:06:00.0/someibgroup1: type node


Now set leaf func of port with index 5 parent node using ib object:

$ devlink port func ib set pci/0000:06:00.0/5 parent someibgroup1

$ devlink port func ib
pci/0000:06:00.0/2: type leaf parent someibgroup1
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf parent someibgroup1
pci/0000:06:00.0/someibgroup1: type node

Now you can see there are 2 leaf funcs configured to have one parent.
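
As a rough sketch of how such a configuration could be interpreted, the
Python snippet below collects the leafs that share a parent node. The
function name multiport_groups is hypothetical, and treating only nodes
with two or more leafs as multi-port candidates is an assumption based
on the "two or more VF devices" wording above.

```python
from collections import defaultdict


def multiport_groups(parents):
    """Map each ib node to its leafs, keeping nodes with >= 2 leafs.

    parents maps a leaf port handle to its parent node name, or to None
    when the leaf has no parent.
    """
    groups = defaultdict(list)
    for leaf, parent in parents.items():
        if parent is not None:
            groups[parent].append(leaf)
    return {node: sorted(leafs)
            for node, leafs in groups.items() if len(leafs) >= 2}


if __name__ == "__main__":
    # The state shown in the listing above.
    print(multiport_groups({
        "pci/0000:06:00.0/2": "someibgroup1",
        "pci/0000:06:00.0/3": None,
        "pci/0000:06:00.0/4": None,
        "pci/0000:06:00.0/5": "someibgroup1",
    }))
```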


To remove the parent link, user should issue following command:

$ devlink port func ib set pci/0000:06:00.0/5 noparent

$ devlink port func ib
pci/0000:06:00.0/2: type leaf parent someibgroup1
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf
pci/0000:06:00.0/someibgroup1: type node


Now delete node object:

$ devlink port func ib del pci/0000:06:00.0/someibgroup1
$ devlink port func ib
pci/0000:06:00.0/2: type leaf
pci/0000:06:00.0/3: type leaf
pci/0000:06:00.0/4: type leaf
pci/0000:06:00.0/5: type leaf

Node object was removed and its only child pci/0000:06:00.0/2 automatically
detached.


It is not possible to delete leafs:

$ devlink port func ib del pci/0000:06:00.0/2
devlink answers: Operation not supported



==================================================================
||                                                              ||
||            Dynamic PFs user cmdline API draft                ||
||                                                              ||
==================================================================

User might want to create another PF, in a similar way as a VF.
TODO


* Re: [RFC v2] current devlink extension plan for NICs
  2020-05-19  5:17           ` Parav Pandit
@ 2020-05-19 19:45             ` Jacob Keller
  2020-05-20 12:58               ` Parav Pandit
  0 siblings, 1 reply; 19+ messages in thread
From: Jacob Keller @ 2020-05-19 19:45 UTC (permalink / raw)
  To: Parav Pandit, Jiri Pirko
  Cc: netdev, davem, kuba, Yuval Avnery, jgg, Saeed Mahameed, leon,
	andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, Alex Vesker, linyunsheng, lihong.yang,
	vikas.gupta, sridhar.samudrala



On 5/18/2020 10:17 PM, Parav Pandit wrote:
> Hi Jake,
> 

>> Ok. So in the smart NIC CPU, we'd see the primary PF and some child PFs,
>> and in the host system we'd see a "primary PF" that is the other end of the
>> associated Child PF, and might be able to manage its own subswitch.
>>
>> Ok this is making more sense now.
>>
>> I think I had imagined that was what subfuntions were. But really
>> subfunctions are a bit different, they're more similar to expanded VFs?
>>
>  
> 1. Sub functions are more light weight than VFs because,
> 2. They share the same PCI device (BAR, IRQs) as that of PF/VF on which it is deployed.
> 3. Unlike VFs which are enabled/disabled in bulk, subfunctions are created, deployed in unit of 1.
> 
> Since this RFC content is overwhelming, I expanded the SF plumbing details more in [1] in previous RFC version.
> You can replace 'devlink slice' with 'devlink port func' in [1].
> 
> [1] https://marc.info/?l=linux-netdev&m=158555928517777&w=2
> 

Thanks! Indeed, this makes things a lot more clear to me now.

Regards,
Jake


* RE: [RFC v2] current devlink extension plan for NICs
  2020-05-19 19:45             ` Jacob Keller
@ 2020-05-20 12:58               ` Parav Pandit
  0 siblings, 0 replies; 19+ messages in thread
From: Parav Pandit @ 2020-05-20 12:58 UTC (permalink / raw)
  To: Jacob Keller, Jiri Pirko
  Cc: netdev, davem, kuba, Yuval Avnery, jgg, Saeed Mahameed, leon,
	andrew.gospodarek, michael.chan, Moshe Shemesh, Aya Levin,
	Eran Ben Elisha, Vlad Buslov, Yevgeny Kliteynik, dchickles,
	sburla, fmanlunas, Tariq Toukan, oss-drivers, snelson, drivers,
	aelior, GR-everest-linux-l2, grygorii.strashko, mlxsw,
	Ido Schimmel, Mark Zhang, Alex Vesker, linyunsheng, lihong.yang,
	vikas.gupta, sridhar.samudrala


> From: Jacob Keller <jacob.e.keller@intel.com>
> Sent: Wednesday, May 20, 2020 1:15 AM
> 
> On 5/18/2020 10:17 PM, Parav Pandit wrote:
> > Hi Jake,
> >
> 
> >> Ok. So in the smart NIC CPU, we'd see the primary PF and some child
> >> PFs, and in the host system we'd see a "primary PF" that is the other
> >> end of the associated Child PF, and might be able to manage its own
> subswitch.
> >>
> >> Ok this is making more sense now.
> >>
> >> I think I had imagined that was what subfuntions were. But really
> >> subfunctions are a bit different, they're more similar to expanded VFs?
> >>
> >
> > 1. Sub functions are more light weight than VFs because, 2. They share
> > the same PCI device (BAR, IRQs) as that of PF/VF on which it is deployed.
> > 3. Unlike VFs which are enabled/disabled in bulk, subfunctions are created,
> deployed in unit of 1.
> >
> > Since this RFC content is overwhelming, I expanded the SF plumbing details
> more in [1] in previous RFC version.
> > You can replace 'devlink slice' with 'devlink port func' in [1].
> >
> > [1] https://marc.info/?l=linux-netdev&m=158555928517777&w=2
> >
> 
> Thanks! Indeed, this makes things a lot more clear to me now.
> 
Thanks for the review Jake.

> Regards,
> Jake

Thread overview: 19+ messages
2020-05-01  9:14 [RFC v2] current devlink extension plan for NICs Jiri Pirko
2020-05-04  2:12 ` Samudrala, Sridhar
2020-05-04 11:42   ` Jiri Pirko
2020-05-10 14:45 ` Jiri Pirko
2020-05-10 16:30   ` Dave Taht
2020-05-11  5:32     ` Jiri Pirko
2020-05-11 22:37   ` Jacob Keller
2020-05-13 13:00 ` [oss-drivers] " Simon Horman
2020-05-14  6:07   ` Jiri Pirko
2020-05-14 23:52 ` Jacob Keller
2020-05-15  9:30   ` Jiri Pirko
2020-05-15 21:36     ` Jacob Keller
2020-05-18  6:52       ` Jiri Pirko
2020-05-18 21:05         ` Jacob Keller
2020-05-19  5:17           ` Parav Pandit
2020-05-19 19:45             ` Jacob Keller
2020-05-20 12:58               ` Parav Pandit
2020-05-19  5:19           ` Jiri Pirko
2020-05-19  9:22 ` [RFC v3] " Jiri Pirko
